Name          Date         Size      #Lines  LOC
Changes       20-Feb-2001  579       15      11
Filters.pm    20-Feb-2001  976       69      25
MANIFEST      20-Feb-2001  68        8       7
Makefile.PL   20-Feb-2001  500       17      14
README        20-Feb-2001  3.1 KiB   68      55
Summarize.pm  20-Feb-2001  5.2 KiB   174     60
test.pl       20-Feb-2001  24.4 KiB  530     412

README

NAME
    Lingua::EN::Summarize - A simple tool for summarizing bodies of English
    text.

SYNOPSIS
      use Lingua::EN::Summarize;
      my $summary = summarize( $text );                    # Easy, no? :-)
      my $summary = summarize( $text, maxlength => 500 );  # 500-byte summary
      my $summary = summarize( $text, filter => 'html' );  # Strip HTML formatting
      my $summary = summarize( $text, wrap => 75 );        # Wrap output to 75 col.

DESCRIPTION
    This is a simple module which makes an unscientific effort at
    summarizing English text. It recognizes simple patterns which look like
    statements, abridges them, and concatenates them into something vaguely
    resembling a summary. It needs more work on large bodies of text, but it
    seems to have a decent effect on small inputs at the moment.

    Lingua::EN::Summarize exports one function, "summarize()", which takes
    the text to summarize as its first argument, followed by any number of
    optional directives in "name => value" form. The supported options are:

    maxlength
        Specifies the maximum length, in bytes, of the generated summary.

    wrap
        Pretty-prints the summary by wrapping it to the number of columns
        you specify.

    filter
        Passes the text through a filter before handing it to the
        summarizer. Currently, only two filters are implemented: "html",
        which uses HTML::TreeBuilder and HTML::FormatText to strip all HTML
        formatting from a document, and "easyhtml", which quickly (and
        less accurately) strips all HTML from a document using a simple
        regular expression, for use when you don't have the above-mentioned
        modules. An "email" filter, for converting mail and news messages to
        easily-summarizable text, is in the works for the next version.
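
    To give a feel for what a regex-only filter in the spirit of "easyhtml"
    might look like, here is a minimal sketch. The function name
    strip_html_quick and the exact patterns are assumptions made for this
    example; they are not taken from Filters.pm, and the real filter may
    work differently.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::EN::Summarize): a crude,
# regex-only HTML stripper. Quick, but easily fooled by unusual markup.
sub strip_html_quick {
    my ($html) = @_;
    $html =~ s/<!--.*?-->//gs;                              # drop comments
    $html =~ s/<(script|style)\b[^>]*>.*?<\/\1\s*>//gis;    # drop script/style bodies
    $html =~ s/<\/?(?:p|br|div|li|h[1-6]|tr|td)\b[^>]*>/ /gi;  # block-level tags become spaces
    $html =~ s/<[^>]*>//g;                                  # remove every remaining tag
    $html =~ s/&nbsp;/ /g;                                  # decode a few common entities only
    $html =~ s/&lt;/</g;
    $html =~ s/&gt;/>/g;
    $html =~ s/&amp;/&/g;                                   # decode &amp; last
    $html =~ s/\s+/ /g;                                     # collapse runs of whitespace
    $html =~ s/^\s+|\s+$//g;                                # trim both ends
    return $html;
}

print strip_html_quick('<p>Hello, <b>world</b>!</p>'), "\n";   # prints "Hello, world!"
```

    A parser-based approach such as the "html" filter copes with nested
    or malformed markup far better; a sketch like this one only trades
    accuracy for having no dependencies.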

    Unlike the HTML::Summarize module (which is very cool, and worth a
    look), this module treats its input as plain English text and doesn't
    try to gather any information from the formatting. Without any cues
    from the document's format, the scheme that HTML::Summarize uses isn't
    applicable here. The current scheme goes something like this:

    "Filter the text according to the user's "filter" option. Split the text
    into discrete sentences with the Text::Sentence module, then further
    split them into clauses on commas and semicolons. Keep only the ones
    that have a (subject very-simple-verb object) structure. Construct the
    summary out of the first sentences in the list, staying within the
    "maxlength" limit, or under 30% of the size of the original text,
    whichever is smaller."
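
    The scheme above can be sketched in a few lines of plain Perl. This is
    an illustration under stated assumptions, not the code in Summarize.pm:
    the name sketch_summarize is made up, the sentence splitter is a naive
    regex standing in for Text::Sentence, and the short verb list is chosen
    only for the example.

```perl
use strict;
use warnings;

# Hypothetical sketch of the described scheme; every name here is
# illustrative and nothing is copied from Summarize.pm.
sub sketch_summarize {
    my ($text, %opt) = @_;

    # Budget: the smaller of "maxlength" (if given) and 30% of the input.
    my $budget = int(0.3 * length($text));
    $budget = $opt{maxlength}
        if defined $opt{maxlength} && $opt{maxlength} < $budget;

    # Naive sentence split after ., ! or ? (stand-in for Text::Sentence).
    my @sentences = split /(?<=[.!?])\s+/, $text;

    # Split each sentence into clauses on commas and semicolons.
    my @clauses = map { split /\s*[,;]\s*/, $_ } @sentences;

    # Keep clauses that look like "subject simple-verb object".
    # The verb list is an assumption made for this example.
    my $verb = qr/\b(?:is|are|was|were|has|have|had)\b/i;
    my @keep = grep { /\w\s+$verb\s+\w/ } @clauses;

    # Append clauses in order until the next one would exceed the budget.
    my $summary = '';
    for my $clause (@keep) {
        $clause =~ s/^\s+|\s+$//g;     # trim surrounding whitespace
        $clause =~ s/[.!?]+$//;        # drop trailing punctuation
        my $piece = ucfirst($clause) . '.';
        my $next  = length($summary) ? "$summary $piece" : $piece;
        last if length($next) > $budget;
        $summary = $next;
    }
    return $summary;
}

my $text = 'The cat is a small animal, and it sleeps all day. '
         . 'Dogs are loyal companions; they bark loudly. '
         . 'Nobody knows why.';
print sketch_summarize($text), "\n";
print sketch_summarize($text, maxlength => 40), "\n";
```

    Because the budget is capped at 30% of the input, short inputs yield
    very short summaries (possibly empty ones), which matches the
    "whichever is smaller" rule in the description.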

    Needless to say, this is a very simple scheme that isn't universally
    effective, but it's good enough for a first draft, and I'll bang on it
    more later. Like I said, it's not a scientific approach to the problem,
    but it's better than nothing, and I don't really need A.I.-quality
    output from it.

AUTHOR
    Dennis Taylor, <dennis@funkplanet.com>

SEE ALSO
    HTML::Summarize, Text::Sentence,
    http://www.vancouvertoday.com/city_guide/dining/reviews/barbers_modern_club.html