README
NAME
    Lingua::EN::Summarize - A simple tool for summarizing bodies of English
    text.

SYNOPSIS
    use Lingua::EN::Summarize;
    my $summary = summarize( $text );                    # Easy, no? :-)
    my $summary = summarize( $text, maxlength => 500 );  # 500-byte summary
    my $summary = summarize( $text, filter => 'html' );  # Strip HTML formatting
    my $summary = summarize( $text, wrap => 75 );        # Wrap output to 75 col.

DESCRIPTION
    This is a simple module which makes an unscientific effort at
    summarizing English text. It recognizes simple patterns which look like
    statements, abridges them, and concatenates them into something vaguely
    resembling a summary. It still needs work on large bodies of text, but
    for now it does a decent job on small inputs.

    Lingua::EN::Summarize exports one function, "summarize()", which takes
    the text to summarize as its first argument, followed by any number of
    optional directives in "name => value" form. The options it accepts are:

    maxlength
        Specifies the maximum length, in bytes, of the generated summary.

    wrap
        Pretty-prints the summary by wrapping it to the number of columns
        you specify.

    filter
        Passes the text through a filter before handing it to the
        summarizer. Currently, only two filters are implemented: "html",
        which uses HTML::TreeBuilder and HTML::FormatText to strip all HTML
        formatting from a document, and "easyhtml", which quickly (and
        less accurately) strips all HTML from a document using a simple
        regular expression, for when you don't have the above-mentioned
        modules installed. An "email" filter, for converting mail and news
        messages to easily-summarizable text, is in the works for the next
        version.

    Unlike the HTML::Summarize module (which is very cool, and worth a
    look), this module treats its input as plain English text, and doesn't
    try to gather any information from the formatting. Without any cues
    from the document's format, the scheme that HTML::Summarize uses isn't
    applicable here. The current scheme goes something like this:

        Filter the text according to the user's "filter" option. Split the
        text into discrete sentences with the Text::Sentence module, then
        further split them into clauses on commas and semicolons. Keep only
        the ones that have a (subject very-simple-verb object) structure.
        Construct the summary out of the first sentences in the list,
        staying within the "maxlength" limit, or under 30% of the size of
        the original text, whichever is smaller.

    Needless to say, this is a very simple scheme, and not a universally
    effective one, but it's good enough for a first draft, and I'll bang
    on it more later. Like I said, it's not a scientific approach to the
    problem, but it's better than nothing, and I don't really need
    A.I.-quality output from it.

AUTHOR
    Dennis Taylor, <dennis@funkplanet.com>

SEE ALSO
    HTML::Summarize, Text::Sentence,
    http://www.vancouvertoday.com/city_guide/dining/reviews/barbers_modern_club.html