HTML-ContentExtractor

version 0.02

Web pages often contain clutter (such as ads, unnecessary images and extraneous links) around the body of an article, which distracts the reader from the actual content. This module reduces the noise in web pages and thereby identifies the content-rich regions.

A web page is first parsed by an HTML parser, which corrects the markup and creates a DOM (Document Object Model) tree. The tree is then navigated with a depth-first traversal, during which noise nodes are identified and removed, leaving the main content. Three kinds of nodes are removed: useless nodes (script, style, etc.); container nodes (table, div, etc.) whose link/text ratio is higher than a threshold, where the link/text ratio is the number of links divided by the number of non-linked words; and nodes that contain any string in the predefined spam string list.
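
The pruning rules described above can be illustrated with a small, self-contained sketch. This is not the module's own code. It assumes a toy DOM built from plain Perl hashes instead of a real HTML parser, and the threshold value, the spam list, the choice to prune children before testing their parent, and the helper names (link_stats, text_of, prune) are all hypothetical choices made for illustration.

```perl
use strict;
use warnings;

# Hypothetical settings: the README mentions a threshold and a spam
# string list but gives no values, so these are illustrative only.
my $THRESHOLD = 0.5;
my @SPAM      = ('Advertisement');
my %USELESS   = map { $_ => 1 } qw(script style);
my %CONTAINER = map { $_ => 1 } qw(table div);

# Return (number of links, number of words outside links) under $node.
# A text node is a plain string; an element is { tag, children }.
sub link_stats {
    my ($node) = @_;
    unless (ref $node) {
        my @words = split ' ', $node;
        return (0, scalar @words);
    }
    return (1, 0) if $node->{tag} eq 'a';    # linked words are not counted
    my ($links, $words) = (0, 0);
    for my $child (@{ $node->{children} }) {
        my ($l, $w) = link_stats($child);
        $links += $l;
        $words += $w;
    }
    return ($links, $words);
}

# Concatenate all text below a node.
sub text_of {
    my ($node) = @_;
    return $node unless ref $node;
    return join ' ', map { text_of($_) } @{ $node->{children} };
}

# Depth-first pruning (assumption: children are cleaned before the parent
# is tested). Returns the cleaned node, or an empty list for a noise node.
sub prune {
    my ($node) = @_;
    return $node unless ref $node;             # text nodes are kept as-is
    return if $USELESS{ $node->{tag} };        # rule 1: script, style, ...
    my @kept  = map { prune($_) } @{ $node->{children} };
    my $clean = { tag => $node->{tag}, children => \@kept };
    if ($CONTAINER{ $node->{tag} }) {          # rule 2: link/text ratio
        my ($links, $words) = link_stats($clean);
        # links/words > threshold, written without dividing by zero
        return if $links > $THRESHOLD * $words;
    }
    my $text = text_of($clean);                # rule 3: spam strings
    return if grep { index($text, $_) >= 0 } @SPAM;
    return $clean;
}

my $page = {
    tag => 'body', children => [
        { tag => 'script', children => ['track();'] },
        { tag => 'div', children => [
            { tag => 'a', children => ['Home'] },
            { tag => 'a', children => ['News'] },
            'and more',
        ] },
        { tag => 'div', children => [
            { tag => 'p', children => ['The actual article text goes here.'] },
        ] },
        { tag => 'div', children => ['Advertisement: buy now'] },
    ],
};

print text_of(prune($page)), "\n";   # prints: The actual article text goes here.
```

In this toy page the script node is dropped outright, the navigation div is dropped because it has 2 links against 2 non-linked words (a ratio of 1.0, above the assumed 0.5), and the last div is dropped because it contains an assumed spam string. Only the article paragraph survives.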

INSTALLATION

To install this module, run the following commands:

    perl Makefile.PL
    make
    make test
    make install

SUPPORT AND DOCUMENTATION

After installing, you can find documentation for this module with the perldoc command.

    perldoc HTML::ContentExtractor

You can also look for information at:

    Search CPAN:
        http://search.cpan.org/dist/HTML-ContentExtractor

    CPAN Request Tracker:
        http://rt.cpan.org/NoAuth/Bugs.html?Dist=HTML-ContentExtractor

    AnnoCPAN, annotated CPAN documentation:
        http://annocpan.org/dist/HTML-ContentExtractor

    CPAN Ratings:
        http://cpanratings.perl.org/d/HTML-ContentExtractor

COPYRIGHT AND LICENCE

Copyright (C) 2007 Zhang Jun

This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.