HTML-ContentExtractor

version 0.02

Web pages often contain clutter (such as ads, unnecessary images and extraneous links) around the body of an article that distracts a user from the actual content. This module reduces the noise in web pages and thereby identifies the content-rich regions.

A web page is first parsed by an HTML parser, which corrects the markup and creates a DOM (Document Object Model) tree. A depth-first traversal of the DOM tree then identifies and removes the noise nodes, leaving the main content. Three kinds of nodes are removed. Useless nodes (script, style, etc.) are removed. Container nodes (table, div, etc.) whose link/text ratio is higher than a threshold are removed, where the link/text ratio is the ratio of the number of links to the number of non-linked words. Nodes that contain any string in the predefined spam string list are removed.
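
The pruning rules described above can be sketched in a few lines. The example below is an illustration only, written in Python with the standard library. It is not the module's code or its API. The tag lists, the spam string, the threshold value and all function names are assumptions made for the example. Python's html.parser does not repair broken markup the way the parser described above does, so the sketch assumes reasonably well-formed input.

```python
from html.parser import HTMLParser

# All of the values below are assumptions chosen for this example.
USELESS_TAGS = {"script", "style"}
CONTAINER_TAGS = {"table", "div"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
SPAM_STRINGS = ["All rights reserved"]
RATIO_THRESHOLD = 0.25


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # Node objects or plain text strings


class TreeBuilder(HTMLParser):
    """Builds a naive tree; it does not correct malformed markup."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def text_of(node):
    return "".join(c if isinstance(c, str) else text_of(c) for c in node.children)


def count_links_and_words(node, inside_link=False):
    """Return (number of links, number of words outside links) for a subtree."""
    links = words = 0
    for child in node.children:
        if isinstance(child, str):
            if not inside_link:
                words += len(child.split())
        else:
            is_link = child.tag == "a"
            sub_links, sub_words = count_links_and_words(child, inside_link or is_link)
            links += sub_links + (1 if is_link else 0)
            words += sub_words
    return links, words


def is_noise(node):
    if node.tag in USELESS_TAGS:
        return True
    if any(s in text_of(node) for s in SPAM_STRINGS):
        return True
    if node.tag in CONTAINER_TAGS:
        links, words = count_links_and_words(node)
        if links > 0:
            ratio = links / words if words else float("inf")
            return ratio > RATIO_THRESHOLD
    return False


def prune(node):
    """Depth-first traversal: prune each child's subtree, then test the child."""
    kept = []
    for child in node.children:
        if isinstance(child, str):
            kept.append(child)
            continue
        prune(child)
        if not is_noise(child):
            kept.append(child)
    node.children = kept


def extract_text(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root)
    return " ".join(text_of(builder.root).split())


SAMPLE_HTML = """<html><head><style>p {color: red}</style></head><body>
<div><a href="/a">Home</a> <a href="/b">News</a> <a href="/c">Sports</a></div>
<div><p>The quick brown fox <a href="/fox">jumps</a> over the lazy dog.</p></div>
<div><p>Copyright Example Corp. All rights reserved.</p></div>
<script>var x = 1;</script>
</body></html>"""

print(extract_text(SAMPLE_HTML))
```

In this sketch the children are pruned before their parent is tested, so a link-heavy navigation block or a paragraph carrying a spam string is dropped on its own and does not take the surrounding article container with it. Whether the module applies its rules in the same order is not stated here.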

INSTALLATION

To install this module, run the following commands:

    perl Makefile.PL
    make
    make test
    make install


SUPPORT AND DOCUMENTATION

After installing, you can find documentation for this module with the perldoc command.

    perldoc HTML::ContentExtractor

You can also look for information at:

    Search CPAN:
        http://search.cpan.org/dist/HTML-ContentExtractor

    CPAN Request Tracker:
        http://rt.cpan.org/NoAuth/Bugs.html?Dist=HTML-ContentExtractor

    AnnoCPAN, annotated CPAN documentation:
        http://annocpan.org/dist/HTML-ContentExtractor

    CPAN Ratings:
        http://cpanratings.perl.org/d/HTML-ContentExtractor

COPYRIGHT AND LICENCE

Copyright (C) 2007 Zhang Jun

This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.