| Name | Date | Size | #Lines | LOC |
|---|---|---|---|---|
| .. | 03-May-2022 | - | | |
| lib/HTML/ | 03-May-2022 | - | 443 | 296 |
| t/ | 03-May-2022 | - | 62 | 47 |
| Build.PL | 10-Mar-2015 | 301 | 13 | 4 |
| Changes | 10-Mar-2015 | 952 | 28 | 26 |
| LICENSE | 10-Mar-2015 | 18 KiB | 379 | 292 |
| MANIFEST | 10-Mar-2015 | 271 | 21 | 21 |
| META.json | 10-Mar-2015 | 2.1 KiB | 85 | 84 |
| META.yml | 10-Mar-2015 | 1.1 KiB | 47 | 46 |
| Makefile.PL | 10-Mar-2015 | 299 | 16 | 11 |
| README.md | 10-Mar-2015 | 2.3 KiB | 86 | 52 |
| cpanfile | 10-Mar-2015 | 215 | 11 | 9 |
| minil.toml | 10-Mar-2015 | 106 | 6 | 4 |

README.md

[![Build Status](https://travis-ci.org/tarao/perl5-HTML-ExtractContent.svg?branch=master)](https://travis-ci.org/tarao/perl5-HTML-ExtractContent)
# NAME

HTML::ExtractContent - An HTML content extractor with scoring heuristics

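# SYNOPSIS

    use strict;
    use warnings;
    use HTML::ExtractContent;
    use LWP::UserAgent;

    # Fetch the page and stop early if the request failed.
    my $agent = LWP::UserAgent->new;
    my $res = $agent->get('http://www.example.com/');
    die $res->status_line unless $res->is_success;

    # decoded_content returns a character string, as extract expects.
    my $extractor = HTML::ExtractContent->new;
    $extractor->extract($res->decoded_content);
    print $extractor->as_text;
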
# DESCRIPTION

HTML::ExtractContent is a module for extracting content from HTML using scoring
heuristics. It guesses which block of HTML looks like content based on scores
that depend on the number of punctuation marks and the length of the non-tag
text. It also guesses whether the content ends in that block or continues into
the next block.

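To make the idea of scoring concrete, here is a deliberately simplified sketch of a block score built from the two signals named above: the number of punctuation marks and the length of the non-tag text. The function name `score_block`, the weight of 10 per punctuation mark, and the punctuation set are all assumptions for illustration; the module's actual heuristics are not reproduced here.

```perl
use strict;
use warnings;

# Hypothetical scoring function (NOT the module's real algorithm):
# score = length of non-tag text + weight * number of punctuation marks.
sub score_block {
    my ($block, $weight) = @_;
    $weight = 10 unless defined $weight;    # assumed weight

    (my $text = $block) =~ s/<[^>]*>//g;    # drop tags, keep text
    $text =~ s/\s+/ /g;                     # collapse whitespace
    $text =~ s/^ | $//g;                    # trim

    my $punct = () = $text =~ /[.,!?;:]/g;  # count punctuation marks
    return length($text) + $weight * $punct;
}

# Made-up blocks: a navigation list and a prose paragraph.
my @blocks = (
    '<ul><li><a href="/">Home</a></li><li><a href="/about">About</a></li></ul>',
    '<p>This paragraph has sentences. It has commas, too, and reads like content.</p>',
);

# Pick the block with the highest score.
my ($best) = sort { score_block($b) <=> score_block($a) } @blocks;
print "best block: $best\n";
```

With these made-up inputs the prose paragraph outscores the navigation list, since link-only blocks contribute little text and no punctuation.
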
# METHODS

- new

        $extractor = HTML::ExtractContent->new;

    Creates a new HTML::ExtractContent instance.

- extract

        $extractor->extract($html);

    Extracts content from `$html`.
    `$html` must have its UTF-8 flag on; in practice, pass a decoded
    character string rather than raw bytes.

- as\_text

        $extractor->extract($html)->as_text;

    Returns the extracted content as plain text. All tags are removed.

- as\_html

        $extractor->extract($html)->as_html;

    Returns the extracted content as HTML text.
    Note that the returned text is neither fully tagged nor valid HTML.
    It doesn't contain tags such as `<html>`, and it may have block tags that
    are not closed, or closed but not opened.
    This method is intended for cases where, for example, you need to analyse
    the link tags in the text.

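As an illustration of the link analysis mentioned for `as_html`, the sketch below collects `href` values from a fragment with unbalanced block tags. The fragment is a made-up stand-in for `as_html` output, `extract_links` is a hypothetical helper, and the regular expression only handles quoted `href` attributes; a real HTML parser would be more robust.

```perl
use strict;
use warnings;

# Hypothetical helper: collect quoted href values from <a> tags. A regex is
# enough for a sketch and does not care whether block tags are balanced.
sub extract_links {
    my ($fragment) = @_;
    my @links;
    while ($fragment =~ /<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1/gis) {
        push @links, $2;
    }
    return @links;
}

# Made-up fragment standing in for as_html output: the <div> is never
# closed and the trailing </td> was never opened.
my $fragment = <<'HTML';
<div><p>See the <a href="http://www.example.com/a">first</a> and
<a class="ext" href='http://www.example.com/b'>second</a> page.</p></td>
HTML

print "$_\n" for extract_links($fragment);
```
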
# ACKNOWLEDGMENT

Hiromichi Kishi contributed to the development of this module
as a pair-programming partner.

The implementation of this module is based on the Ruby module ExtractContent by
Nakatani Shuyo.

# AUTHOR

INA Lintaro <tarao at cpan.org>

# COPYRIGHT

Copyright (C) 2008 INA Lintaro / Hatena. All rights reserved.

## Copyright of the original implementation

Copyright (c) 2007/2008 Nakatani Shuyo / Cybozu Labs Inc. All rights reserved.

# LICENCE

This library is free software; you can redistribute it and/or modify it under
the same terms as Perl itself.

# SEE ALSO

[http://rubyforge.org/projects/extractcontent/](http://rubyforge.org/projects/extractcontent/)