README

NAME
    HTML::HTML5::Parser - parse HTML reliably

SYNOPSIS
      use HTML::HTML5::Parser;

      my $parser = HTML::HTML5::Parser->new;
      my $doc    = $parser->parse_string(<<'EOT');
      <!doctype html>
      <title>Foo</title>
      <p><b><i>Foo</b> bar</i>.
      <p>Baz</br>Quux.
      EOT

      my $fdoc   = $parser->parse_file( $html_file_name );
      my $fhdoc  = $parser->parse_fh( $html_file_handle );

DESCRIPTION
    This library is substantially the same as the non-CPAN module
    Whatpm::HTML. Changes include:

    *       Provides an XML::LibXML-like DOM interface. If you usually use
            XML::LibXML's DOM parser, this should be a drop-in solution for
            tag soup HTML.

    *       Constructs an XML::LibXML::Document as the result of parsing.

    *       Via bundling and modifications, removed external dependencies on
            non-CPAN packages.
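
    As a minimal sketch of what the DOM interface means in practice, the
    following parses the SYNOPSIS document and then inspects the result
    with ordinary XML::LibXML node methods (`getElementsByTagName`,
    `textContent`, `toString`). The variable names are only illustrative.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;
my $doc    = $parser->parse_string(<<'EOT');
<!doctype html>
<title>Foo</title>
<p><b><i>Foo</b> bar</i>.
<p>Baz</br>Quux.
EOT

# The result is an XML::LibXML::Document, so the usual
# XML::LibXML node methods are available on it.
my ($title)    = $doc->getElementsByTagName('title');
my @paragraphs = $doc->getElementsByTagName('p');

print "Document class: ", ref($doc), "\n";
print "Title: ", $title->textContent, "\n";
print "Paragraphs found: ", scalar(@paragraphs), "\n";

# Serialise the repaired tree.
print $doc->toString, "\n";
```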

  Constructor
    `new`
              $parser = HTML::HTML5::Parser->new;
              # or
              $parser = HTML::HTML5::Parser->new(no_cache => 1);

            The constructor does nothing interesting besides take one flag
            argument, `no_cache => 1`, to disable the global element metadata
            cache. Disabling the cache is handy for conserving memory if you
            parse a large number of documents; however, methods such as
            `source_line` will then not work as class methods, and must be
            called on an instance of this parser.
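
            A short sketch of the cache-free usage described above: keep
            the parser object that produced the document and call
            `source_line` on that same object. The sample markup is made
            up for illustration.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

# Disable the global element metadata cache.
my $parser = HTML::HTML5::Parser->new(no_cache => 1);
my $doc    = $parser->parse_string("<!doctype html>\n<title>Foo</title>\n<p>Bar\n");

my ($p) = $doc->getElementsByTagName('p');

# Call source_line on the parser instance that parsed the document,
# not as a class method.
my $line = $parser->source_line($p);
print "The <p> element starts on line ",
    (defined $line ? $line : 'unknown'), "\n";
```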

  XML::LibXML-Compatible Methods
    `parse_file`, `parse_html_file`
          $doc = $parser->parse_file( $html_file_name [,\%opts] );

        This function parses an HTML document from a file or network;
        $html_file_name can be either a filename or a URL.

        Options include 'encoding' to indicate file encoding (e.g. 'utf-8')
        and 'user_agent' which should be a blessed `LWP::UserAgent` (or
        HTTP::Tiny) object to be used when retrieving URLs.

        If requesting a URL and the response Content-Type header indicates an
        XML-based media type (such as XHTML), XML::LibXML::Parser will be used
        automatically (instead of the tag soup parser). The XML parser can be
        told to use a DTD catalogue by setting the option 'xml_catalogue' to
        the filename of the catalogue.

        HTML (tag soup) parsing can be forced using the option 'force_html',
        even when an XML media type is returned. If an options hashref was
        passed, parse_file will set $options->{'parser_used'} to the name of
        the class used to parse the URL, to allow the calling code to
        double-check which parser was used afterwards.

        If an options hashref was passed, parse_file will set
        $options->{'response'} to the HTTP::Response object obtained by
        retrieving the URI.
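
        The options hashref can be sketched as follows. To stay
        self-contained, the example parses a temporary local file rather
        than a URL; the file contents and variable names are assumptions
        made for illustration only.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use HTML::HTML5::Parser;

# Write a small document to a temporary file.
my ($fh, $filename) = tempfile(SUFFIX => '.html', UNLINK => 1);
print {$fh} "<!doctype html>\n<title>Foo</title>\n<p>Hello\n";
close $fh;

my $parser = HTML::HTML5::Parser->new;

# Pass a hashref so that parse_file can report back through it.
my %opts = (force_html => 1);
my $doc  = $parser->parse_file($filename, \%opts);

# parse_file records the parser class it used in the same hashref.
print "Parsed with: ",
    (defined $opts{parser_used} ? $opts{parser_used} : 'unknown'), "\n";

# For a URL, a user agent object could be supplied in the same way:
#   $parser->parse_file($url, { user_agent => $ua });
```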

    `parse_fh`, `parse_html_fh`
          $doc = $parser->parse_fh( $io_fh [,\%opts] );

        `parse_fh()` parses an IOREF or a subclass of `IO::Handle`.

        Options include 'encoding' to indicate file encoding (e.g. 'utf-8').
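
        A minimal sketch of `parse_fh()` with an ordinary lexical file
        handle; the temporary file exists only to make the example
        runnable on its own.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use HTML::HTML5::Parser;

# Create a file to read from.
my ($out, $filename) = tempfile(SUFFIX => '.html', UNLINK => 1);
print {$out} "<!doctype html>\n<title>Foo</title>\n<p>Hello\n";
close $out;

# Open it for reading and hand the handle to the parser.
open my $in, '<', $filename or die "Cannot open $filename: $!";
my $parser = HTML::HTML5::Parser->new;
my $doc    = $parser->parse_fh($in);
close $in;

my ($title) = $doc->getElementsByTagName('title');
print "Title: ", $title->textContent, "\n";
```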

    `parse_string`, `parse_html_string`
          $doc = $parser->parse_string( $html_string [,\%opts] );

        This function is similar to `parse_fh()`, but it parses an HTML
        document that is available as a single string in memory.

        Options include 'encoding' to indicate file encoding (e.g. 'utf-8').

    `load_xml`, `load_html`
        Wrappers for the parse_* functions. These should be roughly compatible
        with the equivalently named functions in XML::LibXML.

        Note that `load_xml` first attempts to parse as real XML, falling back
        to HTML5 parsing; `load_html` just goes straight for HTML5.
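
        A hedged sketch of `load_html`. This README does not list the
        wrapper's arguments, so the `string => ...` named argument below is
        an assumption borrowed from XML::LibXML's function of the same
        name.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;

# Assumption: load_html takes XML::LibXML-style named arguments
# (string, location or IO). Only "string" is used here.
my $doc = $parser->load_html(
    string => "<!doctype html>\n<title>Foo</title>\n<p>Bar\n",
);

print $doc->toString, "\n";
```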

    `parse_balanced_chunk`
          $fragment = $parser->parse_balanced_chunk( $string [,\%opts] );

        This method is roughly equivalent to XML::LibXML's method of the same
        name, but unlike XML::LibXML, and despite its name, it does not
        require the chunk to be "balanced". This method is somewhat black
        magic, but should work, and do the proper thing in most cases. Of
        course, the proper thing might not be what you'd expect! I'll try to
        keep this explanation as brief as possible...

        Consider the following string:

          <b>Hello</b></td></tr> <i>World</i>

        What is the proper way to parse that? If it were found in a document
        like this:

          <html>
            <head><title>X</title></head>
            <body>
              <div>
                <b>Hello</b></td></tr> <i>World</i>
              </div>
            </body>
          </html>

        Then the document would end up equivalent to the following XHTML:

          <html>
            <head><title>X</title></head>
            <body>
              <div>
                <b>Hello</b> <i>World</i>
              </div>
            </body>
          </html>

        The superfluous `</td></tr>` is simply ignored. However, if it were
        found in a document like this:

          <html>
            <head><title>X</title></head>
            <body>
              <table><tbody><tr><td>
                <b>Hello</b></td></tr> <i>World</i>
              </td></tr></tbody></table>
            </body>
          </html>

        Then the result would be:

          <html>
            <head><title>X</title></head>
            <body>
              <i>World</i>
              <table><tbody><tr><td>
                <b>Hello</b></td></tr>
              </tbody></table>
            </body>
          </html>

        Yes, `<i>World</i>` gets hoisted up before the `<table>`. This is
        weird, I know, but it's how browsers do it in real life.

        So what should:

          $string   = q{<b>Hello</b></td></tr> <i>World</i>};
          $fragment = $parser->parse_balanced_chunk($string);

        actually return? Well, you can choose...

          $string = q{<b>Hello</b></td></tr> <i>World</i>};

          $frag1  = $parser->parse_balanced_chunk($string, {within=>'div'});
          say $frag1->toString; # <b>Hello</b> <i>World</i>

          $frag2  = $parser->parse_balanced_chunk($string, {within=>'td'});
          say $frag2->toString; # <i>World</i><b>Hello</b>

        If you don't pass a "within" option, then the chunk is parsed as if it
        were within a `<div>` element. This is often the most sensible option.
        If you pass something like `{ within => "foobar" }` where "foobar" is
        not a real HTML element name (as found in the HTML5 spec), then this
        method will croak; if you pass the name of a void element (e.g. "br"
        or "meta") then this method will croak; there are a handful of other
        unsupported elements which will croak (namely: "noscript", "noembed",
        "noframes").
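
        Because unsupported "within" values croak, calling code that takes
        the element name from elsewhere may want to trap the exception. A
        minimal sketch using `eval`; the `%accepted` bookkeeping hash is
        purely illustrative.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;
my $string = q{<b>Hello</b></td></tr> <i>World</i>};

my %accepted;
for my $within (qw(div td foobar br)) {
    # eval traps the exception raised for unknown or void element names.
    my $fragment = eval {
        $parser->parse_balanced_chunk($string, { within => $within });
    };
    $accepted{$within} = defined $fragment ? 1 : 0;

    if (defined $fragment) {
        print "within => '$within': ", $fragment->toString, "\n";
    }
    else {
        print "within => '$within' was rejected: $@";
    }
}
```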

        Note that the second time around, although we parsed the string "as if
        it were within a `<td>` element", the `<i>World</i>` bit did not
        strictly end up within the `<td>` element (not even within the
        `<table>` element!) yet it still gets returned. We'll call things such
        as this "outliers". There is a "force_within" option which tells
        parse_balanced_chunk to ignore outliers:

          $frag3  = $parser->parse_balanced_chunk($string,
                                                  {force_within=>'td'});
          say $frag3->toString; # <b>Hello</b>

        There is a boolean option "mark_outliers" which marks each outlier
        with an attribute (`data-perl-html-html5-parser-outlier`) to indicate
        its outlier status. Clearly, this is ignored when you use
        "force_within" because no outliers are returned. Some outliers may be
        XML::LibXML::Text nodes; text nodes don't have attributes, so these
        will not be marked with an attribute.
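
        A sketch of how "mark_outliers" might be used to separate marked
        outliers from everything else. It calls the method in list context
        to get individual nodes; text nodes are reported as "unmarked"
        because they cannot carry the attribute.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;
my $string = q{<b>Hello</b></td></tr> <i>World</i>};

# List context: a list of individual nodes is returned.
my @nodes = $parser->parse_balanced_chunk(
    $string,
    { within => 'td', mark_outliers => 1 },
);

for my $node (@nodes) {
    # Only element nodes can carry the outlier attribute.
    my $marked = $node->isa('XML::LibXML::Element')
        && $node->hasAttribute('data-perl-html-html5-parser-outlier');
    printf "%-9s %s\n", ($marked ? 'outlier' : 'unmarked'), $node->toString;
}
```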

        A last note is to mention what gets returned by this method. Normally
        it's an XML::LibXML::DocumentFragment object, but if you call the
        method in list context, a list of the individual nodes is returned.
        Alternatively you can request the data to be returned as an
        XML::LibXML::NodeList object:

         # Get an XML::LibXML::NodeList
         my $list = $parser->parse_balanced_chunk($str, {as=>'list'});

        The exact implementation of this method may change from version to
        version, but the long-term goal will be to approach how common desktop
        browsers parse HTML fragments when implementing the setter for DOM's
        `innerHTML` attribute.

    The push parser and SAX-based parser are not supported. Trying to change
    an option (such as recover_silently) will make HTML::HTML5::Parser carp a
    warning. (But you can inspect the options.)

  Error Handling
    Error handling is obviously different to XML::LibXML, as errors are (bugs
    notwithstanding) non-fatal.

    `error_handler`
        Get/set an error handling function. Must be set to a coderef or undef.

        The error handling function will be called with a single parameter, an
        HTML::HTML5::Parser::Error object.

    `errors`
        Returns a list of errors that occurred during the last parse.

        See HTML::HTML5::Parser::Error.
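
        A minimal sketch combining both methods: a handler collects error
        objects as they are reported, and `errors` is consulted after the
        parse. The error objects are simply interpolated into a string
        here; consult HTML::HTML5::Parser::Error for the accessors it
        actually offers.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;

# Collect every error object passed to the handler.
my @seen;
$parser->error_handler(sub {
    my ($error) = @_;
    push @seen, $error;
});

# Deliberately sloppy markup, taken from the SYNOPSIS.
my $doc = $parser->parse_string(
    "<!doctype html>\n<title>Foo</title>\n<p><b><i>Foo</b> bar</i>.\n<p>Baz</br>Quux.\n"
);

# Parsing still succeeds; the errors are available afterwards.
my @errors = $parser->errors;
print scalar(@seen),   " error(s) passed to the handler\n";
print scalar(@errors), " error(s) returned by errors\n";
print "  $_\n" for @errors;
```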

  Additional Methods
    The module provides a few methods to obtain additional, non-DOM data from
    DOM nodes.

    `dtd_public_id`
          $pubid = $parser->dtd_public_id( $doc );

        For an XML::LibXML::Document which has been returned by
        HTML::HTML5::Parser, using this method will tell you the Public
        Identifier of the DTD used (if any).

    `dtd_system_id`
          $sysid = $parser->dtd_system_id( $doc );

        For an XML::LibXML::Document which has been returned by
        HTML::HTML5::Parser, using this method will tell you the System
        Identifier of the DTD used (if any).

    `dtd_element`
          $element = $parser->dtd_element( $doc );

        For an XML::LibXML::Document which has been returned by
        HTML::HTML5::Parser, using this method will tell you the root element
        declared in the DTD used (if any). That is, if the document has this
        doctype:

          <!doctype html>

        ... it will return "html".

        This may return the empty string if a DTD was present but did not
        contain a root element; or undef if no DTD was present.
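
        The three doctype methods can be exercised together. This sketch
        uses an HTML 4.01 Strict doctype as sample input (chosen only
        because it has both identifiers) and prints whatever the parser
        reports.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;
my $doc    = $parser->parse_string(<<'EOT');
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<title>Foo</title>
<p>Bar
EOT

my $root  = $parser->dtd_element($doc);
my $pubid = $parser->dtd_public_id($doc);
my $sysid = $parser->dtd_system_id($doc);

# Any of these may be undef, so substitute a placeholder when printing.
printf "root element: %s\n", defined $root  ? $root  : '(none)';
printf "public id:    %s\n", defined $pubid ? $pubid : '(none)';
printf "system id:    %s\n", defined $sysid ? $sysid : '(none)';
```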

    `compat_mode`
          $mode = $parser->compat_mode( $doc );

        Returns 'quirks', 'limited quirks' or undef (standards mode).

    `charset`
          $charset = $parser->charset( $doc );

        The character set apparently used by the document.
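
        A small sketch that reports both values for two sample inputs, one
        with a doctype and one without. The samples and the `%mode` hash
        are illustrative only; undef from `compat_mode` is printed as
        "standards".

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser  = HTML::HTML5::Parser->new;
my %samples = (
    'with doctype'    => "<!doctype html>\n<title>Foo</title>\n<p>Bar\n",
    'without doctype' => "<title>Foo</title>\n<p>Bar\n",
);

my %mode;
for my $label (sort keys %samples) {
    my $doc     = $parser->parse_string($samples{$label});
    my $charset = $parser->charset($doc);
    $mode{$label} = $parser->compat_mode($doc);

    # undef from compat_mode means standards mode.
    printf "%-16s mode: %s, charset: %s\n",
        $label,
        defined $mode{$label} ? $mode{$label} : 'standards',
        defined $charset      ? $charset      : '(unknown)';
}
```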

    `source_line`
          ($line, $col) = $parser->source_line( $node );
          $line = $parser->source_line( $node );

        In scalar context, `source_line` returns the line number of the source
        code that started a particular node (element, attribute or comment).

        In list context, returns a tuple: $line, $column, $implicitness. Tab
        characters count as one column, not eight.

        $implicitness indicates that the node was not explicitly marked up in
        the source code, but its existence was inferred by the parser. For
        example, in the following markup, the HTML, TITLE and P elements are
        explicit, but the HEAD and BODY elements are implicit.

         <html>
          <title>I have an implicit head</title>
          <p>And an implicit body too!</p>
         </html>

        (Note that implicit elements do still have a line number and column
        number.) The implicitness indicator is a new feature, and I'd
        appreciate any bug reports where it gets things wrong.

        XML::LibXML::Node has a `line_number` method. In general this will
        return 0 and HTML::HTML5::Parser has no way of influencing it.
        However, if you install XML::LibXML::Devel::SetLineNumber on your
        system, the `line_number` method will start working (at least for
        elements).
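
        The list-context form can be sketched against the markup shown
        above. The loop and the `%info` hash are illustrative; the
        positions printed depend on the parser, so no particular output is
        claimed here.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;
my $doc    = $parser->parse_string(<<'EOT');
<html>
 <title>I have an implicit head</title>
 <p>And an implicit body too!</p>
</html>
EOT

my %info;
for my $name (qw(html head title body p)) {
    my ($element) = $doc->getElementsByTagName($name);
    next unless $element;

    # List context: line, column and the implicitness flag.
    my ($line, $column, $implicit) = $parser->source_line($element);
    $info{$name} = [ $line, $column, $implicit ];

    printf "%-5s line %s, column %s (%s)\n",
        $name,
        defined $line   ? $line   : '?',
        defined $column ? $column : '?',
        $implicit ? 'implicit' : 'explicit';
}
```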

SEE ALSO
    <http://suika.fam.cx/www/markup/html/whatpm/Whatpm/HTML.html>.

    HTML::HTML5::Writer, HTML::HTML5::Builder, XML::LibXML,
    XML::LibXML::PrettyPrint, XML::LibXML::Devel::SetLineNumber.

AUTHOR
    Toby Inkster, <tobyink@cpan.org>

COPYRIGHT AND LICENCE
    Copyright (C) 2007-2011 by Wakaba

    Copyright (C) 2009-2012 by Toby Inkster

    This library is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.

DISCLAIMER OF WARRANTIES
    THIS PACKAGE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED
    WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
    MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.