NAME
    HTML::HTML5::Parser - parse HTML reliably

SYNOPSIS
      use HTML::HTML5::Parser;

      my $parser = HTML::HTML5::Parser->new;
      my $doc    = $parser->parse_string(<<'EOT');
      <!doctype html>
      <title>Foo</title>
      <p><b><i>Foo</b> bar</i>.
      <p>Baz</br>Quux.
      EOT

      my $fdoc  = $parser->parse_file( $html_file_name );
      my $fhdoc = $parser->parse_fh( $html_file_handle );

DESCRIPTION
    This library is substantially the same as the non-CPAN module
    Whatpm::HTML. Changes include:

    * Provides an XML::LibXML-like DOM interface. If you usually use
      XML::LibXML's DOM parser, this should be a drop-in solution for
      tag soup HTML.

    * Constructs an XML::LibXML::Document as the result of parsing.

    * Via bundling and modifications, removed external dependencies on
      non-CPAN packages.

  Constructor
    `new`
          $parser = HTML::HTML5::Parser->new;
          # or
          $parser = HTML::HTML5::Parser->new(no_cache => 1);

        The constructor does nothing interesting besides take one flag
        argument, `no_cache => 1`, to disable the global element metadata
        cache. Disabling the cache is handy for conserving memory if you
        parse a large number of documents. However, with the cache
        disabled, methods such as `source_line` will not work when called
        as class methods; they must be called on an instance of this
        parser.

  XML::LibXML-Compatible Methods
    `parse_file`, `parse_html_file`
          $doc = $parser->parse_file( $html_file_name [,\%opts] );

        This function parses an HTML document from a file or network;
        $html_file_name can be either a filename or a URL.

        Options include 'encoding' to indicate file encoding (e.g. 'utf-8')
        and 'user_agent' which should be a blessed `LWP::UserAgent` (or
        HTTP::Tiny) object to be used when retrieving URLs.

        If requesting a URL and the response Content-Type header indicates
        an XML-based media type (such as XHTML), XML::LibXML::Parser will
        be used automatically (instead of the tag soup parser). The XML
        parser can be told to use a DTD catalogue by setting the option
        'xml_catalogue' to the filename of the catalogue.

        HTML (tag soup) parsing can be forced using the option
        'force_html', even when an XML media type is returned. If an
        options hashref was passed, parse_file will set
        $options->{'parser_used'} to the name of the class used to parse
        the URL, to allow the calling code to double-check which parser was
        used afterwards.

        If an options hashref was passed, parse_file will set
        $options->{'response'} to the HTTP::Response object obtained by
        retrieving the URI.

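        The options above can be combined as in the following minimal
        sketch. It assumes network access and that LWP::UserAgent is
        installed; `fetch_dom` is a hypothetical helper written for this
        illustration, not part of the module, and the URL in the usage
        note is a placeholder.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;
use LWP::UserAgent;

# Hypothetical helper (not part of HTML::HTML5::Parser): fetch a URL and
# force the tag soup parser even if the server reports an XML media type.
sub fetch_dom {
    my ($url) = @_;

    my $parser = HTML::HTML5::Parser->new;
    my %opts   = (
        user_agent => LWP::UserAgent->new(timeout => 10),
        force_html => 1,
    );

    my $doc = $parser->parse_file($url, \%opts);

    # parse_file writes these keys back into the hashref it was given.
    return ($doc, $opts{parser_used}, $opts{response});
}
```

        It might then be called as
        `my ($doc, $class, $res) = fetch_dom('http://example.com/');`,
        after which $class names the parser class that was used and $res
        holds the HTTP::Response object.
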
    `parse_fh`, `parse_html_fh`
          $doc = $parser->parse_fh( $io_fh [,\%opts] );

        `parse_fh()` parses an IOREF or a subclass of `IO::Handle`.

        Options include 'encoding' to indicate file encoding (e.g. 'utf-8').

    `parse_string`, `parse_html_string`
          $doc = $parser->parse_string( $html_string [,\%opts] );

        This function is similar to `parse_fh()`, but it parses an HTML
        document that is available as a single string in memory.

        Options include 'encoding' to indicate file encoding (e.g. 'utf-8').

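        As a rough illustration, the sketch below parses the same markup
        once from a string and once from a filehandle, then reads the
        title through ordinary XML::LibXML DOM calls. The in-memory
        filehandle is only a convenience for the example, and it is an
        assumption that `parse_fh` treats it like any other handle.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $html   = "<!doctype html>\n<title>Foo</title>\n<p>Baz</br>Quux.\n";
my $parser = HTML::HTML5::Parser->new;

# Parse directly from a string held in memory.
my $doc = $parser->parse_string($html, { encoding => 'utf-8' });

# Parse the same markup through a filehandle (opened on the scalar here;
# a handle on a real file would be passed in the same way).
open my $fh, '<', \$html or die "Cannot open in-memory handle: $!";
my $fdoc = $parser->parse_fh($fh, { encoding => 'utf-8' });
close $fh;

# Both results are XML::LibXML::Document objects.
my ($title)  = $doc->getElementsByTagName('title');
my ($ftitle) = $fdoc->getElementsByTagName('title');
print $title->textContent, "\n";
print $ftitle->textContent, "\n";
```
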
    `load_xml`, `load_html`
        Wrappers for the parse_* functions. These should be roughly
        compatible with the equivalently named functions in XML::LibXML.

        Note that `load_xml` first attempts to parse as real XML, falling
        back to HTML5 parsing; `load_html` just goes straight for HTML5.

    `parse_balanced_chunk`
          $fragment = $parser->parse_balanced_chunk( $string [,\%opts] );

        This method is roughly equivalent to XML::LibXML's method of the
        same name but, unlike XML::LibXML's and despite its name, it does
        not require the chunk to be "balanced". This method is somewhat
        black magic, but should work, and do the proper thing in most
        cases. Of course, the proper thing might not be what you'd expect!
        I'll try to keep this explanation as brief as possible...

        Consider the following string:

          <b>Hello</b></td></tr> <i>World</i>

        What is the proper way to parse that? If it were found in a
        document like this:

          <html>
           <head><title>X</title></head>
           <body>
            <div>
             <b>Hello</b></td></tr> <i>World</i>
            </div>
           </body>
          </html>

        Then the document would end up equivalent to the following XHTML:

          <html>
           <head><title>X</title></head>
           <body>
            <div>
             <b>Hello</b> <i>World</i>
            </div>
           </body>
          </html>

        The superfluous `</td></tr>` is simply ignored. However, if it were
        found in a document like this:

          <html>
           <head><title>X</title></head>
           <body>
            <table><tbody><tr><td>
             <b>Hello</b></td></tr> <i>World</i>
            </td></tr></tbody></table>
           </body>
          </html>

        Then the result would be:

          <html>
           <head><title>X</title></head>
           <body>
            <i>World</i>
            <table><tbody><tr><td>
             <b>Hello</b></td></tr>
            </tbody></table>
           </body>
          </html>

        Yes, `<i>World</i>` gets hoisted up before the `<table>`. This is
        weird, I know, but it's how browsers do it in real life.

        So what should:

          $string   = q{<b>Hello</b></td></tr> <i>World</i>};
          $fragment = $parser->parse_balanced_chunk($string);

        actually return? Well, you can choose...

          $string = q{<b>Hello</b></td></tr> <i>World</i>};

          $frag1 = $parser->parse_balanced_chunk($string, {within=>'div'});
          say $frag1->toString; # <b>Hello</b> <i>World</i>

          $frag2 = $parser->parse_balanced_chunk($string, {within=>'td'});
          say $frag2->toString; # <i>World</i><b>Hello</b>

        If you don't pass a "within" option, then the chunk is parsed as if
        it were within a `<div>` element. This is often the most sensible
        option. This method will croak in three cases: if you pass
        something like `{ within => "foobar" }` where "foobar" is not a
        real HTML element name (as found in the HTML5 spec); if you pass
        the name of a void element (e.g. "br" or "meta"); or if you pass
        one of a handful of other unsupported elements (namely:
        "noscript", "noembed", "noframes").

        Note that the second time around, although we parsed the string "as
        if it were within a `<td>` element", the `<i>World</i>` bit did not
        strictly end up within the `<td>` element (not even within the
        `<table>` element!) yet it still gets returned. We'll call things
        such as this "outliers". There is a "force_within" option which
        tells parse_balanced_chunk to ignore outliers:

          $frag3 = $parser->parse_balanced_chunk($string,
                                                 {force_within=>'td'});
          say $frag3->toString; # <b>Hello</b>

        There is a boolean option "mark_outliers" which marks each outlier
        with an attribute (`data-perl-html-html5-parser-outlier`) to
        indicate its outlier status. Clearly, this is ignored when you use
        "force_within" because no outliers are returned. Some outliers may
        be XML::LibXML::Text nodes; text nodes don't have attributes, so
        these will not be marked with an attribute.

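        A minimal sketch of how "mark_outliers" might be used to tell
        outliers apart from nodes that really ended up inside the `<td>`.
        It relies on the list-context return value of this method and only
        inspects element nodes, since text nodes cannot carry the marker
        attribute.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;
my $string = q{<b>Hello</b></td></tr> <i>World</i>};

# List context: the individual nodes are returned rather than a fragment.
my @nodes = $parser->parse_balanced_chunk(
    $string,
    { within => 'td', mark_outliers => 1 },
);

for my $node (@nodes) {
    # Text nodes have no attributes, so only elements can be checked.
    next unless $node->isa('XML::LibXML::Element');

    my $status = $node->hasAttribute('data-perl-html-html5-parser-outlier')
        ? 'outlier'
        : 'within td';
    printf "%s: %s\n", $node->nodeName, $status;
}
```
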
        A last note is to mention what gets returned by this method.
        Normally it's an XML::LibXML::DocumentFragment object, but if you
        call the method in list context, a list of the individual nodes is
        returned. Alternatively you can request the data to be returned as
        an XML::LibXML::NodeList object:

          # Get an XML::LibXML::NodeList
          my $list = $parser->parse_balanced_chunk($str, {as=>'list'});

        The exact implementation of this method may change from version to
        version, but the long-term goal will be to approach how common
        desktop browsers parse HTML fragments when implementing the setter
        for DOM's `innerHTML` attribute.

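        The three return styles described above can be compared side by
        side. This is only a sketch of the calling conventions; the number
        of nodes reported depends on how the parser treats the whitespace
        in the chunk.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;
my $string = q{<b>Hello</b></td></tr> <i>World</i>};

# Scalar context: a single XML::LibXML::DocumentFragment.
my $frag = $parser->parse_balanced_chunk($string);

# List context: the individual nodes.
my @nodes = $parser->parse_balanced_chunk($string);

# Explicit request for an XML::LibXML::NodeList.
my $list = $parser->parse_balanced_chunk($string, { as => 'list' });

printf "fragment class: %s\n", ref $frag;
printf "nodes in list context: %d\n", scalar @nodes;
printf "nodes in NodeList: %d\n", $list->size;
```
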
    The push parser and SAX-based parser are not supported. Trying to
    change an option (such as recover_silently) will make
    HTML::HTML5::Parser carp a warning. (But you can inspect the options.)

  Error Handling
    Error handling is obviously different to XML::LibXML, as errors are
    (bugs notwithstanding) non-fatal.

    `error_handler`
        Get/set an error handling function. Must be set to a coderef or
        undef.

        The error handling function will be called with a single parameter,
        a HTML::HTML5::Parser::Error object.

    `errors`
        Returns a list of errors that occurred during the last parse.

        See HTML::HTML5::Parser::Error.

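    The two methods can be used together, as in this minimal sketch. It
    assumes that `error_handler` is set by passing the coderef as an
    argument, and it only counts the error objects rather than calling any
    methods on them; see HTML::HTML5::Parser::Error for what those objects
    offer.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;

# Collect errors as they are reported during parsing.
my @seen;
$parser->error_handler(sub {
    my ($error) = @_;    # an HTML::HTML5::Parser::Error object
    push @seen, $error;
});

# Misnested <b>/<i> and the stray </br> are errors, but not fatal ones.
my $doc = $parser->parse_string(
    "<!doctype html>\n<title>Foo</title>\n<p><b><i>Foo</b> bar</i>.\n<p>Baz</br>Quux.\n"
);

# The same errors can be fetched after the parse has finished.
my @errors = $parser->errors;

printf "handler saw %d error(s); errors() returned %d\n",
    scalar @seen, scalar @errors;
```
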
  Additional Methods
    The module provides a few methods to obtain additional, non-DOM data
    from DOM nodes.

    `dtd_public_id`
          $pubid = $parser->dtd_public_id( $doc );

        For an XML::LibXML::Document which has been returned by
        HTML::HTML5::Parser, using this method will tell you the Public
        Identifier of the DTD used (if any).

    `dtd_system_id`
          $sysid = $parser->dtd_system_id( $doc );

        For an XML::LibXML::Document which has been returned by
        HTML::HTML5::Parser, using this method will tell you the System
        Identifier of the DTD used (if any).

    `dtd_element`
          $element = $parser->dtd_element( $doc );

        For an XML::LibXML::Document which has been returned by
        HTML::HTML5::Parser, using this method will tell you the root
        element declared in the DTD used (if any). That is, if the document
        has this doctype:

          <!doctype html>

        ... it will return "html".

        This may return the empty string if a DTD was present but did not
        contain a root element; or undef if no DTD was present.

    `compat_mode`
          $mode = $parser->compat_mode( $doc );

        Returns 'quirks', 'limited quirks' or undef (standards mode).

    `charset`
          $charset = $parser->charset( $doc );

        The character set apparently used by the document.

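    A short sketch of how these document-level methods might be used
    together after a parse. The fallback strings in the output are
    placeholders chosen for this example.

```perl
use strict;
use warnings;
use 5.010;    # for the // operator
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;
my $doc    = $parser->parse_string(
    "<!doctype html>\n<title>Foo</title>\n<p>Bar\n"
);

# Each method takes the document that this parser returned.
my $root    = $parser->dtd_element($doc);
my $pubid   = $parser->dtd_public_id($doc);
my $sysid   = $parser->dtd_system_id($doc);
my $mode    = $parser->compat_mode($doc);
my $charset = $parser->charset($doc);

printf "doctype root: %s\n", $root    // '(no DTD)';
printf "public id:    %s\n", $pubid   // '(none)';
printf "system id:    %s\n", $sysid   // '(none)';
printf "compat mode:  %s\n", $mode    // 'standards';
printf "charset:      %s\n", $charset // '(unknown)';
```
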
    `source_line`
          ($line, $col, $implicitness) = $parser->source_line( $node );
          $line = $parser->source_line( $node );

        In scalar context, `source_line` returns the line number of the
        source code that started a particular node (element, attribute or
        comment).

        In list context, returns a tuple: $line, $column, $implicitness.
        Tab characters count as one column, not eight.

        $implicitness indicates that the node was not explicitly marked up
        in the source code, but its existence was inferred by the parser.
        For example, in the following markup, the HTML, TITLE and P
        elements are explicit, but the HEAD and BODY elements are implicit.

          <html>
           <title>I have an implicit head</title>
           <p>And an implicit body too!</p>
          </html>

        (Note that implicit elements do still have a line number and column
        number.) The implicitness indicator is a new feature, and I'd
        appreciate any bug reports where it gets things wrong.

        XML::LibXML::Node has a `line_number` method. This will normally
        return 0, and HTML::HTML5::Parser has no way of influencing it.
        However, if you install XML::LibXML::Devel::SetLineNumber on your
        system, the `line_number` method will start working (at least for
        elements).

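        The markup above can be fed back through the parser to see which
        elements are reported as implicit. This is a sketch only: it looks
        elements up by tag name for brevity, and prints whatever positions
        the parser reports without assuming particular values.

```perl
use strict;
use warnings;
use 5.010;    # for the // operator
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;
my $doc    = $parser->parse_string(
    "<html>\n<title>I have an implicit head</title>\n"
  . "<p>And an implicit body too!</p>\n</html>\n"
);

for my $name (qw(html head title body p)) {
    my ($element) = $doc->getElementsByTagName($name) or next;

    # List context: line, column and the implicitness flag.
    my ($line, $col, $implicit) = $parser->source_line($element);

    printf "%-5s line %s, column %s%s\n",
        $name, $line // '?', $col // '?',
        $implicit ? ' (implicit)' : '';
}

# Scalar context: just the line number.
my ($p)    = $doc->getElementsByTagName('p');
my $p_line = $parser->source_line($p);
```
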
SEE ALSO
    <http://suika.fam.cx/www/markup/html/whatpm/Whatpm/HTML.html>.

    HTML::HTML5::Writer, HTML::HTML5::Builder, XML::LibXML,
    XML::LibXML::PrettyPrint, XML::LibXML::Devel::SetLineNumber.

AUTHOR
    Toby Inkster, <tobyink@cpan.org>

COPYRIGHT AND LICENCE
    Copyright (C) 2007-2011 by Wakaba

    Copyright (C) 2009-2012 by Toby Inkster

    This library is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.

DISCLAIMER OF WARRANTIES
    THIS PACKAGE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED
    WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
    MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.