:mod:`html.parser` --- Simple HTML and XHTML parser
===================================================

.. module:: html.parser
   :synopsis: A simple parser that can handle HTML and XHTML.

**Source code:** :source:`Lib/html/parser.py`

.. index::
   single: HTML
   single: XHTML

--------------

This module defines a class :class:`HTMLParser` which serves as the basis for
parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.

.. class:: HTMLParser(*, convert_charrefs=True)

   Create a parser instance able to parse invalid markup.

   If *convert_charrefs* is ``True`` (the default), all character
   references (except the ones in ``script``/``style`` elements) are
   automatically converted to the corresponding Unicode characters.

   An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
   when start tags, end tags, text, comments, and other markup elements are
   encountered.  The user should subclass :class:`.HTMLParser` and override its
   methods to implement the desired behavior.

   This parser does not check that end tags match start tags or call the end-tag
   handler for elements which are closed implicitly by closing an outer element.

   .. versionchanged:: 3.4
      *convert_charrefs* keyword argument added.

   .. versionchanged:: 3.5
      The default value for argument *convert_charrefs* is now ``True``.

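The effect of *convert_charrefs* can be seen by recording which handlers fire
for the same input under both settings.  The sketch below is illustrative only:
``EventRecorder`` and ``record_events`` are hypothetical names introduced for
this example and are not part of the module.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper that records text-related handler calls."""

    def __init__(self, *, convert_charrefs=True):
        super().__init__(convert_charrefs=convert_charrefs)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        self.events.append(("charref", name))


def record_events(markup, *, convert_charrefs=True):
    """Parse *markup* and return the recorded (handler, argument) pairs."""
    parser = EventRecorder(convert_charrefs=convert_charrefs)
    parser.feed(markup)
    parser.close()
    return parser.events


markup = "<p>1 &lt; 2 &#38; 3</p>"
converted = record_events(markup)
unconverted = record_events(markup, convert_charrefs=False)

print(converted)    # a single data event with the references decoded
print(unconverted)  # data events interleaved with entityref/charref events
```

With the default setting the references arrive already decoded inside the text
passed to ``handle_data``; with ``convert_charrefs=False`` the subclass receives
them separately and has to decode them itself.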

Example HTML Parser Application
-------------------------------

As a basic example, below is a simple HTML parser that uses the
:class:`HTMLParser` class to print out start tags, end tags, and data
as they are encountered::

   from html.parser import HTMLParser

   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print("Encountered a start tag:", tag)

       def handle_endtag(self, tag):
           print("Encountered an end tag :", tag)

       def handle_data(self, data):
           print("Encountered some data  :", data)

   parser = MyHTMLParser()
   parser.feed('<html><head><title>Test</title></head>'
               '<body><h1>Parse me!</h1></body></html>')

The output will then be:

.. code-block:: none

   Encountered a start tag: html
   Encountered a start tag: head
   Encountered a start tag: title
   Encountered some data  : Test
   Encountered an end tag : title
   Encountered an end tag : head
   Encountered a start tag: body
   Encountered a start tag: h1
   Encountered some data  : Parse me!
   Encountered an end tag : h1
   Encountered an end tag : body
   Encountered an end tag : html


:class:`.HTMLParser` Methods
----------------------------

:class:`HTMLParser` instances have the following methods:


.. method:: HTMLParser.feed(data)

   Feed some text to the parser.  It is processed insofar as it consists of
   complete elements; incomplete data is buffered until more data is fed or
   :meth:`close` is called.  *data* must be :class:`str`.


.. method:: HTMLParser.close()

   Force processing of all buffered data as if it were followed by an end-of-file
   mark.  This method may be redefined by a derived class to define additional
   processing at the end of the input, but the redefined version should always call
   the :class:`HTMLParser` base class method :meth:`close`.

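The interplay between :meth:`feed` and :meth:`close` might be sketched as
follows.  ``TextBuffer`` and its attributes are hypothetical names used only
for this illustration; the point is that an incomplete start tag triggers no
handler call until the rest arrives, and that an overridden :meth:`close`
delegates to the base class before doing its own work.

```python
from html.parser import HTMLParser


class TextBuffer(HTMLParser):
    """Hypothetical subclass that collects tag names and text."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []
        self.finished = False

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)

    def close(self):
        # Let the base class flush its buffer first, then do the extra work.
        super().close()
        self.finished = True


parser = TextBuffer()
parser.feed('<p cla')                  # incomplete start tag: only buffered
tags_after_first_chunk = list(parser.tags)
parser.feed('ss="intro">Hello, ')      # the start tag is now complete
parser.feed('world</p>')
parser.close()

print(tags_after_first_chunk)          # []
print(parser.tags)                     # ['p']
print("".join(parser.text))            # Hello, world
```

Joining the collected pieces, rather than relying on a single ``handle_data``
call, keeps the result independent of how the input happened to be chunked.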

.. method:: HTMLParser.reset()

   Reset the instance, discarding all unprocessed data.  This is called
   implicitly at instantiation time.


.. method:: HTMLParser.getpos()

   Return the current line number and offset as a ``(lineno, offset)`` tuple.

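As a rough illustration of :meth:`getpos`, the hypothetical ``PositionRecorder``
below stores the position reported while each start tag is being handled.  In
this sketch the line number counts from 1 and the offset from 0, and the
position refers to the beginning of the construct being handled.

```python
from html.parser import HTMLParser


class PositionRecorder(HTMLParser):
    """Hypothetical subclass that records where each start tag begins."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() returns a (line number, offset) tuple.
        self.positions.append((tag, self.getpos()))


parser = PositionRecorder()
parser.feed("<html>\n  <body>\n<p>text</p>")
parser.close()

for tag, (lineno, offset) in parser.positions:
    print(f"<{tag}> starts at line {lineno}, offset {offset}")
```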

.. method:: HTMLParser.get_starttag_text()

   Return the text of the most recently opened start tag.  This should not normally
   be needed for structured processing, but may be useful in dealing with HTML "as
   deployed" or for re-generating input with minimal changes (whitespace between
   attributes can be preserved, etc.).

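A small sketch of the difference between the normalized arguments and the raw
text returned by :meth:`get_starttag_text`; ``RawTagRecorder`` is a
hypothetical name chosen for this example.

```python
from html.parser import HTMLParser


class RawTagRecorder(HTMLParser):
    """Hypothetical subclass that keeps both views of each start tag."""

    def __init__(self):
        super().__init__()
        self.parsed = []
        self.raw = []

    def handle_starttag(self, tag, attrs):
        self.parsed.append((tag, attrs))           # lower-cased, unquoted
        self.raw.append(self.get_starttag_text())  # exactly as written


parser = RawTagRecorder()
parser.feed('<A  HREF="x"   Class=big>text</A>')
parser.close()

print(parser.parsed)  # [('a', [('href', 'x'), ('class', 'big')])]
print(parser.raw)     # ['<A  HREF="x"   Class=big>']
```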

The following methods are called when data or markup elements are encountered
and they are meant to be overridden in a subclass.  The base class
implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):


.. method:: HTMLParser.handle_starttag(tag, attrs)

   This method is called to handle the start of a tag (e.g. ``<div id="main">``).

   The *tag* argument is the name of the tag converted to lower case. The *attrs*
   argument is a list of ``(name, value)`` pairs containing the attributes found
   inside the tag's ``<>`` brackets.  The *name* is converted to lower case,
   quotes around the *value* are removed, and character and entity references
   in the *value* are replaced.

   For instance, for the tag ``<A HREF="https://www.cwi.nl/">``, this method
   would be called as ``handle_starttag('a', [('href', 'https://www.cwi.nl/')])``.

   All entity references from :mod:`html.entities` are replaced in the attribute
   values.

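The shape of the *attrs* argument can be illustrated with a link collector.
``LinkCollector`` is a hypothetical subclass written for this sketch; it also
shows that an attribute written without a value is reported with ``None`` as
its value.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical subclass that gathers hrefs and valueless attributes."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.valueless = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names are lower case.
        for name, value in attrs:
            if value is None:
                self.valueless.append((tag, name))
            elif tag == "a" and name == "href":
                self.links.append(value)


parser = LinkCollector()
parser.feed('<A HREF="/search?q=1&amp;page=2">Search</A>'
            '<input type=checkbox CHECKED>')
parser.close()

print(parser.links)      # ['/search?q=1&page=2']
print(parser.valueless)  # [('input', 'checked')]
```

Iterating over the pairs, rather than converting *attrs* to a dictionary,
preserves duplicate attribute names and their original order.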

.. method:: HTMLParser.handle_endtag(tag)

   This method is called to handle the end tag of an element (e.g. ``</div>``).

   The *tag* argument is the name of the tag converted to lower case.


.. method:: HTMLParser.handle_startendtag(tag, attrs)

   Similar to :meth:`handle_starttag`, but called when the parser encounters an
   XHTML-style empty tag (``<img ... />``).  This method may be overridden by
   subclasses which require this particular lexical information; the default
   implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`.

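The default behavior and an override can be compared side by side.  Both
classes below are hypothetical and exist only for this sketch: the first relies
on the base implementation, the second overrides
:meth:`handle_startendtag` to keep the lexical distinction.

```python
from html.parser import HTMLParser


class TagEvents(HTMLParser):
    """Hypothetical subclass that records start and end tag events."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


class EmptyTagAware(TagEvents):
    """Same as TagEvents, but reports XHTML-style empty tags separately."""

    def handle_startendtag(self, tag, attrs):
        self.events.append(("startend", tag))


def events_for(parser_class, markup):
    parser = parser_class()
    parser.feed(markup)
    parser.close()
    return parser.events


default_events = events_for(TagEvents, "<br><br/>")
aware_events = events_for(EmptyTagAware, "<br><br/>")

print(default_events)  # [('start', 'br'), ('start', 'br'), ('end', 'br')]
print(aware_events)    # [('start', 'br'), ('startend', 'br')]
```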

.. method:: HTMLParser.handle_data(data)

   This method is called to process arbitrary data (e.g. text nodes and the
   content of ``<script>...</script>`` and ``<style>...</style>``).


.. method:: HTMLParser.handle_entityref(name)

   This method is called to process a named character reference of the form
   ``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference
   (e.g. ``'gt'``).  This method is never called if *convert_charrefs* is
   ``True``.


.. method:: HTMLParser.handle_charref(name)

   This method is called to process decimal and hexadecimal numeric character
   references of the form ``&#NNN;`` and ``&#xNNN;``.  For example, the decimal
   equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
   in this case the method will receive ``'62'`` or ``'x3E'``.  This method
   is never called if *convert_charrefs* is ``True``.


.. method:: HTMLParser.handle_comment(data)

   This method is called when a comment is encountered (e.g. ``<!--comment-->``).

   For example, the comment ``<!-- comment -->`` will cause this method to be
   called with the argument ``' comment '``.

   The content of Internet Explorer conditional comments (condcoms) will also be
   sent to this method, so, for ``<!--[if IE 9]>IE9-specific content<![endif]-->``,
   this method will receive ``'[if IE 9]>IE9-specific content<![endif]'``.


.. method:: HTMLParser.handle_decl(decl)

   This method is called to handle an HTML doctype declaration (e.g.
   ``<!DOCTYPE html>``).

   The *decl* parameter will be the entire contents of the declaration inside
   the ``<!...>`` markup (e.g. ``'DOCTYPE html'``).


.. method:: HTMLParser.handle_pi(data)

   This method is called when a processing instruction is encountered.  The
   *data* parameter will contain the entire processing instruction. For example,
   for the processing instruction ``<?proc color='red'>``, this method would be
   called as ``handle_pi("proc color='red'")``.  It is intended to be overridden
   by a derived class; the base class implementation does nothing.

   .. note::

      The :class:`HTMLParser` class uses the SGML syntactic rules for processing
      instructions.  An XHTML processing instruction using the trailing ``'?'`` will
      cause the ``'?'`` to be included in *data*.

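The note about the trailing ``'?'`` can be made concrete with a short sketch.
``PICollector`` is a hypothetical subclass; stripping the ``'?'`` afterwards is
one possible way for a caller to normalize XHTML-style instructions, not
something the module does itself.

```python
from html.parser import HTMLParser


class PICollector(HTMLParser):
    """Hypothetical subclass that stores processing instructions."""

    def __init__(self):
        super().__init__()
        self.instructions = []

    def handle_pi(self, data):
        self.instructions.append(data)


parser = PICollector()
parser.feed("<?proc color='red'>"
            '<?xml version="1.0"?>')
parser.close()

print(parser.instructions)
# ["proc color='red'", 'xml version="1.0"?']

# Optional clean-up done by the caller: drop one trailing '?' if present.
normalized = [pi[:-1] if pi.endswith("?") else pi
              for pi in parser.instructions]
print(normalized)
```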

.. method:: HTMLParser.unknown_decl(data)

   This method is called when an unrecognized declaration is read by the parser.

   The *data* parameter will be the entire contents of the declaration inside
   the ``<![...]>`` markup.  Overriding it in a derived class is sometimes
   useful; the base class implementation does nothing.


.. _htmlparser-examples:

Examples
--------

The following class implements a parser that will be used to illustrate more
examples::

   from html.parser import HTMLParser
   from html.entities import name2codepoint

   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print("Start tag:", tag)
           for attr in attrs:
               print("     attr:", attr)

       def handle_endtag(self, tag):
           print("End tag  :", tag)

       def handle_data(self, data):
           print("Data     :", data)

       def handle_comment(self, data):
           print("Comment  :", data)

       def handle_entityref(self, name):
           c = chr(name2codepoint[name])
           print("Named ent:", c)

       def handle_charref(self, name):
           if name.startswith('x'):
               c = chr(int(name[1:], 16))
           else:
               c = chr(int(name))
           print("Num ent  :", c)

       def handle_decl(self, data):
           print("Decl     :", data)

   parser = MyHTMLParser()

Parsing a doctype::

   >>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
   ...             '"http://www.w3.org/TR/html4/strict.dtd">')
   Decl     : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"

Parsing an element with a few attributes and a title::

   >>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
   Start tag: img
        attr: ('src', 'python-logo.png')
        attr: ('alt', 'The Python logo')
   >>>
   >>> parser.feed('<h1>Python</h1>')
   Start tag: h1
   Data     : Python
   End tag  : h1

The content of ``script`` and ``style`` elements is returned as is, without
further parsing::

   >>> parser.feed('<style type="text/css">#python { color: green }</style>')
   Start tag: style
        attr: ('type', 'text/css')
   Data     : #python { color: green }
   End tag  : style

   >>> parser.feed('<script type="text/javascript">'
   ...             'alert("<strong>hello!</strong>");</script>')
   Start tag: script
        attr: ('type', 'text/javascript')
   Data     : alert("<strong>hello!</strong>");
   End tag  : script

Parsing comments::

   >>> parser.feed('<!-- a comment -->'
   ...             '<!--[if IE 9]>IE-specific content<![endif]-->')
   Comment  :  a comment
   Comment  : [if IE 9]>IE-specific content<![endif]

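Parsing named and numeric character references and converting them to the
correct char (note: these 3 references are all equivalent to ``'>'``).  The
reference handlers are only called when *convert_charrefs* is ``False``, so
the parser is re-created with that setting::

   >>> parser = MyHTMLParser(convert_charrefs=False)
   >>> parser.feed('&gt;&#62;&#x3E;')
   Named ent: >
   Num ent  : >
   Num ent  : >
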
Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
:meth:`~HTMLParser.handle_data` might be called more than once
(unless *convert_charrefs* is set to ``True``)::

   >>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
   ...     parser.feed(chunk)
   ...
   Start tag: span
   Data     : buff
   Data     : ered
   Data     : text
   End tag  : span

Parsing invalid HTML (e.g. unquoted attributes) also works::

   >>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
   Start tag: p
   Start tag: a
        attr: ('class', 'link')
        attr: ('href', '#main')
   Data     : tag soup
   End tag  : p
   End tag  : a

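Because the parser reports end tags in the order they appear and does not
verify nesting, a subclass that cares about balanced markup has to track open
elements itself.  The following is a minimal sketch of one way to do that with
a stack.  ``NestingChecker``, ``VOID_ELEMENTS`` and the reporting format are
assumptions made for this example, and the void element list is copied from
the HTML standard rather than provided by this module.

```python
from html.parser import HTMLParser

# Elements that never take an end tag (assumed list, from the HTML standard).
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
})


class NestingChecker(HTMLParser):
    """Hypothetical subclass that flags end tags not matching the open element."""

    def __init__(self):
        super().__init__()
        self.open_elements = []   # stack of currently open tag names
        self.mismatched = []      # end tags that did not match the stack top

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.open_elements.append(tag)

    def handle_startendtag(self, tag, attrs):
        # XHTML-style empty tags open and close at once; nothing to track.
        pass

    def handle_endtag(self, tag):
        if self.open_elements and self.open_elements[-1] == tag:
            self.open_elements.pop()
        else:
            self.mismatched.append(tag)


checker = NestingChecker()
checker.feed('<p><a class=link href=#main>tag soup</p ></a>')
checker.close()

print(checker.mismatched)     # ['p']
print(checker.open_elements)  # ['p']
```

This sketch only reports mismatches; it makes no attempt to recover the way a
browser would, for example by closing elements implicitly.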