# htmlparser2

[![NPM version](http://img.shields.io/npm/v/htmlparser2.svg?style=flat)](https://npmjs.org/package/htmlparser2)
[![Downloads](https://img.shields.io/npm/dm/htmlparser2.svg?style=flat)](https://npmjs.org/package/htmlparser2)
[![Build Status](http://img.shields.io/travis/fb55/htmlparser2/master.svg?style=flat)](http://travis-ci.org/fb55/htmlparser2)
[![Coverage](http://img.shields.io/coveralls/fb55/htmlparser2.svg?style=flat)](https://coveralls.io/r/fb55/htmlparser2)

A forgiving HTML/XML/RSS parser. The parser can handle streams and provides a callback interface.

## Installation

    npm install htmlparser2

A live demo of htmlparser2 is available [here](https://astexplorer.net/#/2AmVrGuGVJ).

## Usage

```javascript
var htmlparser = require("htmlparser2");
var parser = new htmlparser.Parser({
	onopentag: function(name, attribs){
		if(name === "script" && attribs.type === "text/javascript"){
			console.log("JS! Hooray!");
		}
	},
	ontext: function(text){
		console.log("-->", text);
	},
	onclosetag: function(tagname){
		if(tagname === "script"){
			console.log("That's it?!");
		}
	}
}, {decodeEntities: true});
parser.write("Xyz <script type='text/javascript'>var foo = '<<bar>>';</ script>");
parser.end();
```

Output (simplified):

```
--> Xyz
JS! Hooray!
--> var foo = '<<bar>>';
That's it?!
```

## Documentation

Read more about the parser and its options in the [wiki](https://github.com/fb55/htmlparser2/wiki/Parser-options).

## Get a DOM
The `DomHandler` (known as `DefaultHandler` in the original `htmlparser` module) produces a DOM (document object model) that can be manipulated using the [`DomUtils`](https://github.com/fb55/DomUtils) helper.

The `DomHandler`, while still bundled with this module, was moved to its [own module](https://github.com/fb55/domhandler). Have a look at it for further information.

## Parsing RSS/RDF/Atom Feeds

```javascript
new htmlparser.FeedHandler(function(error, feed){
	// error: an Error or null; feed: the parsed feed object
	// ...
});
```

Note: While the provided feed handler works for most feeds, you might want to use [danmactough/node-feedparser](https://github.com/danmactough/node-feedparser), which is much better tested and actively maintained.

## Performance

After relying on artificial benchmarks for some time, __@AndreasMadsen__ published his [`htmlparser-benchmark`](https://github.com/AndreasMadsen/htmlparser-benchmark), which benchmarks HTML parsers based on real-world websites.

At the time of writing, the latest versions of all supported parsers show the following performance characteristics on [Travis CI](https://travis-ci.org/AndreasMadsen/htmlparser-benchmark/builds/10805007) (please note that Travis doesn't guarantee equal conditions for all tests):

```
gumbo-parser   : 34.9208 ms/file ± 21.4238
html-parser    : 24.8224 ms/file ± 15.8703
html5          : 419.597 ms/file ± 264.265
htmlparser     : 60.0722 ms/file ± 384.844
htmlparser2-dom: 12.0749 ms/file ± 6.49474
htmlparser2    : 7.49130 ms/file ± 5.74368
hubbub         : 30.4980 ms/file ± 16.4682
libxmljs       : 14.1338 ms/file ± 18.6541
parse5         : 22.0439 ms/file ± 15.3743
sax            : 49.6513 ms/file ± 26.6032
```

## How does this module differ from [node-htmlparser](https://github.com/tautologistics/node-htmlparser)?

This is a fork of the `htmlparser` module. The main difference is that this is intended to be used only with node (it runs on other platforms using [browserify](https://github.com/substack/node-browserify)). `htmlparser2` was rewritten multiple times and, while it maintains an API that's compatible with `htmlparser` in most cases, the projects don't share any code anymore.

The parser now provides a callback interface close to [sax.js](https://github.com/isaacs/sax-js) (originally targeted at [readabilitySAX](https://github.com/fb55/readabilitysax)).
As a result, old handlers won't work anymore.

The `DefaultHandler` and the `RssHandler` were renamed to clarify their purpose (to `DomHandler` and `FeedHandler`). The old names are still available when requiring `htmlparser2`, so your code should work as expected.