| Name | Date | Size | #Lines | LOC |
| --- | --- | --- | --- | --- |
| benchmarks/ | 03-May-2022 | - | 75 | 49 |
| examples/ | 04-Sep-2020 | - | 969 | 658 |
| python/gumbo/ | 04-Sep-2020 | - | 1,268 | 910 |
| src/ | 04-Sep-2020 | - | 36,091 | 33,233 |
| tests/ | 04-Sep-2020 | - | 4,083 | 3,054 |
| visualc/ | 04-Sep-2020 | - | 214 | 212 |
| .clang-format | 04-Sep-2020 | 1.9 KiB | 66 | 64 |
| .gitignore | 04-Sep-2020 | 878 | 80 | 70 |
| .travis.yml | 04-Sep-2020 | 640 | 27 | 22 |
| CHANGES.md | 04-Sep-2020 | 2.6 KiB | 64 | 50 |
| CONTRIBUTING.md | 04-Sep-2020 | 3.6 KiB | 40 | 27 |
| COPYING | 04-Sep-2020 | 11.1 KiB | 202 | 169 |
| DEBUGGING.md | 04-Sep-2020 | 3.5 KiB | 108 | 83 |
| Doxyfile | 04-Sep-2020 | 73.8 KiB | 1,782 | 1,290 |
| Makefile.am | 04-Sep-2020 | 3.8 KiB | 119 | 86 |
| README.md | 04-Sep-2020 | 8.3 KiB | 219 | 175 |
| THANKS | 04-Sep-2020 | 595 | 28 | 25 |
| appveyor.yml | 04-Sep-2020 | 82 | 5 | 4 |
| autogen.sh | 04-Sep-2020 | 1.2 KiB | 33 | 15 |
| configure.ac | 04-Sep-2020 | 929 | 36 | 27 |
| genperf.py | 04-Sep-2020 | 861 | 31 | 23 |
| gentags.py | 04-Sep-2020 | 971 | 35 | 26 |
| gtest.gyp | 04-Sep-2020 | 4.4 KiB | 134 | 132 |
| gumbo.pc.in | 04-Sep-2020 | 238 | 12 | 10 |
| gumbo_parser.gyp | 04-Sep-2020 | 2 KiB | 77 | 75 |
| setup.py | 04-Sep-2020 | 6.4 KiB | 186 | 138 |
README.md

Gumbo - A pure-C HTML5 parser.
============

[![Build Status](https://travis-ci.org/google/gumbo-parser.svg?branch=master)](https://travis-ci.org/google/gumbo-parser) [![Build status](https://ci.appveyor.com/api/projects/status/k5xxn4bxf62ao2cp?svg=true)](https://ci.appveyor.com/project/nostrademons/gumbo-parser)

Gumbo is an implementation of the [HTML5 parsing algorithm][], written
as a pure C99 library with no outside dependencies.  It's designed to serve
as a building block for other tools and libraries such as linters,
validators, templating languages, and refactoring and analysis tools.

Goals & features:

* Fully conformant with the [HTML5 spec][].
* Robust and resilient to bad input.
* Simple API that can be easily wrapped by other languages.
* Support for source locations and pointers back to the original text.
* Support for fragment parsing.
* Relatively lightweight, with no outside dependencies.
* Passes all [html5lib tests][], including the template tag.
* Tested on over 2.5 billion pages from Google's index.

Non-goals:

* Execution speed.  Gumbo gains some of this by virtue of being written in
  C, but it is not an important consideration for the intended use-case, and
  was not a major design factor.
* Support for encodings other than UTF-8.  For the most part, client code
  can convert the input stream to UTF-8 text using another library before
  processing.
* Mutability.  Gumbo is intentionally designed to turn an HTML document into a
  parse tree, and free that parse tree all at once.  It's not designed to
  persistently store nodes or subtrees outside of the parse tree, or to perform
  arbitrary DOM mutations within your program.  If you need this functionality,
  we recommend translating the Gumbo parse tree into a mutable DOM
  representation more suited for the particular needs of your program before
  operating on it.
* C89 support.  Most major compilers support C99 by now; the major exception
  (Microsoft Visual Studio) should be able to compile this in C++ mode with
  relatively few changes.  (Bug reports welcome.)
* ~~Security.  Gumbo was initially designed for a product that worked with
  trusted input files only.  We're working to harden this and make sure that it
  behaves as expected even on malicious input, but for now, Gumbo should only be
  run on trusted input or within a sandbox.~~ Gumbo underwent a number of
  security fixes and passed Google's security review as of version 0.9.1.

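As an illustration of the UTF-8 non-goal above, here is a minimal sketch of
converting input to UTF-8 before handing it to Gumbo.  It assumes the input is
ISO-8859-1 (Latin-1), where every byte maps directly to one Unicode code point.
`latin1_to_utf8` is a hypothetical helper written for this example, not part of
Gumbo; for other encodings a real conversion library is the better choice.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of Gumbo): converts an ISO-8859-1 buffer to a
 * newly allocated, NUL-terminated UTF-8 string.  The caller frees the result.
 * Returns NULL if allocation fails. */
char* latin1_to_utf8(const unsigned char* in, size_t len) {
  assert(in != NULL || len == 0);
  /* Each Latin-1 byte expands to at most two UTF-8 bytes. */
  char* out = malloc(2 * len + 1);
  if (out == NULL) return NULL;
  size_t j = 0;
  for (size_t i = 0; i < len; ++i) {
    unsigned char c = in[i];
    if (c < 0x80) {
      out[j++] = (char) c;                   /* ASCII maps to itself */
    } else {
      out[j++] = (char) (0xC0 | (c >> 6));   /* lead byte: 110000xx */
      out[j++] = (char) (0x80 | (c & 0x3F)); /* continuation byte: 10xxxxxx */
    }
  }
  out[j] = '\0';
  return out;
}
```

The returned string can then be passed to `gumbo_parse` and freed once the
parse output is no longer needed.
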
Wishlist (aka "We couldn't get these into the original release, but are
hoping to add them soon"):

* Full-featured error reporting.
* Additional performance improvements.
* DOM wrapper library/libraries (possibly within other language bindings).
* Query libraries, to extract information from parse trees using CSS or XPath.

Installation
============

To build and install the library, issue the standard UNIX incantation from
the root of the distribution:

```bash
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install
```

Gumbo comes with full pkg-config support, so you can use `pkg-config` to
print the flags needed to link your program against it:

```bash
$ pkg-config --cflags gumbo         # print compiler flags
$ pkg-config --libs gumbo           # print linker flags
$ pkg-config --cflags --libs gumbo  # print both
```

For example:

```bash
$ gcc my_program.c `pkg-config --cflags --libs gumbo`
```

See the pkg-config man page for more info.

There are a number of sample programs in the examples/ directory.  They're
built automatically by 'make', but can also be made individually with
`make <programname>` (e.g. `make clean_text`).

To run the unit tests, you'll need to have [googletest][] downloaded and
unzipped.  The googletest maintainers recommend against using
`make install`; instead, symlink the root googletest directory to 'gtest'
inside gumbo's root directory, and then `make check`:

```bash
$ unzip gtest-1.6.0.zip
$ cd gumbo-*
$ ln -s ../gtest-1.6.0 gtest
$ make check
```

Gumbo's `make check` has code to automatically configure & build gtest and
then link in the library.

Debian and Fedora users can install libgtest with:

```bash
$ apt-get install libgtest-dev  # Debian/Ubuntu
$ yum install gtest-devel       # CentOS/Fedora
```

Note for Ubuntu users: the libgtest-dev package only installs source files.
You have to build the libraries yourself using cmake:

    $ sudo apt-get install cmake
    $ cd /usr/src/gtest
    $ sudo cmake CMakeLists.txt
    $ sudo make
    $ sudo cp *.a /usr/lib

The configure script will detect the presence of the library and use that
instead.

Note that you need to have superuser privileges to execute these commands.
On most distros, you can prefix the commands above with `sudo` to execute
them as the superuser.

Debian installs usually don't have `sudo` installed (Ubuntu, however, does).
Switch users first with `su -`, then run `apt-get`.

Basic Usage
===========

Within your program, you need to include "gumbo.h" and then issue a call to
`gumbo_parse`:

```C
#include "gumbo.h"

int main(void) {
  // Parse a NUL-terminated UTF-8 string into a parse tree.
  GumboOutput* output = gumbo_parse("<h1>Hello, World!</h1>");
  // Do stuff with output->root
  // Free the whole parse tree in one call.
  gumbo_destroy_output(&kGumboDefaultOptions, output);
  return 0;
}
```

See the API documentation and sample programs for more details.

A note on API/ABI compatibility
===============================

We'll make a best effort to preserve API compatibility between releases.
The initial release is a 0.9 (beta) release to solicit comments from early
adopters, but if no major problems are found with the API, a 1.0 release
will follow shortly, and the API of that should be considered stable.  If
changes are necessary, we follow [semantic versioning][].

We make no such guarantees about the ABI, and it's very likely that
subsequent versions will require a recompile of client code.  For this
reason, we recommend NOT using Gumbo data structures throughout a program,
and instead limiting them to a translation layer that picks out whatever
data is needed from the parse tree and then converts that to persistent
data structures more appropriate for the application.  The API is
structured to encourage this use, with a single delete function for the
whole parse tree, and is not designed with mutation in mind.

Python usage
============

To install the Python bindings, make sure that the
C library is installed first, and then run `sudo python setup.py install` from
the root of the distro.  This installs a 'gumbo' module; `pydoc gumbo`
should tell you about it.

Recommended best-practice for Python usage is to use one of the adapters to
an existing API (personally, I prefer BeautifulSoup) and write your program
in terms of those.  The raw ctypes bindings should be considered building
blocks for higher-level libraries and rarely referenced directly.

External Bindings and other wrappers
====================================

The following language bindings or other tools/wrappers are maintained by
various contributors in other repositories:

* C++: [gumbo-query] by lazytiger
* Ruby:
  * [ruby-gumbo] by Nicolas Martyanoff
  * [nokogumbo] by Sam Ruby
* Node.js: [node-gumbo-parser] by Karl Westin
* D: [gumbo-d] by Christopher Bertels
* Lua: [lua-gumbo] by Craig Barnes
* Objective-C:
  * [ObjectiveGumbo] by Programming Thomas
  * [OCGumbo] by TracyYih
* C#: [GumboBindings] by Vladimir Zotov
* PHP: [GumboPHP] by Paul Preece
* Perl: [HTML::Gumbo] by Ruslan Zakirov
* Julia: [Gumbo.jl] by James Porter
* C/Libxml: [gumbo-libxml] by Jonathan Tang

[gumbo-query]: https://github.com/lazytiger/gumbo-query
[ruby-gumbo]: https://github.com/nevir/ruby-gumbo
[nokogumbo]: https://github.com/rubys/nokogumbo
[node-gumbo-parser]: https://github.com/karlwestin/node-gumbo-parser
[gumbo-d]: https://github.com/bakkdoor/gumbo-d
[lua-gumbo]: https://github.com/craigbarnes/lua-gumbo
[OCGumbo]: https://github.com/tracy-e/OCGumbo
[ObjectiveGumbo]: https://github.com/programmingthomas/ObjectiveGumbo
[GumboBindings]: https://github.com/rgripper/GumboBindings
[GumboPHP]: https://github.com/BipSync/gumbo
[Gumbo.jl]: https://github.com/porterjamesj/Gumbo.jl
[gumbo-libxml]: https://github.com/nostrademons/gumbo-libxml

[HTML5 parsing algorithm]: http://www.whatwg.org/specs/web-apps/current-work/multipage/#auto-toc-12
[HTML5 spec]: http://www.whatwg.org/specs/web-apps/current-work/multipage/
[html5lib tests]: https://github.com/html5lib/html5lib-tests
[googletest]: https://code.google.com/p/googletest/
[semantic versioning]: http://semver.org/
[HTML::Gumbo]: https://metacpan.org/pod/HTML::Gumbo
