# bluemonday [![Build Status](https://travis-ci.org/microcosm-cc/bluemonday.svg?branch=master)](https://travis-ci.org/microcosm-cc/bluemonday) [![GoDoc](https://godoc.org/github.com/microcosm-cc/bluemonday?status.png)](https://godoc.org/github.com/microcosm-cc/bluemonday) [![Sourcegraph](https://sourcegraph.com/github.com/microcosm-cc/bluemonday/-/badge.svg)](https://sourcegraph.com/github.com/microcosm-cc/bluemonday?badge)

bluemonday is an HTML sanitizer implemented in Go. It is fast and highly configurable.

bluemonday takes untrusted user generated content as an input, and will return HTML that has been sanitised against a whitelist of approved HTML elements and attributes so that you can safely include the content in your web page.

If you accept user generated content, and your server uses Go, you **need** bluemonday.

The default policy for user generated content (`bluemonday.UGCPolicy().Sanitize()`) turns this:
```html
Hello <STYLE>.XSS{background-image:url("javascript:alert('XSS')");}</STYLE><A CLASS=XSS></A>World
```

Into a harmless:
```html
Hello World
```

And it turns this:
```html
<a href="javascript:alert('XSS1')" onmouseover="alert('XSS2')">XSS<a>
```

Into this:
```html
XSS
```

Whilst still allowing this:
```html
<a href="http://www.google.com/">
    <img src="https://ssl.gstatic.com/accounts/ui/logo_2x.png"/>
</a>
```

To pass through mostly unaltered (it gained a `rel="nofollow"`, which is a good thing for user generated content):
```html
<a href="http://www.google.com/" rel="nofollow">
    <img src="https://ssl.gstatic.com/accounts/ui/logo_2x.png"/>
</a>
```

It protects sites from [XSS](http://en.wikipedia.org/wiki/Cross-site_scripting) attacks. There are many [vectors for an XSS attack](https://www.owasp.org/index.php/XSS_Filter_Evasion_Cheat_Sheet) and the best way to mitigate the risk is to sanitize user input against a known safe list of HTML elements and attributes.

You should **always** run bluemonday **after** any other processing.

If you use [blackfriday](https://github.com/russross/blackfriday) or [Pandoc](http://johnmacfarlane.net/pandoc/) then bluemonday should be run after these steps. This ensures that no insecure HTML is introduced later in your process.

bluemonday is heavily inspired by both the [OWASP Java HTML Sanitizer](https://code.google.com/p/owasp-java-html-sanitizer/) and the [HTML Purifier](http://htmlpurifier.org/).

## Technical Summary

bluemonday is whitelist based: you either build a policy describing the HTML elements and attributes to permit (and the `regexp` patterns of attributes), or use one of the supplied policies representing good defaults.

The policy containing the whitelist is applied using a fast non-validating, forward only, token-based parser implemented in the [Go net/html library](https://godoc.org/golang.org/x/net/html) by the core Go team.

We expect to be supplied with well-formatted HTML (closing elements for every applicable open element, nested correctly) and so we do not focus on repairing badly nested or incomplete HTML. We focus on simply ensuring that whatever elements do exist are described in the policy whitelist and that attributes and links are safe for use on your web page. [GIGO](http://en.wikipedia.org/wiki/Garbage_in,_garbage_out) does apply: if you feed it bad HTML, bluemonday is not tasked with figuring out how to make it good again.

### Supported Go Versions

bluemonday is tested against Go 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 1.10, 1.11, 1.12, and tip.

We do not support Go 1.0 as we depend on `golang.org/x/net/html` which includes a reference to `io.ErrNoProgress` which did not exist in Go 1.0.

We support Go 1.1 but Travis no longer tests against it.

## Is it production ready?

*Yes*

We are using bluemonday in production having migrated from the widely used and heavily field tested OWASP Java HTML Sanitizer.

We are passing our extensive test suite (including AntiSamy tests as well as tests for any issues raised). Check for any [unresolved issues](https://github.com/microcosm-cc/bluemonday/issues?page=1&state=open) to see whether anything may be a blocker for you.

We invite pull requests and issues to help us ensure we are offering comprehensive protection against various attacks via user generated content.

## Usage

Install in your `${GOPATH}` using `go get -u github.com/microcosm-cc/bluemonday`

Then call it:
```go
package main

import (
	"fmt"

	"github.com/microcosm-cc/bluemonday"
)

func main() {
	// Do this once for each unique policy, and use the policy for the life of the program
	// Policy creation/editing is not safe to use in multiple goroutines
	p := bluemonday.UGCPolicy()

	// The policy can then be used to sanitize lots of input and it is safe to use the policy in multiple goroutines
	html := p.Sanitize(
		`<a onblur="alert(secret)" href="http://www.google.com">Google</a>`,
	)

	// Output:
	// <a href="http://www.google.com" rel="nofollow">Google</a>
	fmt.Println(html)
}
```

We offer three ways to call Sanitize:
```go
p.Sanitize(string) string
p.SanitizeBytes([]byte) []byte
p.SanitizeReader(io.Reader) *bytes.Buffer
```

If you are obsessed with performance, `p.SanitizeReader(r).Bytes()` will return a `[]byte` without performing any unnecessary casting of the inputs or outputs, though the difference is so negligible that you should never need to care.

You can build your own policies:
```go
package main

import (
	"fmt"

	"github.com/microcosm-cc/bluemonday"
)

func main() {
	p := bluemonday.NewPolicy()

	// Require URLs to be parseable by net/url.Parse and either:
	//   mailto: http:// or https://
	p.AllowStandardURLs()

	// We only allow <p> and <a href="">
	p.AllowAttrs("href").OnElements("a")
	p.AllowElements("p")

	html := p.Sanitize(
		`<a onblur="alert(secret)" href="http://www.google.com">Google</a>`,
	)

	// Output:
	// <a href="http://www.google.com">Google</a>
	fmt.Println(html)
}
```

We ship two default policies:

1. `bluemonday.StrictPolicy()` which can be thought of as equivalent to stripping all HTML elements and their attributes as it has nothing on its whitelist. An example usage scenario would be blog post titles where HTML tags are not expected at all and if they are then the elements *and* the content of the elements should be stripped. This is a *very* strict policy.
2. `bluemonday.UGCPolicy()` which allows a broad selection of HTML elements and attributes that are safe for user generated content. Note that this policy does *not* whitelist iframes, object, embed, styles, script, etc. An example usage scenario would be blog post bodies where a variety of formatting is expected along with the potential for TABLEs and IMGs.

## Policy Building

The essence of building a policy is to determine which HTML elements and attributes are considered safe for your scenario. OWASP provide an [XSS prevention cheat sheet](https://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet) to help explain the risks, but essentially:

1. Avoid anything other than the standard HTML elements
1. Avoid `script`, `style`, `iframe`, `object`, `embed`, `base` elements that allow code to be executed by the client or third party content to be included that can execute code
1. Avoid anything other than plain HTML attributes with values matched to a regexp

Basically, you should be able to describe what HTML is fine for your scenario. If you do not have confidence that you can describe your policy please consider using one of the shipped policies such as `bluemonday.UGCPolicy()`.

To create a new policy:
```go
p := bluemonday.NewPolicy()
```

To add elements to a policy either add just the elements:
```go
p.AllowElements("b", "strong")
```

Or using a regex:

_Note: if an element is added by name as shown above, any matching regex will be ignored_

It is also recommended to ensure multiple patterns don't overlap as order of execution is not guaranteed and can result in some rules being missed.
```go
p.AllowElementsMatching(regexp.MustCompile(`^my-element-`))
```

Or add elements by virtue of adding an attribute:
```go
// Not the recommended pattern, see the recommendation on using .Matching() below
p.AllowAttrs("nowrap").OnElements("td", "th")
```

Again, this also supports a regex pattern match alternative:
```go
p.AllowAttrs("nowrap").OnElementsMatching(regexp.MustCompile(`^my-element-`))
```

Attributes can either be added to all elements:
```go
// Anchor the pattern so the whole value must match, not just part of it
p.AllowAttrs("dir").Matching(regexp.MustCompile("(?i)^(rtl|ltr)$")).Globally()
```

Or attributes can be added to specific elements:
```go
// Not the recommended pattern, see the recommendation on using .Matching() below
p.AllowAttrs("value").OnElements("li")
```

It is **always** recommended that an attribute be made to match a pattern. XSS in HTML attributes is very easy otherwise:
```go
// \p{L} matches unicode letters, \p{N} matches unicode numbers
// The ^ and $ anchors ensure the entire value is made up of permitted characters
p.AllowAttrs("title").Matching(regexp.MustCompile(`^[\p{L}\p{N}\s\-_',:\[\]!\./\\\(\)&]*$`)).Globally()
```

You can stop at any time and call .Sanitize():
```go
// string htmlIn passed in from an HTTP POST
htmlOut := p.Sanitize(htmlIn)
```

And you can take any existing policy and extend it:
```go
p := bluemonday.UGCPolicy()
p.AllowElements("fieldset", "select", "option")
```

### Inline CSS

Although it's possible to handle inline CSS using `AllowAttrs` with a `Matching` rule, writing a single monolithic regular expression to safely process all inline CSS which you wish to allow is not a trivial task. Instead of attempting to do so, you can whitelist the `style` attribute on whichever element(s) you desire and use style policies to control and sanitize inline styles.

It is suggested that you use `Matching` (with a suitable regular expression),
`MatchingEnum`, or `MatchingHandler` to ensure each style matches your needs,
but default handlers are supplied for most widely used styles.

Similar to attributes, you can allow specific CSS properties to be set inline:
```go
p.AllowAttrs("style").OnElements("span", "p")
// Allow the 'color' property with valid RGB(A) hex values only (on any element allowed a 'style' attribute)
p.AllowStyles("color").Matching(regexp.MustCompile("(?i)^#([0-9a-f]{3,4}|[0-9a-f]{6}|[0-9a-f]{8})$")).Globally()
```

Additionally, you can allow a CSS property to be set only to an allowed value:
```go
p.AllowAttrs("style").OnElements("span", "p")
// Allow the 'text-decoration' property to be set to 'underline', 'line-through' or 'none'
// on 'span' elements only
p.AllowStyles("text-decoration").MatchingEnum("underline", "line-through", "none").OnElements("span")
```

Or you can specify elements based on a regex pattern match:
```go
p.AllowAttrs("style").OnElementsMatching(regexp.MustCompile(`^my-element-`))
// Allow the 'text-decoration' property to be set to 'underline', 'line-through' or 'none'
// on elements whose names match the pattern only
p.AllowStyles("text-decoration").MatchingEnum("underline", "line-through", "none").OnElementsMatching(regexp.MustCompile(`^my-element-`))
```

If you need more specific checking, you can create a handler that takes in a string and returns a bool to
validate the values for a given property. The string parameter has been
converted to lowercase and unicode code points have been converted.
```go
myHandler := func(value string) bool {
	return true
}
p.AllowAttrs("style").OnElements("span", "p")
// Allow the 'color' property with values validated by the handler (on any element allowed a 'style' attribute)
p.AllowStyles("color").MatchingHandler(myHandler).Globally()
```

### Links

Links are difficult beasts to sanitise safely and also one of the biggest attack vectors for malicious content.
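
As background for the link options in this section, the sketch below uses only Go's standard `net/url` package (not bluemonday itself) to show how a few `href` values are seen once parsed. The helper `describeHref` is hypothetical. Note that a `javascript:` URL parses without error and simply reports `javascript` as its scheme, while `//www.google.com` and `../home.html` both have an empty scheme, which is why parseability, relative URLs and permitted schemes are separate concerns.

```go
package main

import (
	"fmt"
	"net/url"
)

// describeHref (a hypothetical helper) reports how net/url sees an href:
// "unparseable" on a parse error, "relative" when there is no scheme,
// otherwise the lower-cased scheme.
func describeHref(href string) string {
	u, err := url.Parse(href)
	if err != nil {
		return "unparseable"
	}
	if u.Scheme == "" {
		return "relative"
	}
	return "scheme=" + u.Scheme
}

func main() {
	hrefs := []string{
		"http://www.google.com/",
		"//www.google.com",
		"../home.html",
		"mailto:someone@example.com",
		"javascript:alert('XSS1')",
	}
	for _, href := range hrefs {
		fmt.Printf("%-30q %s\n", href, describeHref(href))
	}
}
```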

It is possible to do this:
```go
p.AllowAttrs("href").Matching(regexp.MustCompile(`(?i)mailto|https?`)).OnElements("a")
```

But that will not protect you: the regular expression is not sufficient to prevent a malformed value from doing something unexpected.

We provide some additional global options for safely working with links.

`RequireParseableURLs` will ensure that URLs are parseable by Go's `net/url` package:
```go
p.RequireParseableURLs(true)
```

If you have enabled parseable URLs then the following option will `AllowRelativeURLs`. By default this is disabled (bluemonday is a whitelist tool... you need to explicitly tell us to permit things) and when disabled it will prevent all local and scheme relative URLs (i.e. `href="localpage.html"`, `href="../home.html"` and even `href="//www.google.com"` are relative):
```go
p.AllowRelativeURLs(true)
```

If you have enabled parseable URLs then you can whitelist the schemes (commonly called protocol when thinking of `http` and `https`) that are permitted. Bear in mind that allowing relative URLs in the above option will allow for a blank scheme:
```go
p.AllowURLSchemes("mailto", "http", "https")
```

Regardless of whether you have enabled parseable URLs, you can force all URLs to have a `rel="nofollow"` attribute. This will be added if it does not exist, but only when the `href` is valid:
```go
// This applies to "a" "area" "link" elements that have a "href" attribute
p.RequireNoFollowOnLinks(true)
```

Similarly, you can force all URLs to have "noreferrer" in their rel attribute.
```go
// This applies to "a" "area" "link" elements that have a "href" attribute
p.RequireNoReferrerOnLinks(true)
```

We provide a convenience method that applies all of the above, but you will still need to whitelist the linkable elements for the URL rules to be applied to:
```go
p.AllowStandardURLs()
p.AllowAttrs("cite").OnElements("blockquote", "q")
p.AllowAttrs("href").OnElements("a", "area")
p.AllowAttrs("src").OnElements("img")
```

An additional complexity regarding links is the data URI as defined in [RFC2397](http://tools.ietf.org/html/rfc2397). The data URI allows for images to be served inline using this format:

```html
<img src="data:image/webp;base64,UklGRh4AAABXRUJQVlA4TBEAAAAvAAAAAAfQ//73v/+BiOh/AAA=">
```

We have provided a helper to verify the mimetype followed by base64 content of data URI links:

```go
p.AllowDataURIImages()
```

That helper will enable GIF, JPEG, PNG and WEBP images.

It should be noted that there is a potential [security](http://palizine.plynt.com/issues/2010Oct/bypass-xss-filters/) [risk](https://capec.mitre.org/data/definitions/244.html) with the use of data URI links. You should only enable data URI links if you already trust the content.

We also have some features to help deal with user generated content:
```go
p.AddTargetBlankToFullyQualifiedLinks(true)
```

This will ensure that anchor `<a href="" />` links that are fully qualified (the href destination includes a host name) will get `target="_blank"` added to them.

Additionally any link that has `target="_blank"` after the policy has been applied will also have the `rel` attribute adjusted to add `noopener`. This means a link may start like `<a href="//host/path"/>` and will end up as `<a href="//host/path" rel="noopener" target="_blank">`. It is important to note that the addition of `noopener` is a security feature and not an issue. There is an unfortunate feature of browsers that a browser window opened as a result of `target="_blank"` can still control the opener (your web page) and this protects against that. The background to this can be found here: [https://dev.to/ben/the-targetblank-vulnerability-by-example](https://dev.to/ben/the-targetblank-vulnerability-by-example)

### Policy Building Helpers

We also bundle some helpers to simplify policy building:
```go
// Permits the "dir", "id", "lang", "title" attributes globally
p.AllowStandardAttributes()

// Permits the "img" element and its standard attributes
p.AllowImages()

// Permits ordered and unordered lists, and also definition lists
p.AllowLists()

// Permits HTML tables and all applicable elements and non-styling attributes
p.AllowTables()
```

### Invalid Instructions

The following are invalid:
```go
// This does not say where the attributes are allowed, you need to add
// .Globally() or .OnElements(...)
// This will be ignored without error.
p.AllowAttrs("value")

// This does not say where the attributes are allowed, you need to add
// .Globally() or .OnElements(...)
// This will be ignored without error.
p.AllowAttrs(
	"type",
).Matching(
	regexp.MustCompile("(?i)^(circle|disc|square|a|A|i|I|1)$"),
)
```

Both examples exhibit the same issue: they declare attributes but do not then specify whether they are whitelisted globally or only on specific elements (and which elements). Attributes belong to one or more elements, and the policy needs to declare this.

## Limitations

We are not yet including any tools to help whitelist and sanitize CSS, which means that unless you wish to do the heavy lifting in a single regular expression (inadvisable), **you should not allow the "style" attribute anywhere**.

It is not the job of bluemonday to fix your bad HTML; it is merely the job of bluemonday to prevent malicious HTML getting through. If you have mismatched HTML elements, or non-conforming nesting of elements, those will remain. But if you have well-structured HTML bluemonday will not break it.

## TODO

* Investigate whether devs want to blacklist elements and attributes. This would allow devs to take an existing policy (such as the `bluemonday.UGCPolicy()`) that encapsulates 90% of what they're looking for but does more than they need, and to remove the extra things they do not want to make it 100% what they want
* Investigate whether devs want a validating HTML mode, in which the HTML elements are not just transformed into a balanced tree (every start tag has a closing tag at the correct depth) but also that elements and character data appear only in their allowed context (i.e. that a `table` element isn't a descendant of a `caption`, that `colgroup`, `thead`, `tbody`, `tfoot` and `tr` are permitted, and that character data is not permitted)

## Development

If you have cloned this repo you will probably need the dependency:

`go get golang.org/x/net/html`

Gophers can use their familiar tools:

`go build`

`go test`

I personally use a Makefile as it spares typing the same args over and over whilst providing consistency for those of us who jump from language to language and enjoy just typing `make` in a project directory and watching magic happen.

`make` will build, vet, test and install the library.

`make clean` will remove the library from a *single* `${GOPATH}/pkg` directory tree

`make test` will run the tests

`make cover` will run the tests and *open a browser window* with the coverage report

`make lint` will run golint (install via `go get github.com/golang/lint/golint`)

## Long term goals

1. Open the code to adversarial peer review similar to the [Attack Review Ground Rules](https://code.google.com/p/owasp-java-html-sanitizer/wiki/AttackReviewGroundRules)
1. Raise funds and pay for an external security review