# Purell

Purell is a tiny Go library to normalize URLs. It returns a pure URL. Pure-ell. Sanitizer and all. Yeah, I know...

Based on the [Wikipedia article][wiki] and the [RFC 3986 document][rfc].

[![build status](https://travis-ci.org/PuerkitoBio/purell.svg?branch=master)](http://travis-ci.org/PuerkitoBio/purell)

## Install

`go get github.com/PuerkitoBio/purell`

## Changelog

* **v1.1.1** : Fix failing test due to Go1.12 changes (thanks to @ianlancetaylor).
* **2016-11-14 (v1.1.0)** : IDN: Conform to RFC 5895: Fold character width (thanks to @beeker1121).
* **2016-07-27 (v1.0.0)** : Normalize IDN to ASCII (thanks to @zenovich).
* **2015-02-08** : Add fix for relative paths issue ([PR #5][pr5]) and add fix for unnecessary encoding of reserved characters ([see issue #7][iss7]).
* **v0.2.0** : Add benchmarks, attempt IDN support.
* **v0.1.0** : Initial release.

## Examples

From `example_test.go` (note that in your code, you would import "github.com/PuerkitoBio/purell", and would prefix references to its functions and constants with "purell."):

```go
package purell

import (
	"fmt"
	"net/url"
)

func ExampleNormalizeURLString() {
	if normalized, err := NormalizeURLString("hTTp://someWEBsite.com:80/Amazing%3f/url/",
		FlagLowercaseScheme|FlagLowercaseHost|FlagUppercaseEscapes); err != nil {
		panic(err)
	} else {
		fmt.Print(normalized)
	}
	// Output: http://somewebsite.com:80/Amazing%3F/url/
}

func ExampleMustNormalizeURLString() {
	normalized := MustNormalizeURLString("hTTpS://someWEBsite.com:443/Amazing%fa/url/",
		FlagsUnsafeGreedy)
	fmt.Print(normalized)

	// Output: http://somewebsite.com/Amazing%FA/url
}

func ExampleNormalizeURL() {
	if u, err := url.Parse("Http://SomeUrl.com:8080/a/b/.././c///g?c=3&a=1&b=9&c=0#target"); err != nil {
		panic(err)
	} else {
		normalized := NormalizeURL(u, FlagsUsuallySafeGreedy|FlagRemoveDuplicateSlashes|FlagRemoveFragment)
		fmt.Print(normalized)
	}

	// Output: http://someurl.com:8080/a/c/g?c=3&a=1&b=9&c=0
}
```

## API

As seen in the examples above, purell offers three functions: `NormalizeURLString(string, NormalizationFlags) (string, error)`, `MustNormalizeURLString(string, NormalizationFlags) (string)` and `NormalizeURL(*url.URL, NormalizationFlags) (string)`. They all normalize the provided URL based on the specified flags. Here are the available flags:

```go
const (
	// Safe normalizations
	FlagLowercaseScheme           NormalizationFlags = 1 << iota // HTTP://host -> http://host, applied by default in Go1.1
	FlagLowercaseHost                                            // http://HOST -> http://host
	FlagUppercaseEscapes                                         // http://host/t%ef -> http://host/t%EF
	FlagDecodeUnnecessaryEscapes                                 // http://host/t%41 -> http://host/tA
	FlagEncodeNecessaryEscapes                                   // http://host/!"#$ -> http://host/%21%22#$
	FlagRemoveDefaultPort                                        // http://host:80 -> http://host
	FlagRemoveEmptyQuerySeparator                                // http://host/path? -> http://host/path

	// Usually safe normalizations
	FlagRemoveTrailingSlash // http://host/path/ -> http://host/path
	FlagAddTrailingSlash    // http://host/path -> http://host/path/ (should choose only one of these add/remove trailing slash flags)
	FlagRemoveDotSegments   // http://host/path/./a/b/../c -> http://host/path/a/c

	// Unsafe normalizations
	FlagRemoveDirectoryIndex   // http://host/path/index.html -> http://host/path/
	FlagRemoveFragment         // http://host/path#fragment -> http://host/path
	FlagForceHTTP              // https://host -> http://host
	FlagRemoveDuplicateSlashes // http://host/path//a///b -> http://host/path/a/b
	FlagRemoveWWW              // http://www.host/ -> http://host/
	FlagAddWWW                 // http://host/ -> http://www.host/ (should choose only one of these add/remove WWW flags)
	FlagSortQuery              // http://host/path?c=3&b=2&a=1&b=1 -> http://host/path?a=1&b=1&b=2&c=3

	// Normalizations not in the wikipedia article, required to cover test cases
	// submitted by jehiah
	FlagDecodeDWORDHost           // http://1113982867 -> http://66.102.7.147
	FlagDecodeOctalHost           // http://0102.0146.07.0223 -> http://66.102.7.147
	FlagDecodeHexHost             // http://0x42660793 -> http://66.102.7.147
	FlagRemoveUnnecessaryHostDots // http://.host../path -> http://host/path
	FlagRemoveEmptyPortSeparator  // http://host:/path -> http://host/path

	// Convenience set of safe normalizations
	FlagsSafe NormalizationFlags = FlagLowercaseHost | FlagLowercaseScheme | FlagUppercaseEscapes | FlagDecodeUnnecessaryEscapes | FlagEncodeNecessaryEscapes | FlagRemoveDefaultPort | FlagRemoveEmptyQuerySeparator

	// For convenience sets, "greedy" uses the "remove trailing slash" and "remove www. prefix" flags,
	// while "non-greedy" uses the "add (or keep) the trailing slash" and "add www. prefix".

	// Convenience set of usually safe normalizations (includes FlagsSafe)
	FlagsUsuallySafeGreedy    NormalizationFlags = FlagsSafe | FlagRemoveTrailingSlash | FlagRemoveDotSegments
	FlagsUsuallySafeNonGreedy NormalizationFlags = FlagsSafe | FlagAddTrailingSlash | FlagRemoveDotSegments

	// Convenience set of unsafe normalizations (includes FlagsUsuallySafe)
	FlagsUnsafeGreedy    NormalizationFlags = FlagsUsuallySafeGreedy | FlagRemoveDirectoryIndex | FlagRemoveFragment | FlagForceHTTP | FlagRemoveDuplicateSlashes | FlagRemoveWWW | FlagSortQuery
	FlagsUnsafeNonGreedy NormalizationFlags = FlagsUsuallySafeNonGreedy | FlagRemoveDirectoryIndex | FlagRemoveFragment | FlagForceHTTP | FlagRemoveDuplicateSlashes | FlagAddWWW | FlagSortQuery

	// Convenience set of all available flags
	FlagsAllGreedy    = FlagsUnsafeGreedy | FlagDecodeDWORDHost | FlagDecodeOctalHost | FlagDecodeHexHost | FlagRemoveUnnecessaryHostDots | FlagRemoveEmptyPortSeparator
	FlagsAllNonGreedy = FlagsUnsafeNonGreedy | FlagDecodeDWORDHost | FlagDecodeOctalHost | FlagDecodeHexHost | FlagRemoveUnnecessaryHostDots | FlagRemoveEmptyPortSeparator
)
```

For convenience, the flag sets `FlagsSafe`, `FlagsUsuallySafe[Greedy|NonGreedy]`, `FlagsUnsafe[Greedy|NonGreedy]` and `FlagsAll[Greedy|NonGreedy]` are provided for the similarly grouped normalizations on [Wikipedia's URL normalization page][wiki]. You can add (using the bitwise OR `|` operator) or remove (using the bitwise AND NOT `&^` operator) individual flags from the sets if required, to build your own custom set.

The [full godoc reference is available on gopkgdoc][godoc].

Some things to note:

* `FlagDecodeUnnecessaryEscapes`, `FlagEncodeNecessaryEscapes`, `FlagUppercaseEscapes` and `FlagRemoveEmptyQuerySeparator` are always implicitly set, because internally, the URL string is parsed as a URL object, which automatically decodes unnecessary escapes, uppercases and encodes necessary ones, and removes empty query separators (an unnecessary `?` at the end of the URL). So this operation cannot **not** be done. For this reason, `FlagRemoveEmptyQuerySeparator` (as well as the other three) has been included in the `FlagsSafe` convenience set, instead of `FlagsUnsafe`, where Wikipedia puts it.

* The `FlagDecodeUnnecessaryEscapes` flag decodes the following escapes (*from -> to*):
    - %24 -> $
    - %26 -> &
    - %2B-%3B -> +,-./0123456789:;
    - %3D -> =
    - %40-%5A -> @ABCDEFGHIJKLMNOPQRSTUVWXYZ
    - %5F -> _
    - %61-%7A -> abcdefghijklmnopqrstuvwxyz
    - %7E -> ~

* When the `NormalizeURL` function is used (passing a URL object), the source URL object is modified (that is, after the call, the URL object reflects the normalization).

* The *replace IP with domain name* normalization (`http://208.77.188.166/ → http://www.example.com/`) is obviously not possible for a library without making some network requests. This is not implemented in purell.

* The *remove unused query string parameters* and *remove default query parameters* normalizations are also not implemented, since they are very case-specific and quite trivial to do with a URL object.

### Safe vs Usually Safe vs Unsafe

Purell allows you to control the level of risk you take while normalizing a URL. You can aggressively normalize, play it totally safe, or anything in between.

Consider the following URL:

`HTTPS://www.RooT.com/toto/t%45%1f///a/./b/../c/?z=3&w=2&a=4&w=1#invalid`

Normalizing with `FlagsSafe` gives:

`https://www.root.com/toto/tE%1F///a/./b/../c/?z=3&w=2&a=4&w=1#invalid`

With `FlagsUsuallySafeGreedy`:

`https://www.root.com/toto/tE%1F///a/c?z=3&w=2&a=4&w=1#invalid`

And with `FlagsUnsafeGreedy`:

`http://root.com/toto/tE%1F/a/c?a=4&w=1&w=2&z=3`

## TODOs

* Add a class/default instance to allow specifying custom directory index names? At the moment, removing directory index removes `(^|/)((?:default|index)\.\w{1,4})$`.

## Thanks / Contributions

@rogpeppe
@jehiah
@opennota
@pchristopher1275
@zenovich
@beeker1121

## License

The [BSD 3-Clause license][bsd].

[bsd]: http://opensource.org/licenses/BSD-3-Clause
[wiki]: http://en.wikipedia.org/wiki/URL_normalization
[rfc]: http://tools.ietf.org/html/rfc3986#section-6
[godoc]: http://go.pkgdoc.org/github.com/PuerkitoBio/purell
[pr5]: https://github.com/PuerkitoBio/purell/pull/5
[iss7]: https://github.com/PuerkitoBio/purell/issues/7