# Purell

Purell is a tiny Go library to normalize URLs. It returns a pure URL. Pure-ell. Sanitizer and all. Yeah, I know...

Based on the [Wikipedia article][wiki] and the [RFC 3986 document][rfc].

[![build status](https://secure.travis-ci.org/PuerkitoBio/purell.png)](http://travis-ci.org/PuerkitoBio/purell)

## Install

`go get github.com/PuerkitoBio/purell`

## Changelog

*    **2016-11-14 (v1.1.0)** : IDN: Conform to RFC 5895: Fold character width (thanks to @beeker1121).
*    **2016-07-27 (v1.0.0)** : Normalize IDN to ASCII (thanks to @zenovich).
*    **2015-02-08** : Add fix for relative paths issue ([PR #5][pr5]) and add fix for unnecessary encoding of reserved characters ([see issue #7][iss7]).
*    **v0.2.0** : Add benchmarks, Attempt IDN support.
*    **v0.1.0** : Initial release.

## Examples

From `example_test.go` (note that in your code, you would import "github.com/PuerkitoBio/purell", and would prefix references to its functions and constants with "purell."):

```go
package purell

import (
  "fmt"
  "net/url"
)

func ExampleNormalizeURLString() {
  if normalized, err := NormalizeURLString("hTTp://someWEBsite.com:80/Amazing%3f/url/",
    FlagLowercaseScheme|FlagLowercaseHost|FlagUppercaseEscapes); err != nil {
    panic(err)
  } else {
    fmt.Print(normalized)
  }
  // Output: http://somewebsite.com:80/Amazing%3F/url/
}

func ExampleMustNormalizeURLString() {
  normalized := MustNormalizeURLString("hTTpS://someWEBsite.com:443/Amazing%fa/url/",
    FlagsUnsafeGreedy)
  fmt.Print(normalized)

  // Output: http://somewebsite.com/Amazing%FA/url
}

func ExampleNormalizeURL() {
  if u, err := url.Parse("Http://SomeUrl.com:8080/a/b/.././c///g?c=3&a=1&b=9&c=0#target"); err != nil {
    panic(err)
  } else {
    normalized := NormalizeURL(u, FlagsUsuallySafeGreedy|FlagRemoveDuplicateSlashes|FlagRemoveFragment)
    fmt.Print(normalized)
  }

  // Output: http://someurl.com:8080/a/c/g?c=3&a=1&b=9&c=0
}
```

## API

As seen in the examples above, purell offers three functions: `NormalizeURLString(string, NormalizationFlags) (string, error)`, `MustNormalizeURLString(string, NormalizationFlags) (string)` and `NormalizeURL(*url.URL, NormalizationFlags) (string)`. They all normalize the provided URL based on the specified flags. Here are the available flags:

```go
const (
	// Safe normalizations
	FlagLowercaseScheme           NormalizationFlags = 1 << iota // HTTP://host -> http://host, applied by default in Go1.1
	FlagLowercaseHost                                            // http://HOST -> http://host
	FlagUppercaseEscapes                                         // http://host/t%ef -> http://host/t%EF
	FlagDecodeUnnecessaryEscapes                                 // http://host/t%41 -> http://host/tA
	FlagEncodeNecessaryEscapes                                   // http://host/!"#$ -> http://host/%21%22#$
	FlagRemoveDefaultPort                                        // http://host:80 -> http://host
	FlagRemoveEmptyQuerySeparator                                // http://host/path? -> http://host/path

	// Usually safe normalizations
	FlagRemoveTrailingSlash // http://host/path/ -> http://host/path
	FlagAddTrailingSlash    // http://host/path -> http://host/path/ (should choose only one of these add/remove trailing slash flags)
	FlagRemoveDotSegments   // http://host/path/./a/b/../c -> http://host/path/a/c

	// Unsafe normalizations
	FlagRemoveDirectoryIndex   // http://host/path/index.html -> http://host/path/
	FlagRemoveFragment         // http://host/path#fragment -> http://host/path
	FlagForceHTTP              // https://host -> http://host
	FlagRemoveDuplicateSlashes // http://host/path//a///b -> http://host/path/a/b
	FlagRemoveWWW              // http://www.host/ -> http://host/
	FlagAddWWW                 // http://host/ -> http://www.host/ (should choose only one of these add/remove WWW flags)
	FlagSortQuery              // http://host/path?c=3&b=2&a=1&b=1 -> http://host/path?a=1&b=1&b=2&c=3

	// Normalizations not in the wikipedia article, required to cover tests cases
	// submitted by jehiah
	FlagDecodeDWORDHost           // http://1113982867 -> http://66.102.7.147
	FlagDecodeOctalHost           // http://0102.0146.07.0223 -> http://66.102.7.147
	FlagDecodeHexHost             // http://0x42660793 -> http://66.102.7.147
	FlagRemoveUnnecessaryHostDots // http://.host../path -> http://host/path
	FlagRemoveEmptyPortSeparator  // http://host:/path -> http://host/path

	// Convenience set of safe normalizations
	FlagsSafe NormalizationFlags = FlagLowercaseHost | FlagLowercaseScheme | FlagUppercaseEscapes | FlagDecodeUnnecessaryEscapes | FlagEncodeNecessaryEscapes | FlagRemoveDefaultPort | FlagRemoveEmptyQuerySeparator

	// For convenience sets, "greedy" uses the "remove trailing slash" and "remove www. prefix" flags,
	// while "non-greedy" uses the "add (or keep) the trailing slash" and "add www. prefix".

	// Convenience set of usually safe normalizations (includes FlagsSafe)
	FlagsUsuallySafeGreedy    NormalizationFlags = FlagsSafe | FlagRemoveTrailingSlash | FlagRemoveDotSegments
	FlagsUsuallySafeNonGreedy NormalizationFlags = FlagsSafe | FlagAddTrailingSlash | FlagRemoveDotSegments

	// Convenience set of unsafe normalizations (includes FlagsUsuallySafe)
	FlagsUnsafeGreedy    NormalizationFlags = FlagsUsuallySafeGreedy | FlagRemoveDirectoryIndex | FlagRemoveFragment | FlagForceHTTP | FlagRemoveDuplicateSlashes | FlagRemoveWWW | FlagSortQuery
	FlagsUnsafeNonGreedy NormalizationFlags = FlagsUsuallySafeNonGreedy | FlagRemoveDirectoryIndex | FlagRemoveFragment | FlagForceHTTP | FlagRemoveDuplicateSlashes | FlagAddWWW | FlagSortQuery

	// Convenience set of all available flags
	FlagsAllGreedy    = FlagsUnsafeGreedy | FlagDecodeDWORDHost | FlagDecodeOctalHost | FlagDecodeHexHost | FlagRemoveUnnecessaryHostDots | FlagRemoveEmptyPortSeparator
	FlagsAllNonGreedy = FlagsUnsafeNonGreedy | FlagDecodeDWORDHost | FlagDecodeOctalHost | FlagDecodeHexHost | FlagRemoveUnnecessaryHostDots | FlagRemoveEmptyPortSeparator
)
```

For convenience, the flag sets `FlagsSafe`, `FlagsUsuallySafe[Greedy|NonGreedy]`, `FlagsUnsafe[Greedy|NonGreedy]` and `FlagsAll[Greedy|NonGreedy]` are provided for the similarly grouped normalizations on [Wikipedia's URL normalization page][wiki]. You can add (using the bitwise OR `|` operator) or remove (using the bitwise AND NOT `&^` operator) individual flags from the sets if required, to build your own custom set.
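
To make the add/remove mechanics concrete, here is a minimal sketch of building a custom set. So that it runs without the library, it declares a few local stand-in constants that only mimic the shape of purell's flags (the bit values and the reduced `FlagsSafe` are assumptions for illustration, not purell's real definitions); `has` and `customSet` are hypothetical helpers.

```go
package main

import "fmt"

// NormalizationFlags and the constants below are local stand-ins so the
// sketch compiles on its own. They are NOT purell's actual values.
type NormalizationFlags uint

const (
	FlagLowercaseScheme NormalizationFlags = 1 << iota
	FlagLowercaseHost
	FlagRemoveDefaultPort
	FlagRemoveFragment

	// Reduced stand-in for a convenience set.
	FlagsSafe = FlagLowercaseScheme | FlagLowercaseHost | FlagRemoveDefaultPort
)

// has reports whether every bit of flag is set in set.
func has(set, flag NormalizationFlags) bool {
	return set&flag == flag
}

// customSet starts from a convenience set, adds one flag with bitwise OR
// and removes another with bitwise AND NOT.
func customSet() NormalizationFlags {
	custom := FlagsSafe | FlagRemoveFragment // add a flag
	custom &^= FlagRemoveDefaultPort         // remove a flag
	return custom
}

func main() {
	c := customSet()
	fmt.Println(has(c, FlagRemoveFragment))    // true
	fmt.Println(has(c, FlagRemoveDefaultPort)) // false
	fmt.Println(has(c, FlagLowercaseHost))     // true
}
```

With the real package, the same expressions would be written against the exported names, for example `purell.FlagsSafe | purell.FlagRemoveFragment`.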

The [full godoc reference is available on gopkgdoc][godoc].

Some things to note:

*    `FlagDecodeUnnecessaryEscapes`, `FlagEncodeNecessaryEscapes`, `FlagUppercaseEscapes` and `FlagRemoveEmptyQuerySeparator` are always implicitly set, because internally the URL string is parsed as a URL object, which automatically decodes unnecessary escapes, uppercases and encodes necessary ones, and removes empty query separators (an unnecessary `?` at the end of the URL). These operations therefore cannot be turned off. For this reason, `FlagRemoveEmptyQuerySeparator` (as well as the other three) has been included in the `FlagsSafe` convenience set, instead of `FlagsUnsafe`, where Wikipedia puts it.

*    `FlagDecodeUnnecessaryEscapes` decodes the following escapes (*from -> to*):
    -    %24 -> $
    -    %26 -> &
    -    %2B-%3B -> +,-./0123456789:;
    -    %3D -> =
    -    %40-%5A -> @ABCDEFGHIJKLMNOPQRSTUVWXYZ
    -    %5F -> _
    -    %61-%7A -> abcdefghijklmnopqrstuvwxyz
    -    %7E -> ~


*    When the `NormalizeURL` function is used (passing a URL object), the source URL object is modified: after the call, it reflects the normalization.

*    The *replace IP with domain name* normalization (`http://208.77.188.166/ -> http://www.example.com/`) is not possible for a library without making network requests, so it is not implemented in purell.

*    The *remove unused query string parameters* and *remove default query parameters* normalizations are also not implemented, since they are very case-specific and quite trivial to do with a URL object.
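
As an illustration of the last note, here is one way the query-parameter cleanup could be done with only the standard library's `net/url`. This is a sketch, not part of purell: `dropParams` is a hypothetical helper, and the parameter names in the example are made up.

```go
package main

import (
	"fmt"
	"net/url"
)

// dropParams removes the named query parameters from rawURL.
// Note that url.Values.Encode writes the remaining keys in sorted order.
func dropParams(rawURL string, names ...string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	q := u.Query()
	for _, name := range names {
		q.Del(name)
	}
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	out, err := dropParams("http://host/path?utm_source=feed&id=7&ref=home", "utm_source", "ref")
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // http://host/path?id=7
}
```

Which parameters count as unused or default depends entirely on the application, which is why a fixed flag would not fit well here.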

### Safe vs Usually Safe vs Unsafe

Purell allows you to control the level of risk you take while normalizing a URL. You can aggressively normalize, play it totally safe, or anything in between.

Consider the following URL:

`HTTPS://www.RooT.com/toto/t%45%1f///a/./b/../c/?z=3&w=2&a=4&w=1#invalid`

Normalizing with `FlagsSafe` gives:

`https://www.root.com/toto/tE%1F///a/./b/../c/?z=3&w=2&a=4&w=1#invalid`

With `FlagsUsuallySafeGreedy`:

`https://www.root.com/toto/tE%1F///a/c?z=3&w=2&a=4&w=1#invalid`

And with `FlagsUnsafeGreedy`:

`http://root.com/toto/tE%1F/a/c?a=4&w=1&w=2&z=3`
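
One visible change in the last result is the reordered query string. As a rough illustration of that single step, the sketch below sorts query keys and, for repeated keys, their values, using only the standard library. It is not purell's implementation; `sortQuery` is a hypothetical helper written for this example.

```go
package main

import (
	"fmt"
	"net/url"
	"sort"
	"strings"
)

// sortQuery returns rawQuery with its keys sorted and, for repeated keys,
// the values sorted as well.
func sortQuery(rawQuery string) (string, error) {
	values, err := url.ParseQuery(rawQuery)
	if err != nil {
		return "", err
	}
	keys := make([]string, 0, len(values))
	for k := range values {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var parts []string
	for _, k := range keys {
		vs := values[k]
		sort.Strings(vs)
		for _, v := range vs {
			parts = append(parts, url.QueryEscape(k)+"="+url.QueryEscape(v))
		}
	}
	return strings.Join(parts, "&"), nil
}

func main() {
	sorted, err := sortQuery("z=3&w=2&a=4&w=1")
	if err != nil {
		panic(err)
	}
	fmt.Println(sorted) // a=4&w=1&w=2&z=3
}
```

Reordering is classed as unsafe because a server is free to treat parameter order as significant.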

## TODOs

*    Add a class/default instance to allow specifying custom directory index names? At the moment, removing directory index removes `(^|/)((?:default|index)\.\w{1,4})$`.
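
To show which paths that pattern catches, here is a small sketch that applies it with the standard `regexp` package. Replacing the match with `$1` (so the leading slash is kept) is an assumption about how the pattern is meant to be used, and `stripDirIndex` is a hypothetical helper, not a purell function.

```go
package main

import (
	"fmt"
	"regexp"
)

// dirIndex is the pattern quoted in the TODO above.
var dirIndex = regexp.MustCompile(`(^|/)((?:default|index)\.\w{1,4})$`)

// stripDirIndex drops a trailing default.* or index.* file name from path,
// keeping the slash that precedes it (capture group 1).
func stripDirIndex(path string) string {
	return dirIndex.ReplaceAllString(path, "$1")
}

func main() {
	paths := []string{
		"/path/index.html",   // matched: becomes /path/
		"/default.aspx",      // matched: becomes /
		"/path/myindex.html", // not matched: name must start right after a slash
		"/path/home.html",    // not matched: other file names are left alone
	}
	for _, p := range paths {
		fmt.Printf("%s -> %s\n", p, stripDirIndex(p))
	}
}
```

A site that serves its directory index under another name, such as `home.html`, is not covered by the fixed pattern, which is the gap the TODO describes.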

## Thanks / Contributions

@rogpeppe
@jehiah
@opennota
@pchristopher1275
@zenovich
@beeker1121

## License

The [BSD 3-Clause license][bsd].

[bsd]: http://opensource.org/licenses/BSD-3-Clause
[wiki]: http://en.wikipedia.org/wiki/URL_normalization
[rfc]: http://tools.ietf.org/html/rfc3986#section-6
[godoc]: http://go.pkgdoc.org/github.com/PuerkitoBio/purell
[pr5]: https://github.com/PuerkitoBio/purell/pull/5
[iss7]: https://github.com/PuerkitoBio/purell/issues/7