# Purell

Purell is a tiny Go library to normalize URLs. It returns a pure URL. Pure-ell. Sanitizer and all. Yeah, I know...

Based on the [Wikipedia article][wiki] and the [RFC 3986 document][rfc].

[![build status](https://travis-ci.org/PuerkitoBio/purell.svg?branch=master)](http://travis-ci.org/PuerkitoBio/purell)

## Install

`go get github.com/PuerkitoBio/purell`

## Changelog

*    **v1.1.1** : Fix failing test due to Go1.12 changes (thanks to @ianlancetaylor).
*    **2016-11-14 (v1.1.0)** : IDN: Conform to RFC 5895: Fold character width (thanks to @beeker1121).
*    **2016-07-27 (v1.0.0)** : Normalize IDN to ASCII (thanks to @zenovich).
*    **2015-02-08** : Add fix for relative paths issue ([PR #5][pr5]) and add fix for unnecessary encoding of reserved characters ([see issue #7][iss7]).
*    **v0.2.0** : Add benchmarks, Attempt IDN support.
*    **v0.1.0** : Initial release.

## Examples

From `example_test.go` (note that in your code, you would import "github.com/PuerkitoBio/purell", and would prefix references to its methods and constants with "purell."):

```go
package purell

import (
  "fmt"
  "net/url"
)

func ExampleNormalizeURLString() {
  if normalized, err := NormalizeURLString("hTTp://someWEBsite.com:80/Amazing%3f/url/",
    FlagLowercaseScheme|FlagLowercaseHost|FlagUppercaseEscapes); err != nil {
    panic(err)
  } else {
    fmt.Print(normalized)
  }
  // Output: http://somewebsite.com:80/Amazing%3F/url/
}

func ExampleMustNormalizeURLString() {
  normalized := MustNormalizeURLString("hTTpS://someWEBsite.com:443/Amazing%fa/url/",
    FlagsUnsafeGreedy)
  fmt.Print(normalized)

  // Output: http://somewebsite.com/Amazing%FA/url
}

func ExampleNormalizeURL() {
  if u, err := url.Parse("Http://SomeUrl.com:8080/a/b/.././c///g?c=3&a=1&b=9&c=0#target"); err != nil {
    panic(err)
  } else {
    normalized := NormalizeURL(u, FlagsUsuallySafeGreedy|FlagRemoveDuplicateSlashes|FlagRemoveFragment)
    fmt.Print(normalized)
  }

  // Output: http://someurl.com:8080/a/c/g?c=3&a=1&b=9&c=0
}
```

## API

As seen in the examples above, purell offers three functions: `NormalizeURLString(string, NormalizationFlags) (string, error)`, `MustNormalizeURLString(string, NormalizationFlags) (string)` and `NormalizeURL(*url.URL, NormalizationFlags) (string)`. They all normalize the provided URL based on the specified flags. Here are the available flags:

```go
const (
	// Safe normalizations
	FlagLowercaseScheme           NormalizationFlags = 1 << iota // HTTP://host -> http://host, applied by default in Go1.1
	FlagLowercaseHost                                            // http://HOST -> http://host
	FlagUppercaseEscapes                                         // http://host/t%ef -> http://host/t%EF
	FlagDecodeUnnecessaryEscapes                                 // http://host/t%41 -> http://host/tA
	FlagEncodeNecessaryEscapes                                   // http://host/!"#$ -> http://host/%21%22#$
	FlagRemoveDefaultPort                                        // http://host:80 -> http://host
	FlagRemoveEmptyQuerySeparator                                // http://host/path? -> http://host/path

	// Usually safe normalizations
	FlagRemoveTrailingSlash // http://host/path/ -> http://host/path
	FlagAddTrailingSlash    // http://host/path -> http://host/path/ (should choose only one of these add/remove trailing slash flags)
	FlagRemoveDotSegments   // http://host/path/./a/b/../c -> http://host/path/a/c

	// Unsafe normalizations
	FlagRemoveDirectoryIndex   // http://host/path/index.html -> http://host/path/
	FlagRemoveFragment         // http://host/path#fragment -> http://host/path
	FlagForceHTTP              // https://host -> http://host
	FlagRemoveDuplicateSlashes // http://host/path//a///b -> http://host/path/a/b
	FlagRemoveWWW              // http://www.host/ -> http://host/
	FlagAddWWW                 // http://host/ -> http://www.host/ (should choose only one of these add/remove WWW flags)
	FlagSortQuery              // http://host/path?c=3&b=2&a=1&b=1 -> http://host/path?a=1&b=1&b=2&c=3

	// Normalizations not in the wikipedia article, required to cover tests cases
	// submitted by jehiah
	FlagDecodeDWORDHost           // http://1113982867 -> http://66.102.7.147
	FlagDecodeOctalHost           // http://0102.0146.07.0223 -> http://66.102.7.147
	FlagDecodeHexHost             // http://0x42660793 -> http://66.102.7.147
	FlagRemoveUnnecessaryHostDots // http://.host../path -> http://host/path
	FlagRemoveEmptyPortSeparator  // http://host:/path -> http://host/path

	// Convenience set of safe normalizations
	FlagsSafe NormalizationFlags = FlagLowercaseHost | FlagLowercaseScheme | FlagUppercaseEscapes | FlagDecodeUnnecessaryEscapes | FlagEncodeNecessaryEscapes | FlagRemoveDefaultPort | FlagRemoveEmptyQuerySeparator

	// For convenience sets, "greedy" uses the "remove trailing slash" and "remove www. prefix" flags,
	// while "non-greedy" uses the "add (or keep) the trailing slash" and "add www. prefix".

	// Convenience set of usually safe normalizations (includes FlagsSafe)
	FlagsUsuallySafeGreedy    NormalizationFlags = FlagsSafe | FlagRemoveTrailingSlash | FlagRemoveDotSegments
	FlagsUsuallySafeNonGreedy NormalizationFlags = FlagsSafe | FlagAddTrailingSlash | FlagRemoveDotSegments

	// Convenience set of unsafe normalizations (includes FlagsUsuallySafe)
	FlagsUnsafeGreedy    NormalizationFlags = FlagsUsuallySafeGreedy | FlagRemoveDirectoryIndex | FlagRemoveFragment | FlagForceHTTP | FlagRemoveDuplicateSlashes | FlagRemoveWWW | FlagSortQuery
	FlagsUnsafeNonGreedy NormalizationFlags = FlagsUsuallySafeNonGreedy | FlagRemoveDirectoryIndex | FlagRemoveFragment | FlagForceHTTP | FlagRemoveDuplicateSlashes | FlagAddWWW | FlagSortQuery

	// Convenience set of all available flags
	FlagsAllGreedy    = FlagsUnsafeGreedy | FlagDecodeDWORDHost | FlagDecodeOctalHost | FlagDecodeHexHost | FlagRemoveUnnecessaryHostDots | FlagRemoveEmptyPortSeparator
	FlagsAllNonGreedy = FlagsUnsafeNonGreedy | FlagDecodeDWORDHost | FlagDecodeOctalHost | FlagDecodeHexHost | FlagRemoveUnnecessaryHostDots | FlagRemoveEmptyPortSeparator
)
```
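
The three numeric host flags (`FlagDecodeDWORDHost`, `FlagDecodeOctalHost`, `FlagDecodeHexHost`) all map an unusual IPv4 spelling back to dotted-decimal form. Here is a minimal sketch of that conversion using only the standard library. The helper names `decodeDWORDHost` and `decodeDottedHost` are hypothetical and not part of purell's API, and purell's own implementation may well differ.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// decodeDWORDHost is a hypothetical helper (not purell's API). It converts a
// single 32-bit number such as "1113982867" or "0x42660793" to dotted form.
// Base 0 lets strconv infer decimal, hex (0x prefix) or octal (leading 0).
func decodeDWORDHost(host string) (string, error) {
	n, err := strconv.ParseUint(host, 0, 32)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("%d.%d.%d.%d", byte(n>>24), byte(n>>16), byte(n>>8), byte(n)), nil
}

// decodeDottedHost is a hypothetical helper that rewrites each of the four
// dot-separated parts (for example octal "0102") as a decimal octet.
func decodeDottedHost(host string) (string, error) {
	parts := strings.Split(host, ".")
	if len(parts) != 4 {
		return "", fmt.Errorf("expected 4 octets, got %d", len(parts))
	}
	for i, p := range parts {
		n, err := strconv.ParseUint(p, 0, 8)
		if err != nil {
			return "", err
		}
		parts[i] = strconv.FormatUint(n, 10)
	}
	return strings.Join(parts, "."), nil
}

func main() {
	fmt.Println(decodeDWORDHost("1113982867"))        // 66.102.7.147 <nil>
	fmt.Println(decodeDWORDHost("0x42660793"))        // 66.102.7.147 <nil>
	fmt.Println(decodeDottedHost("0102.0146.07.0223")) // 66.102.7.147 <nil>
}
```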

For convenience, the flag sets `FlagsSafe`, `FlagsUsuallySafe[Greedy|NonGreedy]`, `FlagsUnsafe[Greedy|NonGreedy]` and `FlagsAll[Greedy|NonGreedy]` are provided for the similarly grouped normalizations on [Wikipedia's URL normalization page][wiki]. You can add individual flags to a set (using the bitwise OR `|` operator) or remove them (using the bitwise AND NOT `&^` operator) to build your own custom set.
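
As a rough illustration of adding and removing flags, the sketch below builds a custom set with `|` and `&^`. To stay runnable without the dependency, it declares a local stand-in `NormalizationFlags` type and a few constants that only mimic the shape of purell's flags. Their bit values, the `FlagsBase` set, and the `has` and `customFlags` helpers are assumptions made for this example.

```go
package main

import "fmt"

// NormalizationFlags and the constants below are local stand-ins that mimic
// the shape of purell's flags so this sketch runs on its own. Their bit
// values are illustrative and do not match purell's.
type NormalizationFlags uint

const (
	FlagLowercaseScheme NormalizationFlags = 1 << iota
	FlagLowercaseHost
	FlagRemoveTrailingSlash
	FlagSortQuery

	// FlagsBase is a hypothetical convenience set for this example only.
	FlagsBase = FlagLowercaseScheme | FlagLowercaseHost | FlagRemoveTrailingSlash
)

// has reports whether every bit of f is set in set.
func has(set, f NormalizationFlags) bool { return set&f == f }

// customFlags adds one flag with | and removes another with &^.
func customFlags() NormalizationFlags {
	return (FlagsBase | FlagSortQuery) &^ FlagRemoveTrailingSlash
}

func main() {
	flags := customFlags()
	fmt.Println(has(flags, FlagSortQuery))           // true
	fmt.Println(has(flags, FlagRemoveTrailingSlash)) // false
	fmt.Println(has(flags, FlagLowercaseHost))       // true
}
```

With the real package, the same expression would read `(purell.FlagsUsuallySafeGreedy | purell.FlagSortQuery) &^ purell.FlagRemoveTrailingSlash`. The parentheses matter: in Go, `&^` binds tighter than `|`, so without them the removal would apply only to the right-hand operand.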

The [full godoc reference is available on gopkgdoc][godoc].

Some things to note:

*    `FlagDecodeUnnecessaryEscapes`, `FlagEncodeNecessaryEscapes`, `FlagUppercaseEscapes` and `FlagRemoveEmptyQuerySeparator` are always implicitly set, because internally the URL string is parsed into a URL object, which automatically decodes unnecessary escapes, uppercases and encodes necessary ones, and removes empty query separators (an unnecessary `?` at the end of the URL). So these operations cannot **not** be done. For this reason, `FlagRemoveEmptyQuerySeparator` (as well as the other three) has been included in the `FlagsSafe` convenience set, instead of `FlagsUnsafe`, where Wikipedia puts it.

*    The `FlagDecodeUnnecessaryEscapes` flag decodes the following escapes (*from -> to*):
    -    %24 -> $
    -    %26 -> &
    -    %2B-%3B -> +,-./0123456789:;
    -    %3D -> =
    -    %40-%5A -> @ABCDEFGHIJKLMNOPQRSTUVWXYZ
    -    %5F -> _
    -    %61-%7A -> abcdefghijklmnopqrstuvwxyz
    -    %7E -> ~


*    When the `NormalizeURL` function is used (passing a URL object), the source URL object is modified in place (that is, after the call, the URL object reflects the normalization).

*    The *replace IP with domain name* normalization (`http://208.77.188.166/ -> http://www.example.com/`) is obviously not possible for a library without making some network requests. This is not implemented in purell.

*    The *remove unused query string parameters* and *remove default query parameters* normalizations are also not implemented, since they are very case-specific and quite trivial to do with a URL object.

### Safe vs Usually Safe vs Unsafe

Purell allows you to control the level of risk you take while normalizing a URL. You can aggressively normalize, play it totally safe, or anything in between.

Consider the following URL:

`HTTPS://www.RooT.com/toto/t%45%1f///a/./b/../c/?z=3&w=2&a=4&w=1#invalid`

Normalizing with the `FlagsSafe` gives:

`https://www.root.com/toto/tE%1F///a/./b/../c/?z=3&w=2&a=4&w=1#invalid`

With the `FlagsUsuallySafeGreedy`:

`https://www.root.com/toto/tE%1F///a/c?z=3&w=2&a=4&w=1#invalid`

And with `FlagsUnsafeGreedy`:

`http://root.com/toto/tE%1F/a/c?a=4&w=1&w=2&z=3`
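
To make the query-string change in the last result concrete, the sketch below reproduces the ordering (`z=3&w=2&a=4&w=1` becoming `a=4&w=1&w=2&z=3`) with the standard library alone. The helper `sortedQuery` is hypothetical. It only illustrates the idea behind sorting a query and is not purell's implementation.

```go
package main

import (
	"fmt"
	"net/url"
	"sort"
)

// sortedQuery is a hypothetical helper: it sorts the values of each key,
// then relies on Values.Encode, which emits keys in sorted order.
func sortedQuery(rawQuery string) (string, error) {
	q, err := url.ParseQuery(rawQuery)
	if err != nil {
		return "", err
	}
	for k := range q {
		sort.Strings(q[k])
	}
	return q.Encode(), nil
}

func main() {
	s, err := sortedQuery("z=3&w=2&a=4&w=1")
	if err != nil {
		panic(err)
	}
	fmt.Println(s) // a=4&w=1&w=2&z=3
}
```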

## TODOs

*    Add a class/default instance to allow specifying custom directory index names? At the moment, removing directory index removes `(^|/)((?:default|index)\.\w{1,4})$`.
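
To show what that pattern matches, and how custom index names might be supported, here is a small sketch using the standard `regexp` package. The names `rxDirIndex`, `removeDirectoryIndex` and `newDirIndexPattern` are hypothetical. This is one possible shape for the TODO, not something purell provides.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// rxDirIndex is the pattern quoted in the TODO above.
var rxDirIndex = regexp.MustCompile(`(^|/)((?:default|index)\.\w{1,4})$`)

// removeDirectoryIndex strips a trailing default.* or index.* file name,
// keeping the slash (or start of string) captured in group 1.
func removeDirectoryIndex(rx *regexp.Regexp, path string) string {
	return rx.ReplaceAllString(path, "$1")
}

// newDirIndexPattern is a hypothetical constructor that builds the same
// kind of pattern for caller-supplied base names (at least one expected).
func newDirIndexPattern(names ...string) *regexp.Regexp {
	quoted := make([]string, len(names))
	for i, n := range names {
		quoted[i] = regexp.QuoteMeta(n)
	}
	return regexp.MustCompile(`(^|/)((?:` + strings.Join(quoted, "|") + `)\.\w{1,4})$`)
}

func main() {
	fmt.Println(removeDirectoryIndex(rxDirIndex, "/path/index.html")) // /path/
	fmt.Println(removeDirectoryIndex(rxDirIndex, "/path/page.html"))  // /path/page.html

	custom := newDirIndexPattern("home", "index")
	fmt.Println(removeDirectoryIndex(custom, "/docs/home.php")) // /docs/
}
```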

## Thanks / Contributions

@rogpeppe
@jehiah
@opennota
@pchristopher1275
@zenovich
@beeker1121

## License

The [BSD 3-Clause license][bsd].

[bsd]: http://opensource.org/licenses/BSD-3-Clause
[wiki]: http://en.wikipedia.org/wiki/URL_normalization
[rfc]: http://tools.ietf.org/html/rfc3986#section-6
[godoc]: http://go.pkgdoc.org/github.com/PuerkitoBio/purell
[pr5]: https://github.com/PuerkitoBio/purell/pull/5
[iss7]: https://github.com/PuerkitoBio/purell/issues/7