<title>Defense Against Spiders</title>

The website presented by a Fossil server has many hyperlinks.
Even a modest project can have millions of pages in its
tree, and many of those pages (for example diffs and annotations
and ZIP archives of older check-ins) can be expensive to compute.
If a spider or bot tries to walk a website implemented by
Fossil, it can impose a crippling bandwidth and CPU load.

The website presented by a Fossil server is intended to be used
interactively by humans, not walked by spiders.  This article
describes the techniques used by Fossil to try to welcome human
users while keeping out spiders.
<h2>The Hyperlink User Capability</h2>

Every Fossil web session has a "user".  For random passers-by on the internet
(and for spiders) that user is "nobody".  The "anonymous" user is also
available for humans who do not wish to identify themselves.  The difference
is that "anonymous" requires a login (using a password supplied via
a CAPTCHA) whereas "nobody" does not require a login.
The site administrator can also create logins with
passwords for specific individuals.

Users without the <b>[./caps/ref.html#h | Hyperlink]</b> capability
do not see most Fossil-generated hyperlinks. This is
a simple defense against spiders, since [./caps/#ucat | the "nobody"
user category] does not have this capability by default.
Users must log in (perhaps as
"anonymous") before they can see any of the hyperlinks.  A spider
that cannot log into your Fossil repository will be unable to walk
its historical check-ins, create diffs between versions, pull ZIP
archives, etc. by visiting links, because they aren't there.

A text message appears at the top of each page in this situation to
invite humans to log in as anonymous in order to activate hyperlinks.

Because this required login step is annoying to some,
Fossil provides other techniques for blocking spiders that
are less cumbersome for humans.

<h2>Automatic Hyperlinks Based on UserAgent</h2>

Fossil has the ability to selectively enable hyperlinks for users
that lack the <b>Hyperlink</b> capability, based on the UserAgent string in the
HTTP request header and on the browser's ability to run JavaScript.

The UserAgent string is a text identifier included in the header
of most HTTP requests that identifies the specific maker and version of
the browser (or spider) that generated the request.  Typical UserAgent
strings look like this:

<ul>
<li> Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0
<li> Mozilla/4.0 (compatible; MSIE 8.0; Windows_NT 5.1; Trident/4.0)
<li> Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
<li> Wget/1.12 (openbsd4.9)
</ul>

The first two UserAgent strings above identify Firefox 19 and
Internet Explorer 8.0, both running on Windows NT.  The third
example is the spider used by Google to index the internet.
The fourth example is the "wget" utility running on OpenBSD.
Thus the first two UserAgent strings above identify the requester
as human whereas the last two identify the requester as a spider.
Note that the UserAgent string is completely under the control
of the requester, and so a malicious spider can forge a UserAgent
string that makes it look like a human.  But most spiders genuinely
seem to want to "play nicely" on the internet and are quite open
about the fact that they are spiders.  And so the UserAgent string
provides a good first guess about whether a request originates
from a human or a spider.
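Such a first-guess check might be sketched as follows.  This is an
illustrative classifier only, not Fossil's actual rule set: the
function name and the particular list of bot markers are assumptions
made for the example.

```javascript
// Hypothetical first-guess classifier (NOT Fossil's actual rules).
// Well-behaved spiders advertise themselves with markers like "bot",
// "spider", or "crawl", or identify a command-line fetch tool.
function looksLikeSpider(userAgent) {
  const botMarkers = [/bot\b/i, /spider/i, /crawl/i, /^wget\//i, /^curl\//i];
  return botMarkers.some((re) => re.test(userAgent));
}

console.log(looksLikeSpider(
  "Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0")); // false
console.log(looksLikeSpider(
  "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")); // true
console.log(looksLikeSpider("Wget/1.12 (openbsd4.9)")); // true
```

Because the check is only a heuristic, a forged UserAgent slips past it,
which is why Fossil layers the JavaScript-based defenses described below
on top of it.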

In Fossil, under the Admin/Access menu, there is a setting entitled
"<b>Enable hyperlinks for "nobody" based on User-Agent and Javascript</b>".
If this setting is enabled, and if the UserAgent string looks like a
human and not a spider, then Fossil will enable hyperlinks even if
the <b>Hyperlink</b> capability is omitted from the user permissions.  This setting
gives humans easy access to the hyperlinks while preventing spiders
from walking the millions of pages on a typical Fossil site.

But the hyperlinks are not enabled directly by the setting above.
Instead, the generated HTML contains anchor tags ("&lt;a&gt;")
without "href=" attributes.  Then, JavaScript code is added to the
end of the page that goes back and fills in the "href=" attributes of
the anchor tags with the hyperlink targets, thus enabling the hyperlinks.
This extra step of using JavaScript to enable the hyperlink targets
is a security measure against spiders that forge a human-looking
UserAgent string.  Most spiders do not bother to run JavaScript,
so to them the empty anchor tags are useless.  But all modern
web browsers implement JavaScript, so hyperlinks show up
normally for human users.
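The fill-in step might look like the following sketch.  The
"data-href" attribute and the function name are hypothetical choices
for this illustration (Fossil's real markup and script differ), and
the anchors are modeled as plain objects so the sketch can run outside
a browser.

```javascript
// Sketch of deferred hyperlinks (illustrative; not Fossil's code).
// The server emits anchors with the target stashed in a data
// attribute (here the hypothetical "data-href") and no href= at all.
// A script at the end of the page copies the target into href=; a
// spider that skips JavaScript never performs this step.
function fillHrefs(anchors) {
  for (const a of anchors) {
    if (a.dataset.href) {
      a.href = a.dataset.href; // the link becomes active only now
    }
  }
}

// Plain objects stand in for DOM anchor elements in this sketch.
const anchors = [
  { dataset: { href: "/timeline" }, href: "" },
  { dataset: {}, href: "" }, // anchor with no deferred target
];
fillHrefs(anchors);
console.log(anchors[0].href); // "/timeline"
```

In a real page the loop would run over something like
document.querySelectorAll() and be triggered once the document has
finished loading.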

<h2>Further Defenses</h2>

Recently (as of this writing, in the spring of 2013) the Fossil server
on the SQLite website ([http://www.sqlite.org/src/]) has been hit repeatedly
by Chinese spiders that use forged UserAgent strings to make themselves look
like normal web browsers and that interpret JavaScript.  We do not
believe these attacks to be nefarious, since SQLite is in the public domain
and the attackers could obtain all the information they ever wanted to
know about SQLite simply by cloning the repository.  Instead, we
believe these "attacks" are coming from "script kiddies".  But regardless
of whether or not malice is involved, these attacks do place
an unnecessary load on the server, which reduces the responsiveness of
the SQLite website for well-behaved and socially responsible users.
For this reason, additional defenses against
spiders have been put in place.

On the Admin/Access page of Fossil, just below the
"<b>Enable hyperlinks for "nobody" based on User-Agent and Javascript</b>"
setting, there are now two additional sub-settings that can be optionally
enabled to control hyperlinks.

The first sub-setting waits to run the
JavaScript that sets the "href=" attributes on anchor tags until after
at least one "mouseover" event has been detected on the &lt;body&gt;
element of the page.  The thinking here is that spiders will not be
simulating mouse motion and so no mouseover events will ever occur, and
hence the hyperlinks will never become enabled for spiders.

The second sub-setting is a delay (in milliseconds) before setting
the "href=" attributes on anchor tags.  The default value for this
delay is 10 milliseconds.  The idea here is that a spider will try to
render the page immediately, and will not wait for delayed scripts
to run, and thus will never enable the hyperlinks.

These two sub-settings can be used separately or together.  If used together,
then the delay timer does not start until after the first mouse movement
is detected.
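Used together, the two sub-settings amount to logic like the following
sketch.  The event target is abstracted so the code can run outside a
browser (in a real page it would be the &lt;body&gt; element); the
10-millisecond default comes from the description above, while the
function names and structure are illustrative, not Fossil's actual
script.

```javascript
// Sketch of the combined defenses (illustrative, not Fossil's code):
// the delay timer starts only after the first "mouseover" event.
function armHyperlinkDelay(target, enableLinks, delayMs = 10) {
  let seenMouse = false;
  target.addEventListener("mouseover", () => {
    if (seenMouse) return;            // only the first movement matters
    seenMouse = true;
    setTimeout(enableLinks, delayMs); // then wait the configured delay
  });
}

// A stand-in event target so the sketch can run outside a browser;
// in a real page the target would be document.body.
const pageBody = {
  listeners: [],
  addEventListener(type, fn) { this.listeners.push(fn); },
  dispatch() { for (const fn of this.listeners) fn(); },
};

armHyperlinkDelay(pageBody, () => console.log("hyperlinks enabled"));
pageBody.dispatch(); // simulate the first mouse movement
pageBody.dispatch(); // further movements are ignored
```

A spider that neither runs JavaScript, nor simulates mouse motion, nor
waits for timers fails all three conditions, while a human behind an
ordinary browser clears them without noticing.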

See also [./loadmgmt.md|Managing Server Load] for a description
of how expensive pages can be disabled when the server is under heavy
load.

<h2>The Ongoing Struggle</h2>

Fossil currently does a very good job of providing easy access to humans
while keeping out troublesome robots and spiders.  However, spiders and
bots continue to grow more sophisticated, requiring ever more advanced
defenses.  This "arms race" is unlikely to ever end.  The developers of
Fossil will continue to try to improve the spider defenses of Fossil, so
check back from time to time for the latest releases and updates.

Readers of this page who have suggestions on how to improve the spider
defenses in Fossil are invited to submit their ideas to the Fossil Users
forum:
[https://fossil-scm.org/forum].