<title>Defense Against Spiders</title>

The website presented by a Fossil server has many hyperlinks.
Even a modest project can have millions of pages in its
tree, and many of those pages (for example diffs, annotations,
and ZIP archives of older check-ins) can be expensive to compute.
If a spider or bot tries to walk a website implemented by
Fossil, it can impose a crippling bandwidth and CPU load.

The website presented by a Fossil server is intended to be used
interactively by humans, not walked by spiders. This article
describes the techniques used by Fossil to welcome human
users while keeping out spiders.

<h2>The Hyperlink User Capability</h2>

Every Fossil web session has a "user". For random passers-by on the internet
(and for spiders) that user is "nobody". The "anonymous" user is also
available for humans who do not wish to identify themselves. The difference
is that "anonymous" requires a login (using a password supplied via
a CAPTCHA) whereas "nobody" does not require a login.
The site administrator can also create logins with
passwords for specific individuals.

Users without the <b>[./caps/ref.html#h | Hyperlink]</b> capability
do not see most Fossil-generated hyperlinks. This is
a simple defense against spiders, since [./caps/#ucat | the "nobody"
user category] does not have this capability by default.
Users must log in (perhaps as
"anonymous") before they can see any of the hyperlinks. A spider
that cannot log into your Fossil repository cannot walk
its historical check-ins, create diffs between versions, pull ZIP
archives, and so forth, because the necessary links simply are not there.

A text message appears at the top of each page in this situation to
invite humans to log in as anonymous in order to activate hyperlinks.

Because this required login step is annoying to some,
Fossil provides other techniques for blocking spiders that
are less cumbersome to humans.
<h2>Automatic Hyperlinks Based on UserAgent</h2>

Fossil has the ability to selectively enable hyperlinks for users
that lack the <b>Hyperlink</b> capability based on their UserAgent string in the
HTTP request header and on the browser's ability to run JavaScript.

The UserAgent string is a text identifier, included in the header
of most HTTP requests, that identifies the specific maker and version of
the browser (or spider) that generated the request. Typical UserAgent
strings look like this:

<ul>
<li> Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0
<li> Mozilla/4.0 (compatible; MSIE 8.0; Windows_NT 5.1; Trident/4.0)
<li> Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
<li> Wget/1.12 (openbsd4.9)
</ul>

The first two UserAgent strings above identify Firefox 19 and
Internet Explorer 8.0, both running on Windows NT. The third
example is the spider used by Google to index the internet.
The fourth example is the "wget" utility running on OpenBSD.
Thus the first two UserAgent strings above identify the requester
as human whereas the last two identify the requester as a spider.
Note that the UserAgent string is completely under the control
of the requester, so a malicious spider can forge a UserAgent
string that makes it look like a human. But most spiders genuinely
seem to want to "play nicely" on the internet and are quite open
about the fact that they are a spider. And so the UserAgent string
provides a good first guess about whether a request originates
from a human or a spider.

In Fossil, under the Admin/Access menu, there is a setting entitled
"<b>Enable hyperlinks for "nobody" based on User-Agent and Javascript</b>".
If this setting is enabled, and if the UserAgent string looks like a
human and not a spider, then Fossil will enable hyperlinks even if
the <b>Hyperlink</b> capability is omitted from the user permissions.
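The first-guess classification described above can be sketched as a simple pattern match. The function below is purely illustrative (it is not Fossil's actual code, and the pattern list is an assumption); it flags a UserAgent string as a likely spider when it contains markers that well-behaved bots and command-line fetchers commonly include:

```javascript
// Illustrative sketch only -- not Fossil's actual classifier.
// Flags a UserAgent string as a likely spider based on common markers.
function looksLikeSpider(userAgent) {
  // Well-behaved spiders usually identify themselves openly.
  const spiderPatterns = [
    /bot/i,      // Googlebot, Bingbot, ...
    /spider/i,
    /crawl/i,
    /wget/i,     // command-line fetchers
    /curl/i,
  ];
  return spiderPatterns.some((p) => p.test(userAgent));
}

// Applied to the example UserAgent strings shown above:
looksLikeSpider("Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0"); // false
looksLikeSpider("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"); // true
looksLikeSpider("Wget/1.12 (openbsd4.9)"); // true
```

Because the UserAgent string is forgeable, a match like this can only ever be a first guess, which is why Fossil layers the JavaScript requirement on top of it.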
This setting
gives humans easy access to the hyperlinks while preventing spiders
from walking the millions of pages on a typical Fossil site.

But the hyperlinks are not enabled directly by the setting above.
Instead, the HTML code that is generated contains anchor tags ("&lt;a&gt;")
without "href=" attributes. Then, JavaScript code is added to the
end of the page that goes back and fills in the "href=" attributes of
the anchor tags with the hyperlink targets, thus enabling the hyperlinks.
This extra step of using JavaScript to enable the hyperlink targets
is a security measure against spiders that forge a human-looking
UserAgent string. Most spiders do not bother to run JavaScript, and
so to the spider an empty anchor tag is useless. But all modern
web browsers implement JavaScript, so hyperlinks show up
normally for human users.

<h2>Further Defenses</h2>

Recently (as of this writing, in the spring of 2013) the Fossil server
on the SQLite website ([http://www.sqlite.org/src/]) has been hit repeatedly
by Chinese spiders that use forged UserAgent strings to make themselves look
like normal web browsers and that interpret JavaScript. We do not
believe these attacks to be nefarious, since SQLite is public domain
and the attackers could obtain all the information they ever wanted to
know about SQLite simply by cloning the repository. Instead, we
believe these "attacks" are coming from "script kiddies". But regardless
of whether or not malice is involved, these attacks present
an unnecessary load on the server, which reduces the responsiveness of
the SQLite website for well-behaved and socially responsible users.
For this reason, additional defenses against
spiders have been put in place.
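Both of these additional defenses refine the href-filling script introduced in the previous section, which in outline works like this. This is an illustrative sketch, not Fossil's actual script; in particular, the "data-href" attribute used here to stash the link target is an assumption:

```javascript
// Illustrative sketch -- not Fossil's actual code. The server emits
// anchor tags without "href=" but with the target stashed in a
// hypothetical "data-href" attribute. A script at the end of the page
// copies it into "href", so links only activate in browsers that
// actually run JavaScript.
function enableHyperlinks(anchors) {
  for (const a of anchors) {
    const target = a.getAttribute("data-href");
    if (target !== null) a.setAttribute("href", target);
  }
}

// In a real page this would be:
//   enableHyperlinks(document.querySelectorAll("a[data-href]"));
// Here a minimal stand-in object demonstrates the effect:
const anchor = {
  attrs: { "data-href": "/timeline" },
  getAttribute(name) { return name in this.attrs ? this.attrs[name] : null; },
  setAttribute(name, value) { this.attrs[name] = value; },
};
enableHyperlinks([anchor]);
// anchor.getAttribute("href") is now "/timeline"
```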
On the Admin/Access page of Fossil, just below the
"<b>Enable hyperlinks for "nobody" based on User-Agent and Javascript</b>"
setting, there are now two additional sub-settings that can be optionally
enabled to control hyperlinks.

The first sub-setting waits to run the
JavaScript that sets the "href=" attributes on anchor tags until after
at least one "mouseover" event has been detected on the &lt;body&gt;
element of the page. The thinking here is that spiders will not be
simulating mouse motion, so no mouseover events will ever occur and
hence the hyperlinks will never become enabled for spiders.

The second sub-setting is a delay (in milliseconds) before setting
the "href=" attributes on anchor tags. The default value for this
delay is 10 milliseconds. The idea here is that a spider will try to
render the page immediately, will not wait for delayed scripts
to be run, and thus will never enable the hyperlinks.

These two sub-settings can be used separately or together. If used together,
then the delay timer does not start until after the first mouse movement
is detected.

See also [./loadmgmt.md|Managing Server Load] for a description
of how expensive pages can be disabled when the server is under heavy
load.

<h2>The Ongoing Struggle</h2>

Fossil currently does a very good job of providing easy access to humans
while keeping out troublesome robots and spiders. However, spiders and
bots continue to grow more sophisticated, requiring ever more advanced
defenses. This "arms race" is unlikely to ever end. The developers of
Fossil will continue to try to improve the spider defenses of Fossil, so
check back from time to time for the latest releases and updates.

Readers of this page who have suggestions on how to improve the spider
defenses in Fossil are invited to submit their ideas to the Fossil Users
forum:
[https://fossil-scm.org/forum].
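As a closing illustration, the two sub-settings described under "Further Defenses" might combine as in the sketch below. This is not Fossil's actual script: fillInHrefs() merely stands in for the href-filling routine, and the event plumbing is simplified:

```javascript
// Sketch of combining both sub-settings (not Fossil's actual code).
// With both enabled, the 10 ms delay timer is started only by the
// first "mouseover" event, so spiders that neither run scripts nor
// simulate mouse motion never see working links.
const DELAY_MS = 10;          // default value of the delay sub-setting
let linksEnabled = false;

function fillInHrefs() {      // stand-in for the real href-filling code
  linksEnabled = true;
}

let seenMouse = false;
function onMouseOver() {
  if (seenMouse) return;      // react only to the first event
  seenMouse = true;
  setTimeout(fillInHrefs, DELAY_MS);
}

// In a real page: document.body.addEventListener("mouseover", onMouseOver);
// Simulate a human moving the mouse:
onMouseOver();
// About DELAY_MS later, fillInHrefs() runs and the hyperlinks go live.
```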