.. _topics-practices:

================
Common Practices
================

This section documents common practices when using Scrapy. These are things
that cover many topics and don't often fall into any other specific section.

.. _run-from-script:

Run Scrapy from a script
========================

You can use the :ref:`API <topics-api>` to run Scrapy from a script, instead of
the typical way of running Scrapy via ``scrapy crawl``.

Remember that Scrapy is built on top of the Twisted asynchronous networking
library, so you need to run it inside the Twisted reactor.

The first utility you can use to run your spiders is
:class:`scrapy.crawler.CrawlerProcess`. This class will start a Twisted reactor
for you, configuring the logging and setting shutdown handlers. This class is
the one used by all Scrapy commands.

Here's an example showing how to run a single spider with it.

::

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class MySpider(scrapy.Spider):
        # Your spider definition
        ...

    process = CrawlerProcess(settings={
        "FEEDS": {
            "items.json": {"format": "json"},
        },
    })

    process.crawl(MySpider)
    process.start() # the script will block here until the crawling is finished

Settings are passed to :class:`~scrapy.crawler.CrawlerProcess` as a dictionary,
as shown above. Make sure to check the :class:`~scrapy.crawler.CrawlerProcess`
documentation to get acquainted with its usage details.

If you are inside a Scrapy project there are some additional helpers you can
use to import those components within the project. You can automatically import
your spiders by passing their name to :class:`~scrapy.crawler.CrawlerProcess`,
and use ``get_project_settings`` to get a :class:`~scrapy.settings.Settings`
instance with your project settings.

What follows is a working example of how to do that, using the `testspiders`_
project as an example.

::

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = CrawlerProcess(get_project_settings())

    # 'followall' is the name of one of the spiders of the project.
    process.crawl('followall', domain='scrapy.org')
    process.start() # the script will block here until the crawling is finished

There's another Scrapy utility that provides more control over the crawling
process: :class:`scrapy.crawler.CrawlerRunner`. This class is a thin wrapper
that encapsulates some simple helpers to run multiple crawlers, but it won't
start or interfere with existing reactors in any way.

When using this class, the reactor should be explicitly run after scheduling
your spiders. It's recommended that you use :class:`~scrapy.crawler.CrawlerRunner`
instead of :class:`~scrapy.crawler.CrawlerProcess` if your application is
already using Twisted and you want to run Scrapy in the same reactor.

Note that you will also have to shut down the Twisted reactor yourself after
the spider is finished. This can be achieved by adding callbacks to the
deferred returned by the :meth:`CrawlerRunner.crawl
<scrapy.crawler.CrawlerRunner.crawl>` method.

Here's an example of its usage, along with a callback to manually stop the
reactor after ``MySpider`` has finished running.

::

    from twisted.internet import reactor
    import scrapy
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging

    class MySpider(scrapy.Spider):
        # Your spider definition
        ...

    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    runner = CrawlerRunner()

    d = runner.crawl(MySpider)
    d.addBoth(lambda _: reactor.stop())
    reactor.run() # the script will block here until the crawling is finished

.. seealso:: :doc:`twisted:core/howto/reactor-basics`

.. _run-multiple-spiders:

Running multiple spiders in the same process
============================================

By default, Scrapy runs a single spider per process when you run ``scrapy
crawl``. However, Scrapy supports running multiple spiders per process using
the :ref:`internal API <topics-api>`.

Here is an example that runs multiple spiders simultaneously:

::

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...

    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...

    process = CrawlerProcess()
    process.crawl(MySpider1)
    process.crawl(MySpider2)
    process.start() # the script will block here until all crawling jobs are finished

Same example using :class:`~scrapy.crawler.CrawlerRunner`:

::

    import scrapy
    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging

    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...

    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...

    configure_logging()
    runner = CrawlerRunner()
    runner.crawl(MySpider1)
    runner.crawl(MySpider2)
    d = runner.join()
    d.addBoth(lambda _: reactor.stop())

    reactor.run() # the script will block here until all crawling jobs are finished

Same example but running the spiders sequentially by chaining the deferreds:

::

    import scrapy
    from twisted.internet import reactor, defer
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging

    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...

    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...

    configure_logging()
    runner = CrawlerRunner()

    @defer.inlineCallbacks
    def crawl():
        yield runner.crawl(MySpider1)
        yield runner.crawl(MySpider2)
        reactor.stop()

    crawl()
    reactor.run() # the script will block here until the last crawl call is finished

.. seealso:: :ref:`run-from-script`.

.. _distributed-crawls:

Distributed crawls
==================

Scrapy doesn't provide any built-in facility for running crawls in a
distributed (multi-server) manner. However, there are some ways to distribute
crawls, which vary depending on how you plan to distribute them.

If you have many spiders, the obvious way to distribute the load is to set up
many Scrapyd instances and distribute spider runs among those.

If you instead want to run a single (big) spider through many machines, what
you usually do is partition the URLs to crawl and send them to each separate
spider. Here is a concrete example:

First, you prepare the list of URLs to crawl and put them into separate
files/URLs::

    http://somedomain.com/urls-to-crawl/spider1/part1.list
    http://somedomain.com/urls-to-crawl/spider1/part2.list
    http://somedomain.com/urls-to-crawl/spider1/part3.list

Then you fire a spider run on 3 different Scrapyd servers. The spider would
receive a (spider) argument ``part`` with the number of the partition to
crawl::

    curl http://scrapy1.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=1
    curl http://scrapy2.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=2
    curl http://scrapy3.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=3
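
On the spider side, ``part`` arrives as a regular spider argument. What follows
is a minimal sketch of how such a spider could use it to pick its partition
file; the spider name, the ``part`` handling and the URL layout are assumptions
taken from the example above, not something Scrapy provides out of the box.

::

    import scrapy

    class MySpider1(scrapy.Spider):
        # Hypothetical spider that crawls a single partition of the URL list.
        name = 'spider1'

        def __init__(self, part=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # 'part' arrives as a string from the "-d part=N" argument above.
            self.part = part

        def start_requests(self):
            # Fetch the partition file for this run, then crawl every URL in it.
            list_url = (
                'http://somedomain.com/urls-to-crawl/spider1/part%s.list' % self.part
            )
            yield scrapy.Request(list_url, callback=self.parse_url_list)

        def parse_url_list(self, response):
            for url in response.text.splitlines():
                if url.strip():
                    yield scrapy.Request(url.strip(), callback=self.parse)

        def parse(self, response):
            # Your usual parsing logic goes here.
            ...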
.. _bans:

Avoiding getting banned
=======================

Some websites implement certain measures to prevent bots from crawling them,
with varying degrees of sophistication. Getting around those measures can be
difficult and tricky, and may sometimes require special infrastructure. Please
consider contacting `commercial support`_ if in doubt.

Here are some tips to keep in mind when dealing with these kinds of sites:

* rotate your user agent from a pool of well-known ones from browsers (google
  around to get a list of them)
* disable cookies (see :setting:`COOKIES_ENABLED`) as some sites may use
  cookies to spot bot behaviour
* use download delays (2 or higher). See the :setting:`DOWNLOAD_DELAY` setting.
* if possible, use `Google cache`_ to fetch pages, instead of hitting the sites
  directly
* use a pool of rotating IPs. For example, the free `Tor project`_ or paid
  services like `ProxyMesh`_. An open source alternative is `scrapoxy`_, a
  super proxy that you can attach your own proxies to.
* use a highly distributed downloader that circumvents bans internally, so you
  can just focus on parsing clean pages. One example of such downloaders is
  `Zyte Smart Proxy Manager`_

The cookie and download-delay tips map directly onto project settings; a short
sketch is shown at the end of this section.

If you are still unable to prevent your bot from getting banned, consider
contacting `commercial support`_.

.. _Tor project: https://www.torproject.org/
.. _commercial support: https://scrapy.org/support/
.. _ProxyMesh: https://proxymesh.com/
.. _Google cache: http://www.googleguide.com/cached_pages.html
.. _testspiders: https://github.com/scrapinghub/testspiders
.. _scrapoxy: https://scrapoxy.io/
.. _Zyte Smart Proxy Manager: https://www.zyte.com/smart-proxy-manager/
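
For the cookie and download-delay tips above, here is a minimal sketch of the
corresponding project settings; the values are only illustrative, and the right
delay depends on the site you are crawling.

::

    # settings.py fragment -- illustrative values only

    # Some sites use cookies to spot bot behaviour, so turn them off.
    COOKIES_ENABLED = False

    # Throttle requests to the same website with a 2 second delay.
    DOWNLOAD_DELAY = 2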