.. _topics-practices:

================
Common Practices
================

This section documents common practices when using Scrapy. These are things
that cover many topics and don't often fall into any other specific section.

.. _run-from-script:

Run Scrapy from a script
========================

You can use the :ref:`API <topics-api>` to run Scrapy from a script, instead of
the typical way of running Scrapy via ``scrapy crawl``.

Remember that Scrapy is built on top of the Twisted
asynchronous networking library, so you need to run it inside the Twisted reactor.

The first utility you can use to run your spiders is
:class:`scrapy.crawler.CrawlerProcess`. This class will start a Twisted reactor
for you, configuring logging and setting up shutdown handlers. This class is
the one used by all Scrapy commands.

Here's an example showing how to run a single spider with it.

::

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class MySpider(scrapy.Spider):
        # Your spider definition
        ...

    process = CrawlerProcess(settings={
        "FEEDS": {
            "items.json": {"format": "json"},
        },
    })

    process.crawl(MySpider)
    process.start()  # the script will block here until the crawling is finished

Settings are defined in the dictionary passed to
:class:`~scrapy.crawler.CrawlerProcess`. Make sure to check the
:class:`~scrapy.crawler.CrawlerProcess` documentation to get acquainted with its
usage details.
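
As an alternative to the plain dict, you can also build a
:class:`~scrapy.settings.Settings` object yourself and pass it to
:class:`~scrapy.crawler.CrawlerProcess`. A minimal sketch, equivalent to the
dict-based example above:

::

    from scrapy.crawler import CrawlerProcess
    from scrapy.settings import Settings

    # build a Settings instance and set values on it explicitly
    settings = Settings()
    settings.set("FEEDS", {"items.json": {"format": "json"}})

    process = CrawlerProcess(settings)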

If you are inside a Scrapy project there are some additional helpers you can
use to import those components within the project. You can automatically import
your spiders by passing their name to :class:`~scrapy.crawler.CrawlerProcess`, and
use ``get_project_settings`` to get a :class:`~scrapy.settings.Settings`
instance with your project settings.

What follows is a working example of how to do that, using the `testspiders`_
project as an example.

::

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = CrawlerProcess(get_project_settings())

    # 'followall' is the name of one of the spiders of the project.
    process.crawl('followall', domain='scrapy.org')
    process.start()  # the script will block here until the crawling is finished

There's another Scrapy utility that provides more control over the crawling
process: :class:`scrapy.crawler.CrawlerRunner`. This class is a thin wrapper
that encapsulates some simple helpers to run multiple crawlers, but it won't
start or interfere with existing reactors in any way.

When using this class, the reactor should be run explicitly after scheduling
your spiders. It's recommended that you use :class:`~scrapy.crawler.CrawlerRunner`
instead of :class:`~scrapy.crawler.CrawlerProcess` if your application is
already using Twisted and you want to run Scrapy in the same reactor.

Note that you will also have to shut down the Twisted reactor yourself after
the spider is finished. This can be achieved by adding callbacks to the
deferred returned by the :meth:`CrawlerRunner.crawl
<scrapy.crawler.CrawlerRunner.crawl>` method.

Here's an example of its usage, along with a callback to manually stop the
reactor after ``MySpider`` has finished running.
::

    from twisted.internet import reactor
    import scrapy
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging

    class MySpider(scrapy.Spider):
        # Your spider definition
        ...

    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    runner = CrawlerRunner()

    d = runner.crawl(MySpider)
    d.addBoth(lambda _: reactor.stop())
    reactor.run()  # the script will block here until the crawling is finished

.. seealso:: :doc:`twisted:core/howto/reactor-basics`

.. _run-multiple-spiders:

Running multiple spiders in the same process
============================================

By default, Scrapy runs a single spider per process when you run ``scrapy
crawl``. However, Scrapy supports running multiple spiders per process using
the :ref:`internal API <topics-api>`.

Here is an example that runs multiple spiders simultaneously:

::

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...

    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...

    process = CrawlerProcess()
    process.crawl(MySpider1)
    process.crawl(MySpider2)
    process.start()  # the script will block here until all crawling jobs are finished

Same example using :class:`~scrapy.crawler.CrawlerRunner`:

::

    import scrapy
    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging

    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...

    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...

    configure_logging()
    runner = CrawlerRunner()
    runner.crawl(MySpider1)
    runner.crawl(MySpider2)
    d = runner.join()
    d.addBoth(lambda _: reactor.stop())

    reactor.run()  # the script will block here until all crawling jobs are finished

Same example but running the spiders sequentially by chaining the deferreds:

::

    import scrapy
    from twisted.internet import reactor, defer
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging

    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...

    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...

    configure_logging()
    runner = CrawlerRunner()

    @defer.inlineCallbacks
    def crawl():
        yield runner.crawl(MySpider1)
        yield runner.crawl(MySpider2)
        reactor.stop()

    crawl()
    reactor.run()  # the script will block here until the last crawl call is finished

.. seealso:: :ref:`run-from-script`.

.. _distributed-crawls:

Distributed crawls
==================

Scrapy doesn't provide any built-in facility for running crawls in a distributed
(multi-server) manner. However, there are some ways to distribute crawls, which
vary depending on how you plan to distribute them.

If you have many spiders, the obvious way to distribute the load is to set up
many Scrapyd instances and distribute spider runs among those.

If you instead want to run a single (big) spider across many machines, what
you usually do is partition the URLs to crawl and send them to each separate
spider. Here is a concrete example:

First, you prepare the list of URLs to crawl and put them into separate
files/URLs::

    http://somedomain.com/urls-to-crawl/spider1/part1.list
    http://somedomain.com/urls-to-crawl/spider1/part2.list
    http://somedomain.com/urls-to-crawl/spider1/part3.list

Then you fire a spider run on 3 different Scrapyd servers. The spider would
receive a (spider) argument ``part`` with the number of the partition to
crawl::

    curl http://scrapy1.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=1
    curl http://scrapy2.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=2
    curl http://scrapy3.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=3

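On the spider side, the ``part`` argument can be used to fetch the matching
partition of the URL list. What follows is a minimal sketch of such a spider,
assuming the file layout shown above (this spider is an illustration, not part
of the `testspiders`_ project):

::

    import scrapy

    class MySpider(scrapy.Spider):
        name = 'spider1'

        def __init__(self, part=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.part = part

        def start_requests(self):
            # download the URL list for this spider's partition
            url = f'http://somedomain.com/urls-to-crawl/spider1/part{self.part}.list'
            yield scrapy.Request(url, callback=self.parse_url_list)

        def parse_url_list(self, response):
            # schedule a request for every URL listed in the partition file
            for line in response.text.splitlines():
                if line.strip():
                    yield scrapy.Request(line.strip(), callback=self.parse)

        def parse(self, response):
            ...  # your actual parsing logic
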
.. _bans:

Avoiding getting banned
=======================

Some websites implement certain measures to prevent bots from crawling them,
with varying degrees of sophistication. Getting around those measures can be
difficult and tricky, and may sometimes require special infrastructure. Please
consider contacting `commercial support`_ if in doubt.

Here are some tips to keep in mind when dealing with these kinds of sites:

* rotate your user agent from a pool of well-known ones from browsers (google
  around to get a list of them); see the middleware sketch after this list
* disable cookies (see :setting:`COOKIES_ENABLED`) as some sites may use
  cookies to spot bot behaviour
* use download delays (2 or higher). See :setting:`DOWNLOAD_DELAY` setting.
* if possible, use `Google cache`_ to fetch pages, instead of hitting the sites
  directly
* use a pool of rotating IPs. For example, the free `Tor project`_ or paid
  services like `ProxyMesh`_. An open source alternative is `scrapoxy`_, a
  super proxy that you can attach your own proxies to.
* use a highly distributed downloader that circumvents bans internally, so you
  can just focus on parsing clean pages. One example of such downloaders is
  `Zyte Smart Proxy Manager`_.

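As an example of the first tip, here is a minimal sketch of a user-agent
rotating downloader middleware. The middleware name and the user agent pool
are placeholders (Scrapy does not ship this middleware); you would enable it
through the :setting:`DOWNLOADER_MIDDLEWARES` setting:

::

    import random

    class RotateUserAgentMiddleware:
        # Hypothetical middleware: pick a random user agent per request
        # from a pool of well-known browser user agents.
        user_agents = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...',
        ]

        def process_request(self, request, spider):
            # returning None lets the request continue through the
            # remaining middlewares with the new header in place
            request.headers['User-Agent'] = random.choice(self.user_agents)
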
If you are still unable to prevent your bot from getting banned, consider
contacting `commercial support`_.

.. _Tor project: https://www.torproject.org/
.. _commercial support: https://scrapy.org/support/
.. _ProxyMesh: https://proxymesh.com/
.. _Google cache: http://www.googleguide.com/cached_pages.html
.. _testspiders: https://github.com/scrapinghub/testspiders
.. _scrapoxy: https://scrapoxy.io/
.. _Zyte Smart Proxy Manager: https://www.zyte.com/smart-proxy-manager/