.. _faq:

Frequently Asked Questions
==========================

.. _faq-scrapy-bs-cmp:

How does Scrapy compare to BeautifulSoup or lxml?
-------------------------------------------------

`BeautifulSoup`_ and `lxml`_ are libraries for parsing HTML and XML. Scrapy is
an application framework for writing web spiders that crawl websites and
extract data from them.

Scrapy provides a built-in mechanism for extracting data (called
:ref:`selectors <topics-selectors>`) but you can easily use `BeautifulSoup`_
(or `lxml`_) instead, if you feel more comfortable working with them. After
all, they're just parsing libraries which can be imported and used from any
Python code.

In other words, comparing `BeautifulSoup`_ (or `lxml`_) to Scrapy is like
comparing `jinja2`_ to `Django`_.

.. _BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/
.. _lxml: https://lxml.de/
.. _jinja2: https://palletsprojects.com/p/jinja/
.. _Django: https://www.djangoproject.com/

Can I use Scrapy with BeautifulSoup?
------------------------------------

Yes, you can.
As mentioned :ref:`above <faq-scrapy-bs-cmp>`, `BeautifulSoup`_ can be used
for parsing HTML responses in Scrapy callbacks.
You just have to feed the response's body into a ``BeautifulSoup`` object
and extract whatever data you need from it.

Here's an example spider using the BeautifulSoup API, with ``lxml`` as the HTML parser::

    from bs4 import BeautifulSoup
    import scrapy


    class ExampleSpider(scrapy.Spider):
        name = "example"
        allowed_domains = ["example.com"]
        start_urls = (
            'http://www.example.com/',
        )

        def parse(self, response):
            # use lxml to get decent HTML parsing speed
            soup = BeautifulSoup(response.text, 'lxml')
            yield {
                "url": response.url,
                "title": soup.h1.string
            }

.. note::

    ``BeautifulSoup`` supports several HTML/XML parsers.
    See `BeautifulSoup's official documentation`_ on which ones are available.

.. _BeautifulSoup's official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use


Did Scrapy "steal" X from Django?
---------------------------------

Probably, but we don't like that word. We think Django_ is a great open source
project and an example to follow, so we've used it as an inspiration for
Scrapy.

We believe that, if something is already done well, there's no need to reinvent
it. This concept, besides being one of the foundations for open source and free
software, not only applies to software but also to documentation, procedures,
policies, etc. So, instead of going through each problem ourselves, we choose
to copy ideas from those projects that have already solved them properly, and
focus on the real problems we need to solve.

We'd be proud if Scrapy serves as an inspiration for other projects. Feel free
to steal from us!

Does Scrapy work with HTTP proxies?
-----------------------------------

Yes. Support for HTTP proxies is provided (since Scrapy 0.8) through the HTTP
Proxy downloader middleware. See
:class:`~scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware`.

How can I scrape an item with attributes from different pages?
---------------------------------------------------------------

See :ref:`topics-request-response-ref-request-callback-arguments`.
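
In short, you can carry partially-populated item data from one callback to the
next through ``cb_kwargs``. Below is a minimal sketch of that pattern; the URLs,
CSS selectors and field names are made up for illustration::

    import scrapy


    class MultiPageItemSpider(scrapy.Spider):
        name = "multi_page_item"
        start_urls = ["https://example.com/products"]

        def parse(self, response):
            for product in response.css("div.product"):
                # Start the item with data from the listing page
                item = {"name": product.css("h2::text").get()}
                details_url = product.css("a::attr(href)").get()
                # Pass the partially built item on to the next callback
                yield response.follow(
                    details_url, callback=self.parse_details, cb_kwargs={"item": item}
                )

        def parse_details(self, response, item):
            # Complete the item with data from the details page
            item["price"] = response.css("span.price::text").get()
            yield item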


Scrapy crashes with: ImportError: No module named win32api
-----------------------------------------------------------

You need to install `pywin32`_ because of `this Twisted bug`_.

.. _pywin32: https://sourceforge.net/projects/pywin32/
.. _this Twisted bug: https://twistedmatrix.com/trac/ticket/3707

How can I simulate a user login in my spider?
---------------------------------------------

See :ref:`topics-request-response-ref-request-userlogin`.
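
As a quick reference, here is a minimal sketch using ``FormRequest.from_response``;
the login URL, form field names and failure check are hypothetical and depend on
the target site::

    import scrapy


    class LoginSpider(scrapy.Spider):
        name = "login_example"
        start_urls = ["https://example.com/login"]

        def parse(self, response):
            # Fill and submit the login form found in the page
            return scrapy.FormRequest.from_response(
                response,
                formdata={"username": "john", "password": "secret"},
                callback=self.after_login,
            )

        def after_login(self, response):
            # Check that the login succeeded before continuing to crawl
            if b"authentication failed" in response.body:
                self.logger.error("Login failed")
                return
            # ... continue crawling with the authenticated session ...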

.. _faq-bfo-dfo:

Does Scrapy crawl in breadth-first or depth-first order?
---------------------------------------------------------

By default, Scrapy uses a `LIFO`_ queue for storing pending requests, which
basically means that it crawls in `DFO order`_. This order is more convenient
in most cases.

If you do want to crawl in true `BFO order`_, you can do it by
setting the following settings::

    DEPTH_PRIORITY = 1
    SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
    SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'

While pending requests are below the configured values of
:setting:`CONCURRENT_REQUESTS`, :setting:`CONCURRENT_REQUESTS_PER_DOMAIN` or
:setting:`CONCURRENT_REQUESTS_PER_IP`, those requests are sent
concurrently. As a result, the first few requests of a crawl rarely follow the
desired order. Lowering those settings to ``1`` enforces the desired order, but
it significantly slows down the crawl as a whole.
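
For example, to enforce strict ordering at the cost of crawl speed, you could
add something like this to your settings (a minimal sketch)::

    CONCURRENT_REQUESTS = 1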


My Scrapy crawler has memory leaks. What can I do?
---------------------------------------------------

See :ref:`topics-leaks`.

Also, Python has a built-in memory leak issue which is described in
:ref:`topics-leaks-without-leaks`.

How can I make Scrapy consume less memory?
------------------------------------------

See the previous question.

How can I prevent memory errors due to many allowed domains?
-------------------------------------------------------------

If you have a spider with a long list of
:attr:`~scrapy.spiders.Spider.allowed_domains` (e.g. 50,000+), consider
replacing the default
:class:`~scrapy.spidermiddlewares.offsite.OffsiteMiddleware` spider middleware
with a :ref:`custom spider middleware <custom-spider-middleware>` that requires
less memory. For example:

-   If your domain names are similar enough, write your own regular expression
    instead of joining all the strings in
    :attr:`~scrapy.spiders.Spider.allowed_domains` into one complex regular
    expression.

-   If you can `meet the installation requirements`_, use pyre2_ instead of
    Python's re_ to compile your URL-filtering regular expression. See
    :issue:`1908`.

See also other suggestions at `StackOverflow`_.
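
As a rough illustration, here is a minimal sketch of such a replacement
middleware; the module path, class name and regular expression are hypothetical
and should be adapted to your project::

    import re
    from urllib.parse import urlparse

    from scrapy.http import Request


    class CustomOffsiteMiddleware:
        # Hypothetical pattern covering e.g. shop1.example, shop2.example, ...
        allowed_hosts_re = re.compile(r"(^|\.)shop\d+\.example$")

        def process_spider_output(self, response, result, spider):
            for request_or_item in result:
                if isinstance(request_or_item, Request):
                    host = urlparse(request_or_item.url).hostname or ""
                    if not self.allowed_hosts_re.search(host):
                        continue  # drop offsite requests
                yield request_or_item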

.. note:: Remember to disable
   :class:`scrapy.spidermiddlewares.offsite.OffsiteMiddleware` when you enable
   your custom implementation::

       SPIDER_MIDDLEWARES = {
           'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
           'myproject.middlewares.CustomOffsiteMiddleware': 500,
       }

.. _meet the installation requirements: https://github.com/andreasvc/pyre2#installation
.. _pyre2: https://github.com/andreasvc/pyre2
.. _re: https://docs.python.org/library/re.html
.. _StackOverflow: https://stackoverflow.com/q/36440681/939364

Can I use Basic HTTP Authentication in my spiders?
---------------------------------------------------

Yes, see :class:`~scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware`.

Why does Scrapy download pages in English instead of my native language?
-------------------------------------------------------------------------

Try changing the default `Accept-Language`_ request header by overriding the
:setting:`DEFAULT_REQUEST_HEADERS` setting.
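
For example, to prefer Spanish content you could put something like this in your
project settings (the ``Accept`` value mirrors Scrapy's documented default; only
the language is changed)::

    DEFAULT_REQUEST_HEADERS = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "es",
    }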

.. _Accept-Language: https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.4

Where can I find some example Scrapy projects?
----------------------------------------------

See :ref:`intro-examples`.

Can I run a spider without creating a project?
----------------------------------------------

Yes. You can use the :command:`runspider` command. For example, if you have a
spider written in a ``my_spider.py`` file you can run it with::

    scrapy runspider my_spider.py

See the :command:`runspider` command for more info.

I get "Filtered offsite request" messages. How can I fix them?
---------------------------------------------------------------

Those messages (logged with ``DEBUG`` level) don't necessarily mean there is a
problem, so you may not need to fix them.

Those messages are logged by the Offsite Spider Middleware, which is a spider
middleware (enabled by default) whose purpose is to filter out requests to
domains outside the ones covered by the spider.

For more info see:
:class:`~scrapy.spidermiddlewares.offsite.OffsiteMiddleware`.

What is the recommended way to deploy a Scrapy crawler in production?
----------------------------------------------------------------------

See :ref:`topics-deploy`.

Can I use JSON for large exports?
---------------------------------

It'll depend on how large your output is. See :ref:`this warning
<json-with-large-data>` in the :class:`~scrapy.exporters.JsonItemExporter`
documentation.

Can I return (Twisted) deferreds from signal handlers?
-------------------------------------------------------

Some signals support returning deferreds from their handlers, others don't. See
the :ref:`topics-signals-ref` to know which ones.

What does the response status code 999 mean?
---------------------------------------------

999 is a custom response status code used by Yahoo sites to throttle requests.
Try slowing down the crawling speed by using a download delay of ``2`` (or
higher) in your spider::

    from scrapy.spiders import CrawlSpider


    class MySpider(CrawlSpider):

        name = 'myspider'

        download_delay = 2

        # [ ... rest of the spider code ... ]

Or by setting a global download delay in your project with the
:setting:`DOWNLOAD_DELAY` setting.
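
The equivalent project-wide configuration would be a single line in your
settings file::

    DOWNLOAD_DELAY = 2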

Can I call ``pdb.set_trace()`` from my spiders to debug them?
--------------------------------------------------------------

Yes, but you can also use the Scrapy shell which allows you to quickly analyze
(and even modify) the response being processed by your spider, which is, quite
often, more useful than plain old ``pdb.set_trace()``.
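
For instance, you can open a shell from inside a callback when some condition of
interest is met (the condition below is just an example)::

    def parse(self, response):
        if ".org" in response.url:
            from scrapy.shell import inspect_response
            inspect_response(response, self)
        # ... rest of the parsing code ...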

For more info see :ref:`topics-shell-inspect-response`.

Simplest way to dump all my scraped items into a JSON/CSV/XML file?
--------------------------------------------------------------------

To dump into a JSON file::

    scrapy crawl myspider -O items.json

To dump into a CSV file::

    scrapy crawl myspider -O items.csv

To dump into an XML file::

    scrapy crawl myspider -O items.xml

For more information see :ref:`topics-feed-exports`.

What's this huge cryptic ``__VIEWSTATE`` parameter used in some forms?
-----------------------------------------------------------------------

The ``__VIEWSTATE`` parameter is used in sites built with ASP.NET/VB.NET. For
more info on how it works see `this page`_. Also, here's an `example spider`_
which scrapes one of these sites.

.. _this page: https://metacpan.org/pod/release/ECARROLL/HTML-TreeBuilderX-ASP_NET-0.09/lib/HTML/TreeBuilderX/ASP_NET.pm
.. _example spider: https://github.com/AmbientLighter/rpn-fas/blob/master/fas/spiders/rnp.py

What's the best way to parse big XML/CSV data feeds?
-----------------------------------------------------

Parsing big feeds with XPath selectors can be problematic since they need to
build the DOM of the entire feed in memory, and this can be quite slow and
consume a lot of memory.

To avoid parsing the entire feed at once in memory, you can use
the functions ``xmliter`` and ``csviter`` from the ``scrapy.utils.iterators``
module. In fact, this is what the feed spiders (see :ref:`topics-spiders`) use
under the hood.
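
For illustration, here is a minimal sketch that iterates over the ``<product>``
nodes of a large XML feed with ``xmliter``; the feed URL, node name and fields
are made up::

    import scrapy
    from scrapy.utils.iterators import xmliter


    class ProductFeedSpider(scrapy.Spider):
        name = "product_feed"
        start_urls = ["https://example.com/products.xml"]

        def parse(self, response):
            # Iterate the feed node by node instead of building the whole DOM
            for product in xmliter(response, "product"):
                yield {
                    "id": product.xpath("./id/text()").get(),
                    "name": product.xpath("./name/text()").get(),
                }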

Does Scrapy manage cookies automatically?
-----------------------------------------

Yes, Scrapy receives and keeps track of cookies sent by servers, and sends them
back on subsequent requests, like any regular web browser does.

For more info see :ref:`topics-request-response` and :ref:`cookies-mw`.

How can I see the cookies being sent and received from Scrapy?
---------------------------------------------------------------

Enable the :setting:`COOKIES_DEBUG` setting.

How can I instruct a spider to stop itself?
-------------------------------------------

Raise the :exc:`~scrapy.exceptions.CloseSpider` exception from a callback. For
more info see: :exc:`~scrapy.exceptions.CloseSpider`.
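
For example, inside a spider callback (the stop condition is made up)::

    from scrapy.exceptions import CloseSpider

    def parse(self, response):
        if b"Bandwidth exceeded" in response.body:
            raise CloseSpider("bandwidth_exceeded")
        # ... normal parsing continues otherwise ...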

How can I prevent my Scrapy bot from getting banned?
-----------------------------------------------------

See :ref:`bans`.

Should I use spider arguments or settings to configure my spider?
------------------------------------------------------------------

Both :ref:`spider arguments <spiderargs>` and :ref:`settings <topics-settings>`
can be used to configure your spider. There is no strict rule that mandates
using one or the other, but settings are better suited for parameters that, once
set, don't change much, while spider arguments are meant to change more often,
even on each spider run, and are sometimes required for the spider to run at all
(for example, to set the start URL of a spider).

To illustrate with an example, assume you have a spider that needs to log
into a site to scrape data, and you only want to scrape data from a certain
section of the site (which varies each time). In that case, the credentials to
log in would be settings, while the URL of the section to scrape would be a
spider argument.
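
For instance, the section URL could be passed on the command line and picked up
in the spider's constructor, while the credentials would live in your settings
(all names below are illustrative)::

    # run with: scrapy crawl mysite -a section_url="https://example.com/news"
    import scrapy


    class MySiteSpider(scrapy.Spider):
        name = "mysite"

        def __init__(self, section_url=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.start_urls = [section_url]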

I'm scraping an XML document and my XPath selector doesn't return any items
-----------------------------------------------------------------------------

You may need to remove namespaces. See :ref:`removing-namespaces`.
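
For example, in a callback::

    def parse(self, response):
        response.selector.remove_namespaces()
        # Plain element names now work without namespace prefixes
        for link in response.xpath("//link/@href").getall():
            yield {"url": link}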

.. _faq-split-item:

How to split an item into multiple items in an item pipeline?
--------------------------------------------------------------

:ref:`Item pipelines <topics-item-pipeline>` cannot yield multiple items per
input item. :ref:`Create a spider middleware <custom-spider-middleware>`
instead, and use its
:meth:`~scrapy.spidermiddlewares.SpiderMiddleware.process_spider_output`
method for this purpose. For example::

    from copy import deepcopy

    from itemadapter import is_item, ItemAdapter


    class MultiplyItemsMiddleware:

        def process_spider_output(self, response, result, spider):
            for item in result:
                if is_item(item):
                    adapter = ItemAdapter(item)
                    for _ in range(adapter['multiply_by']):
                        yield deepcopy(item)
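
As with any other spider middleware, remember to enable it in your settings; the
module path and order value below are illustrative::

    SPIDER_MIDDLEWARES = {
        'myproject.middlewares.MultiplyItemsMiddleware': 543,
    }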

Does Scrapy support IPv6 addresses?
-----------------------------------

Yes, by setting :setting:`DNS_RESOLVER` to ``scrapy.resolver.CachingHostnameResolver``.
Note that by doing so, you lose the ability to set a specific timeout for DNS requests
(the value of the :setting:`DNS_TIMEOUT` setting is ignored).
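
That is, in your settings file::

    DNS_RESOLVER = "scrapy.resolver.CachingHostnameResolver"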


.. _faq-specific-reactor:

How to deal with ``<class 'ValueError'>: filedescriptor out of range in select()`` exceptions?
-----------------------------------------------------------------------------------------------

This issue `has been reported`_ to appear when running broad crawls on macOS, where the default
Twisted reactor is :class:`twisted.internet.selectreactor.SelectReactor`. Switching to a
different reactor is possible by using the :setting:`TWISTED_REACTOR` setting.
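
For example, to switch to the asyncio-based reactor, add this to your settings
(other reactors supported by your platform work as well)::

    TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"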


.. _faq-stop-response-download:

How can I cancel the download of a given response?
---------------------------------------------------

In some situations, it might be useful to stop the download of a certain response.
For instance, sometimes you can determine whether you need the full contents
of a response by inspecting its headers or the first bytes of its body. In that case,
you could save resources by attaching a handler to the :class:`~scrapy.signals.bytes_received`
or :class:`~scrapy.signals.headers_received` signals and raising a
:exc:`~scrapy.exceptions.StopDownload` exception. Please refer to the
:ref:`topics-stop-response-download` topic for additional information and examples.
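
For illustration, here is a minimal sketch that stops every download as soon as
its headers arrive, keeping the (empty-bodied) response; see the linked topic for
complete, authoritative examples::

    import scrapy
    from scrapy.exceptions import StopDownload


    class HeadersOnlySpider(scrapy.Spider):
        name = "headers_only"
        start_urls = ["https://example.com"]

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super().from_crawler(crawler, *args, **kwargs)
            crawler.signals.connect(
                spider.on_headers_received, signal=scrapy.signals.headers_received
            )
            return spider

        def on_headers_received(self, headers, body_length, request, spider):
            # Stop the download but still pass the response to the callback
            raise StopDownload(fail=False)

        def parse(self, response):
            yield {
                "url": response.url,
                "content_type": response.headers.get("Content-Type", b"").decode(),
            }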


.. _has been reported: https://github.com/scrapy/scrapy/issues/2905
.. _user agents: https://en.wikipedia.org/wiki/User_agent
.. _LIFO: https://en.wikipedia.org/wiki/Stack_(abstract_data_type)
.. _DFO order: https://en.wikipedia.org/wiki/Depth-first_search
.. _BFO order: https://en.wikipedia.org/wiki/Breadth-first_search