.. _faq:

Frequently Asked Questions
==========================

.. _faq-scrapy-bs-cmp:

How does Scrapy compare to BeautifulSoup or lxml?
-------------------------------------------------

`BeautifulSoup`_ and `lxml`_ are libraries for parsing HTML and XML. Scrapy is
an application framework for writing web spiders that crawl web sites and
extract data from them.

Scrapy provides a built-in mechanism for extracting data (called
:ref:`selectors <topics-selectors>`) but you can easily use `BeautifulSoup`_
(or `lxml`_) instead, if you feel more comfortable working with them. After
all, they're just parsing libraries which can be imported and used from any
Python code.

In other words, comparing `BeautifulSoup`_ (or `lxml`_) to Scrapy is like
comparing `jinja2`_ to `Django`_.

.. _BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/
.. _lxml: https://lxml.de/
.. _jinja2: https://palletsprojects.com/p/jinja/
.. _Django: https://www.djangoproject.com/

Can I use Scrapy with BeautifulSoup?
------------------------------------

Yes, you can.
As mentioned :ref:`above <faq-scrapy-bs-cmp>`, `BeautifulSoup`_ can be used
for parsing HTML responses in Scrapy callbacks.
You just have to feed the response's body into a ``BeautifulSoup`` object
and extract whatever data you need from it.

Here's an example spider using the BeautifulSoup API, with ``lxml`` as the
HTML parser::

    from bs4 import BeautifulSoup
    import scrapy


    class ExampleSpider(scrapy.Spider):
        name = "example"
        allowed_domains = ["example.com"]
        start_urls = (
            "http://www.example.com/",
        )

        def parse(self, response):
            # use lxml to get decent HTML parsing speed
            soup = BeautifulSoup(response.text, "lxml")
            yield {
                "url": response.url,
                "title": soup.h1.string,
            }

.. note::

    ``BeautifulSoup`` supports several HTML/XML parsers.
    See `BeautifulSoup's official documentation`_ on which ones are available.

.. _BeautifulSoup's official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use


Did Scrapy "steal" X from Django?
---------------------------------

Probably, but we don't like that word. We think Django_ is a great open source
project and an example to follow, so we've used it as an inspiration for
Scrapy.

We believe that, if something is already done well, there's no need to reinvent
it. This concept, besides being one of the foundations for open source and free
software, not only applies to software but also to documentation, procedures,
policies, etc. So, instead of going through each problem ourselves, we choose
to copy ideas from those projects that have already solved them properly, and
focus on the real problems we need to solve.

We'd be proud if Scrapy serves as an inspiration for other projects. Feel free
to steal from us!

Does Scrapy work with HTTP proxies?
-----------------------------------

Yes. Support for HTTP proxies is provided (since Scrapy 0.8) through the HTTP
Proxy downloader middleware. See
:class:`~scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware`.

How can I scrape an item with attributes in different pages?
------------------------------------------------------------

See :ref:`topics-request-response-ref-request-callback-arguments`.
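
In short, you can carry a partially populated item from one callback to the
next through the ``cb_kwargs`` argument of :class:`~scrapy.Request`. Here is
a minimal sketch; the spider name, URLs and CSS selectors are made up for
illustration::

    import scrapy


    class ProductSpider(scrapy.Spider):
        name = "products"
        start_urls = ["https://www.example.com/products"]

        def parse(self, response):
            for product in response.css("div.product"):
                # collect the attributes available on the listing page
                item = {"name": product.css("h2::text").get()}
                details_url = product.css("a::attr(href)").get()
                # hand the partial item over to the next callback
                yield response.follow(
                    details_url,
                    callback=self.parse_details,
                    cb_kwargs={"item": item},
                )

        def parse_details(self, response, item):
            # complete the item with attributes from the details page
            item["description"] = response.css("#description::text").get()
            yield item
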
Scrapy crashes with: ImportError: No module named win32api
----------------------------------------------------------

You need to install `pywin32`_ because of `this Twisted bug`_.

.. _pywin32: https://sourceforge.net/projects/pywin32/
.. _this Twisted bug: https://twistedmatrix.com/trac/ticket/3707

How can I simulate a user login in my spider?
---------------------------------------------

See :ref:`topics-request-response-ref-request-userlogin`.

.. _faq-bfo-dfo:

Does Scrapy crawl in breadth-first or depth-first order?
--------------------------------------------------------

By default, Scrapy uses a `LIFO`_ queue for storing pending requests, which
basically means that it crawls in `DFO order`_. This order is more convenient
in most cases.

If you do want to crawl in true `BFO order`_, you can do it by
setting the following settings::

    DEPTH_PRIORITY = 1
    SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
    SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'

While pending requests are below the configured values of
:setting:`CONCURRENT_REQUESTS`, :setting:`CONCURRENT_REQUESTS_PER_DOMAIN` or
:setting:`CONCURRENT_REQUESTS_PER_IP`, those requests are sent
concurrently. As a result, the first few requests of a crawl rarely follow the
desired order. Lowering those settings to ``1`` enforces the desired order, but
it significantly slows down the crawl as a whole.

.. _LIFO: https://en.wikipedia.org/wiki/Stack_(abstract_data_type)
.. _DFO order: https://en.wikipedia.org/wiki/Depth-first_search
.. _BFO order: https://en.wikipedia.org/wiki/Breadth-first_search


My Scrapy crawler has memory leaks. What can I do?
--------------------------------------------------

See :ref:`topics-leaks`.

Also, Python has a built-in memory leak issue which is described in
:ref:`topics-leaks-without-leaks`.

How can I make Scrapy consume less memory?
------------------------------------------

See the previous question.

How can I prevent memory errors due to many allowed domains?
------------------------------------------------------------

If you have a spider with a long list of
:attr:`~scrapy.spiders.Spider.allowed_domains` (e.g. 50,000+), consider
replacing the default
:class:`~scrapy.spidermiddlewares.offsite.OffsiteMiddleware` spider middleware
with a :ref:`custom spider middleware <custom-spider-middleware>` that requires
less memory. For example:

- If your domain names are similar enough, use your own regular expression
  instead of joining the strings in
  :attr:`~scrapy.spiders.Spider.allowed_domains` into a complex regular
  expression (see the sketch below).

- If you can `meet the installation requirements`_, use pyre2_ instead of
  Python’s re_ to compile your URL-filtering regular expression. See
  :issue:`1908`.

See also other suggestions at `StackOverflow`_.

.. note:: Remember to disable
    :class:`scrapy.spidermiddlewares.offsite.OffsiteMiddleware` when you enable
    your custom implementation::

        SPIDER_MIDDLEWARES = {
            'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
            'myproject.middlewares.CustomOffsiteMiddleware': 500,
        }

.. _meet the installation requirements: https://github.com/andreasvc/pyre2#installation
.. _pyre2: https://github.com/andreasvc/pyre2
.. _re: https://docs.python.org/library/re.html
.. _StackOverflow: https://stackoverflow.com/q/36440681/939364
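
A minimal sketch of such a custom middleware, assuming your allowed domains
all match a single pattern (the pattern and the ``myproject.middlewares``
module path above are hypothetical and need adjusting to your project)::

    import re

    from scrapy import Request


    class CustomOffsiteMiddleware:
        # one compiled pattern instead of 50,000+ joined domain strings
        # (hypothetical pattern; adapt it to your own domain naming scheme)
        ALLOWED = re.compile(r"^https?://shop\d+\.example\.com(/|$)")

        def process_spider_output(self, response, result, spider):
            for request_or_item in result:
                if isinstance(request_or_item, Request) and not self.ALLOWED.match(
                    request_or_item.url
                ):
                    continue  # drop off-site requests, keep everything else
                yield request_or_item
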
Can I use Basic HTTP Authentication in my spiders?
--------------------------------------------------

Yes, see :class:`~scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware`.

Why does Scrapy download pages in English instead of my native language?
------------------------------------------------------------------------

Try changing the default `Accept-Language`_ request header by overriding the
:setting:`DEFAULT_REQUEST_HEADERS` setting.

.. _Accept-Language: https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.4

Where can I find some example Scrapy projects?
----------------------------------------------

See :ref:`intro-examples`.

Can I run a spider without creating a project?
----------------------------------------------

Yes. You can use the :command:`runspider` command. For example, if you have a
spider written in a ``my_spider.py`` file you can run it with::

    scrapy runspider my_spider.py

See the :command:`runspider` command for more info.

I get "Filtered offsite request" messages. How can I fix them?
--------------------------------------------------------------

Those messages (logged with ``DEBUG`` level) don't necessarily mean there is a
problem, so you may not need to fix them.

Those messages are logged by the Offsite Spider Middleware, which is a spider
middleware (enabled by default) whose purpose is to filter out requests to
domains outside the ones covered by the spider.

For more info see:
:class:`~scrapy.spidermiddlewares.offsite.OffsiteMiddleware`.

What is the recommended way to deploy a Scrapy crawler in production?
---------------------------------------------------------------------

See :ref:`topics-deploy`.

Can I use JSON for large exports?
---------------------------------

It depends on how large your output is. See :ref:`this warning
<json-with-large-data>` in the :class:`~scrapy.exporters.JsonItemExporter`
documentation.

Can I return (Twisted) deferreds from signal handlers?
------------------------------------------------------

Some signals support returning deferreds from their handlers, others don't. See
:ref:`topics-signals-ref` to find out which ones.

What does the response status code 999 mean?
---------------------------------------------

999 is a custom response status code used by Yahoo sites to throttle requests.
Try slowing down the crawling speed by using a download delay of ``2`` (or
higher) in your spider::

    from scrapy.spiders import CrawlSpider


    class MySpider(CrawlSpider):

        name = 'myspider'

        download_delay = 2

        # [ ... rest of the spider code ... ]

Or by setting a global download delay in your project with the
:setting:`DOWNLOAD_DELAY` setting.

Can I call ``pdb.set_trace()`` from my spiders to debug them?
-------------------------------------------------------------

Yes, but you can also use the Scrapy shell, which allows you to quickly analyze
(and even modify) the response being processed by your spider, which is, quite
often, more useful than plain old ``pdb.set_trace()``.

For more info see :ref:`topics-shell-inspect-response`.
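
For instance, here is a small sketch of dropping into the shell from a
callback, in the spirit of :ref:`topics-shell-inspect-response` (the 404
check is just an illustrative trigger condition)::

    def parse(self, response):
        if response.status == 404:
            # open an interactive shell with this response loaded, so you
            # can try selectors against it before fixing the spider
            from scrapy.shell import inspect_response

            inspect_response(response, self)
        # ... normal parsing continues here ...
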
Simplest way to dump all my scraped items into a JSON/CSV/XML file?
-------------------------------------------------------------------

To dump into a JSON file::

    scrapy crawl myspider -O items.json

To dump into a CSV file::

    scrapy crawl myspider -O items.csv

To dump into an XML file::

    scrapy crawl myspider -O items.xml

For more information see :ref:`topics-feed-exports`.

What's this huge cryptic ``__VIEWSTATE`` parameter used in some forms?
----------------------------------------------------------------------

The ``__VIEWSTATE`` parameter is used in sites built with ASP.NET/VB.NET. For
more info on how it works see `this page`_. Also, here's an `example spider`_
which scrapes one of these sites.

.. _this page: https://metacpan.org/pod/release/ECARROLL/HTML-TreeBuilderX-ASP_NET-0.09/lib/HTML/TreeBuilderX/ASP_NET.pm
.. _example spider: https://github.com/AmbientLighter/rpn-fas/blob/master/fas/spiders/rnp.py

What's the best way to parse big XML/CSV data feeds?
----------------------------------------------------

Parsing big feeds with XPath selectors can be problematic since they need to
build the DOM of the entire feed in memory, and this can be quite slow and
consume a lot of memory.

To avoid parsing the entire feed in memory at once, you can use the
``xmliter`` and ``csviter`` functions from the ``scrapy.utils.iterators``
module. In fact, this is what the feed spiders (see :ref:`topics-spiders`) use
under the hood.

Does Scrapy manage cookies automatically?
-----------------------------------------

Yes, Scrapy receives and keeps track of cookies sent by servers, and sends them
back on subsequent requests, like any regular web browser does.

For more info see :ref:`topics-request-response` and :ref:`cookies-mw`.

How can I see the cookies being sent and received from Scrapy?
--------------------------------------------------------------

Enable the :setting:`COOKIES_DEBUG` setting.

How can I instruct a spider to stop itself?
-------------------------------------------

Raise the :exc:`~scrapy.exceptions.CloseSpider` exception from a callback. For
more info see: :exc:`~scrapy.exceptions.CloseSpider`.

How can I prevent my Scrapy bot from getting banned?
----------------------------------------------------

See :ref:`bans`.

Should I use spider arguments or settings to configure my spider?
-----------------------------------------------------------------

Both :ref:`spider arguments <spiderargs>` and :ref:`settings <topics-settings>`
can be used to configure your spider. There is no strict rule that mandates
using one or the other, but settings are better suited for parameters that,
once set, don't change much, while spider arguments are meant to change more
often, even on each spider run, and sometimes they are required for the spider
to run at all (for example, to set the start URL of a spider).

To illustrate, suppose you have a spider that needs to log into a site to
scrape data, and you only want to scrape data from a certain section of the
site (which varies each time). In that case, the credentials to log in would
be settings, while the URL of the section to scrape would be a spider
argument.
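
A rough sketch of that split; the spider name, URL and the ``SITE_USER`` /
``SITE_PASSWORD`` setting names are made up for illustration::

    import scrapy


    class SectionSpider(scrapy.Spider):
        name = "section"

        def __init__(self, section=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # the section changes on every run, so it is a spider argument
            self.start_urls = [f"https://www.example.com/{section}/"]

        def parse(self, response):
            # the credentials rarely change, so they live in settings
            user = self.settings.get("SITE_USER")
            password = self.settings.get("SITE_PASSWORD")
            ...

You would then run it as ``scrapy crawl section -a section=news``, with the
credentials defined in the project settings.
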
I'm scraping an XML document and my XPath selector doesn't return any items
----------------------------------------------------------------------------

You may need to remove namespaces. See :ref:`removing-namespaces`.

.. _faq-split-item:

How to split an item into multiple items in an item pipeline?
-------------------------------------------------------------

:ref:`Item pipelines <topics-item-pipeline>` cannot yield multiple items per
input item. :ref:`Create a spider middleware <custom-spider-middleware>`
instead, and use its
:meth:`~scrapy.spidermiddlewares.SpiderMiddleware.process_spider_output`
method for this purpose. For example::

    from copy import deepcopy

    from itemadapter import is_item, ItemAdapter


    class MultiplyItemsMiddleware:

        def process_spider_output(self, response, result, spider):
            for item in result:
                if is_item(item):
                    adapter = ItemAdapter(item)
                    for _ in range(adapter['multiply_by']):
                        yield deepcopy(item)

Does Scrapy support IPv6 addresses?
-----------------------------------

Yes, by setting :setting:`DNS_RESOLVER` to ``scrapy.resolver.CachingHostnameResolver``.
Note that by doing so, you lose the ability to set a specific timeout for DNS requests
(the value of the :setting:`DNS_TIMEOUT` setting is ignored).


.. _faq-specific-reactor:

How to deal with ``<class 'ValueError'>: filedescriptor out of range in select()`` exceptions?
-----------------------------------------------------------------------------------------------

This issue `has been reported`_ to appear when running broad crawls on macOS, where the default
Twisted reactor is :class:`twisted.internet.selectreactor.SelectReactor`. Switching to a
different reactor is possible by using the :setting:`TWISTED_REACTOR` setting.

.. _has been reported: https://github.com/scrapy/scrapy/issues/2905


.. _faq-stop-response-download:

How can I cancel the download of a given response?
--------------------------------------------------

In some situations, it might be useful to stop the download of a certain response.
For instance, sometimes you can determine whether or not you need the full contents
of a response by inspecting its headers or the first bytes of its body. In that case,
you could save resources by attaching a handler to the :class:`~scrapy.signals.bytes_received`
or :class:`~scrapy.signals.headers_received` signals and raising a
:exc:`~scrapy.exceptions.StopDownload` exception. Please refer to the
:ref:`topics-stop-response-download` topic for additional information and examples.
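
A minimal sketch of the ``headers_received`` variant, following the
signal-connection pattern described in :ref:`topics-stop-response-download`
(the spider name and URL are made up for illustration)::

    import scrapy
    from scrapy import signals
    from scrapy.exceptions import StopDownload


    class HeadersOnlySpider(scrapy.Spider):
        name = "headers_only"
        start_urls = ["https://www.example.com/"]

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super().from_crawler(crawler, *args, **kwargs)
            crawler.signals.connect(
                spider.on_headers_received, signal=signals.headers_received
            )
            return spider

        def on_headers_received(self, headers, body_length, request, spider):
            # stop the download, but still pass the (empty-bodied) response
            # to the callback instead of treating it as an error
            raise StopDownload(fail=False)

        def parse(self, response):
            yield {"content_type": response.headers.get("Content-Type")}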