.. _topics-shell:

============
Scrapy shell
============

The Scrapy shell is an interactive shell where you can try and debug your
scraping code very quickly, without having to run the spider. It's meant to
be used for testing data extraction code, but you can actually use it for
testing any kind of code, as it is also a regular Python shell.

The shell is used for testing XPath or CSS expressions and seeing how they
work and what data they extract from the web pages you're trying to scrape.
It allows you to interactively test your expressions while you're writing
your spider, without having to run the spider to test every change.

Once you get familiar with the Scrapy shell, you'll see that it's an
invaluable tool for developing and debugging your spiders.

Configuring the shell
=====================

If you have `IPython`_ installed, the Scrapy shell will use it (instead of
the standard Python console). The `IPython`_ console is much more powerful
and provides smart auto-completion and colorized output, among other things.

We highly recommend you install `IPython`_, especially if you're working on
Unix systems (where `IPython`_ excels). See the `IPython installation guide`_
for more info.

Scrapy also has support for `bpython`_, and will try to use it where
`IPython`_ is unavailable.

Through Scrapy's settings you can configure it to use any one of
``ipython``, ``bpython`` or the standard ``python`` shell, regardless of
which are installed. This is done by setting the ``SCRAPY_PYTHON_SHELL``
environment variable, or by defining it in your
:ref:`scrapy.cfg <topics-config-settings>`::

    [settings]
    shell = bpython

.. _IPython: https://ipython.org/
.. _IPython installation guide: https://ipython.org/install.html
.. _bpython: https://bpython-interpreter.org/

Launch the shell
================

To launch the Scrapy shell you can use the :command:`shell` command like
this::

    scrapy shell <url>

where ``<url>`` is the URL you want to scrape.

:command:`shell` also works for local files. This can be handy if you want
to play around with a local copy of a web page. :command:`shell` understands
the following syntaxes for local files::

    # UNIX-style
    scrapy shell ./path/to/file.html
    scrapy shell ../other/path/to/file.html
    scrapy shell /absolute/path/to/file.html

    # File URI
    scrapy shell file:///absolute/path/to/file.html

.. note:: When using relative file paths, be explicit and prepend them
    with ``./`` (or ``../`` when relevant).
    ``scrapy shell index.html`` will not work as one might expect (and
    this is by design, not a bug).

    Because :command:`shell` favors HTTP URLs over File URIs, and
    ``index.html`` is syntactically similar to ``example.com``,
    :command:`shell` will treat ``index.html`` as a domain name and trigger
    a DNS lookup error::

        $ scrapy shell index.html
        [ ... scrapy shell starts ... ]
        [ ... traceback ... ]
        twisted.internet.error.DNSLookupError: DNS lookup failed:
        address 'index.html' not found: [Errno -5] No address associated with hostname.

    :command:`shell` will not test beforehand if a file called
    ``index.html`` exists in the current directory. Again, be explicit.


Using the shell
===============

The Scrapy shell is just a regular Python console (or `IPython`_ console if
you have it available) which provides some additional shortcut functions for
convenience.
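
Because the shell is an ordinary Python console, you can freely mix standard
Python with the Scrapy objects described below. As a small illustration, here
is a minimal sketch that resolves extracted links against the page URL; it
assumes a page has already been fetched, and the selector is illustrative::

    >>> from urllib.parse import urljoin
    >>> hrefs = response.css("a::attr(href)").getall()  # may be relative URLs
    >>> [urljoin(response.url, h) for h in hrefs[:3]]   # resolve against the page URL
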
Available Shortcuts
-------------------

- ``shelp()`` - print help with the list of available objects and
  shortcuts

- ``fetch(url[, redirect=True])`` - fetch a new response from the given URL
  and update all related objects accordingly. You can optionally ask for
  HTTP 3xx redirections to not be followed by passing ``redirect=False``
  (see the sketch below, after the list of Scrapy objects)

- ``fetch(request)`` - fetch a new response from the given request and
  update all related objects accordingly.

- ``view(response)`` - open the given response in your local web browser,
  for inspection. This will add a `\<base\> tag`_ to the response body in
  order for external links (such as images and style sheets) to display
  properly. Note, however, that this will create a temporary file on your
  computer, which won't be removed automatically.

.. _<base> tag: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/base

Available Scrapy objects
------------------------

The Scrapy shell automatically creates some convenient objects from the
downloaded page, like the :class:`~scrapy.http.Response` object and the
:class:`~scrapy.selector.Selector` objects (for both HTML and XML
content).

Those objects are:

- ``crawler`` - the current :class:`~scrapy.crawler.Crawler` object.

- ``spider`` - the Spider which is known to handle the URL, or a
  :class:`~scrapy.spiders.Spider` object if there is no spider found for
  the current URL.

- ``request`` - a :class:`~scrapy.http.Request` object of the last fetched
  page. You can modify this request using
  :meth:`~scrapy.http.Request.replace` or fetch a new request (without
  leaving the shell) using the ``fetch`` shortcut.

- ``response`` - a :class:`~scrapy.http.Response` object containing the
  last fetched page.

- ``settings`` - the current :ref:`Scrapy settings <topics-settings>`.
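
For instance, the ``redirect`` flag of ``fetch`` lets you inspect a 3xx
response itself instead of the page it redirects to. A minimal sketch,
assuming a shell session is already running and the (hypothetical) URL
returns a redirect::

    >>> fetch("https://example.com/some-redirecting-url", redirect=False)
    >>> response.status               # e.g. 301 or 302, not the final page
    >>> response.headers["Location"]  # where the redirect points
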
Example of shell session
========================

Here's an example of a typical shell session where we start by scraping the
https://scrapy.org page, and then proceed to scrape the
https://old.reddit.com/ page. Finally, we modify the (Reddit) request method
to POST and re-fetch it, getting an error. We end the session by typing
Ctrl-D (on Unix systems) or Ctrl-Z (on Windows).

Keep in mind that the data extracted here may not be the same when you try
it, as those pages are not static and could have changed by the time you
test this. The only purpose of this example is to get you familiarized with
how the Scrapy shell works.

First, we launch the shell::

    scrapy shell 'https://scrapy.org' --nolog

.. note::

    Remember to always enclose URLs in quotes when running the Scrapy shell
    from the command line; otherwise URLs containing arguments (i.e. the
    ``&`` character) will not work.

    On Windows, use double quotes instead::

        scrapy shell "https://scrapy.org" --nolog


Then, the shell fetches the URL (using the Scrapy downloader) and prints the
list of available objects and useful shortcuts (you'll notice that these
lines all start with the ``[s]`` prefix)::

    [s] Available Scrapy objects:
    [s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
    [s]   crawler    <scrapy.crawler.Crawler object at 0x7f07395dd690>
    [s]   item       {}
    [s]   request    <GET https://scrapy.org>
    [s]   response   <200 https://scrapy.org/>
    [s]   settings   <scrapy.settings.Settings object at 0x7f07395dd710>
    [s]   spider     <DefaultSpider 'default' at 0x7f0735891690>
    [s] Useful shortcuts:
    [s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
    [s]   fetch(req)                  Fetch a scrapy.Request and update local objects
    [s]   shelp()                     Shell help (print this help)
    [s]   view(response)              View response in a browser

    >>>


After that, we can start playing with the objects:

>>> response.xpath('//title/text()').get()
'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'

>>> fetch("https://old.reddit.com/")

>>> response.xpath('//title/text()').get()
'reddit: the front page of the internet'

>>> request = request.replace(method="POST")

>>> fetch(request)

>>> response.status
404

>>> from pprint import pprint

>>> pprint(response.headers)
{'Accept-Ranges': ['bytes'],
 'Cache-Control': ['max-age=0, must-revalidate'],
 'Content-Type': ['text/html; charset=UTF-8'],
 'Date': ['Thu, 08 Dec 2016 16:21:19 GMT'],
 'Server': ['snooserv'],
 'Set-Cookie': ['loid=KqNLou0V9SKMX4qb4n; Domain=reddit.com; Max-Age=63071999; Path=/; expires=Sat, 08-Dec-2018 16:21:19 GMT; secure',
                'loidcreated=2016-12-08T16%3A21%3A19.445Z; Domain=reddit.com; Max-Age=63071999; Path=/; expires=Sat, 08-Dec-2018 16:21:19 GMT; secure',
                'loid=vi0ZVe4NkxNWdlH7r7; Domain=reddit.com; Max-Age=63071999; Path=/; expires=Sat, 08-Dec-2018 16:21:19 GMT; secure',
                'loidcreated=2016-12-08T16%3A21%3A19.459Z; Domain=reddit.com; Max-Age=63071999; Path=/; expires=Sat, 08-Dec-2018 16:21:19 GMT; secure'],
 'Vary': ['accept-encoding'],
 'Via': ['1.1 varnish'],
 'X-Cache': ['MISS'],
 'X-Cache-Hits': ['0'],
 'X-Content-Type-Options': ['nosniff'],
 'X-Frame-Options': ['SAMEORIGIN'],
 'X-Moose': ['majestic'],
 'X-Served-By': ['cache-cdg8730-CDG'],
 'X-Timer': ['S1481214079.394283,VS0,VE159'],
 'X-Ua-Compatible': ['IE=edge'],
 'X-Xss-Protection': ['1; mode=block']}
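
From the same session you can also query the live crawler configuration
through the ``settings`` object. A minimal sketch (``USER_AGENT`` and
``ROBOTSTXT_OBEY`` are standard Scrapy settings; their values depend on
your project)::

    >>> settings.get("USER_AGENT")          # the User-Agent the downloader sends
    >>> settings.getbool("ROBOTSTXT_OBEY")  # whether robots.txt rules are respected
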
.. _topics-shell-inspect-response:

Invoking the shell from spiders to inspect responses
====================================================

Sometimes you want to inspect the responses that are being processed at a
certain point of your spider, if only to check that the response you expect
is getting there.

This can be achieved by using the ``scrapy.shell.inspect_response`` function.

Here's an example of how you would call it from your spider::

    import scrapy


    class MySpider(scrapy.Spider):
        name = "myspider"
        start_urls = [
            "http://example.com",
            "http://example.org",
            "http://example.net",
        ]

        def parse(self, response):
            # We want to inspect one specific response.
            if ".org" in response.url:
                from scrapy.shell import inspect_response
                inspect_response(response, self)

            # Rest of parsing code.

When you run the spider, you will get something similar to this::

    2014-01-23 17:48:31-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.com> (referer: None)
    2014-01-23 17:48:31-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.org> (referer: None)
    [s] Available Scrapy objects:
    [s]   crawler    <scrapy.crawler.Crawler object at 0x1e16b50>
    ...

    >>> response.url
    'http://example.org'

Then, you can check if the extraction code is working:

>>> response.xpath('//h1[@class="fn"]')
[]

Nope, it doesn't. So you can open the response in your web browser and see
if it's the response you were expecting:

>>> view(response)
True

Finally, you hit Ctrl-D (or Ctrl-Z on Windows) to exit the shell and resume
the crawling::

    >>> ^D
    2014-01-23 17:50:03-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.net> (referer: None)
    ...

Note that you can't use the ``fetch`` shortcut here, since the Scrapy engine
is blocked by the shell. However, after you leave the shell, the spider will
continue crawling where it stopped, as shown above.
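
A common variation is to open the shell only when extraction fails, rather
than on a URL match. Here is a minimal sketch of such a ``parse`` method
(the selector and item field are illustrative)::

    def parse(self, response):
        title = response.css("h1::text").get()
        if title is None:
            # Extraction came up empty; drop into the shell to debug
            # this response interactively.
            from scrapy.shell import inspect_response
            inspect_response(response, self)
        else:
            yield {"title": title}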