.. _topics-shell:

============
Scrapy shell
============

The Scrapy shell is an interactive shell where you can try out and debug your
scraping code very quickly, without having to run the spider. It's meant to be
used for testing data extraction code, but you can actually use it to test
any kind of code, as it is also a regular Python shell.

The shell is used for testing XPath or CSS expressions and seeing how they
work and what data they extract from the web pages you're trying to scrape.
It allows you to interactively test your expressions while you're writing
your spider, without having to run the spider to test every change.
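
For example, once a page is loaded in the shell you can try the same query as
both XPath and CSS (a quick sketch; the page title shown is hypothetical):

>>> response.xpath('//title/text()').get()
'Some page title'
>>> response.css('title::text').get()
'Some page title'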

Once you get familiar with the Scrapy shell, you'll see that it's an
invaluable tool for developing and debugging your spiders.

Configuring the shell
=====================

If you have `IPython`_ installed, the Scrapy shell will use it (instead of the
standard Python console). The `IPython`_ console is much more powerful and
provides smart auto-completion and colorized output, among other things.

We highly recommend you install `IPython`_, especially if you're working on
Unix systems (where `IPython`_ excels). See the `IPython installation guide`_
for more info.
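
In most environments, installing it is a single command (see the guide for
platform-specific details)::

    pip install ipython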

Scrapy also has support for `bpython`_, and will try to use it where
`IPython`_ is unavailable.

Through Scrapy's settings you can configure it to use any one of ``ipython``,
``bpython`` or the standard ``python`` shell, regardless of which are
installed. This is done by setting the ``SCRAPY_PYTHON_SHELL`` environment
variable; or by defining it in your :ref:`scrapy.cfg <topics-config-settings>`::

    [settings]
    shell = bpython
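
You can also set the environment variable just for a single run; for example,
to force the plain Python shell (a POSIX-shell sketch; adjust for your own
shell)::

    SCRAPY_PYTHON_SHELL=python scrapy shell 'https://scrapy.org'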

.. _IPython: https://ipython.org/
.. _IPython installation guide: https://ipython.org/install.html
.. _bpython: https://bpython-interpreter.org/

Launch the shell
================

To launch the Scrapy shell you can use the :command:`shell` command like
this::

    scrapy shell <url>

where ``<url>`` is the URL you want to scrape.

:command:`shell` also works for local files. This can be handy if you want
to play around with a local copy of a web page. :command:`shell` understands
the following syntaxes for local files::

    # UNIX-style
    scrapy shell ./path/to/file.html
    scrapy shell ../other/path/to/file.html
    scrapy shell /absolute/path/to/file.html

    # File URI
    scrapy shell file:///absolute/path/to/file.html

.. note:: When using relative file paths, be explicit and prepend them
    with ``./`` (or ``../`` when relevant).
    ``scrapy shell index.html`` will not work as one might expect (and
    this is by design, not a bug).

    Because :command:`shell` favors HTTP URLs over File URIs, and because
    ``index.html`` is syntactically similar to ``example.com``,
    :command:`shell` will treat ``index.html`` as a domain name and trigger
    a DNS lookup error::

        $ scrapy shell index.html
        [ ... scrapy shell starts ... ]
        [ ... traceback ... ]
        twisted.internet.error.DNSLookupError: DNS lookup failed:
        address 'index.html' not found: [Errno -5] No address associated with hostname.

    :command:`shell` does not check beforehand whether a file called
    ``index.html`` exists in the current directory. Again, be explicit.


Using the shell
===============

The Scrapy shell is just a regular Python console (or `IPython`_ console if
you have it available) which provides some additional shortcut functions for
convenience.

Available Shortcuts
-------------------

-   ``shelp()`` - print help with the list of available objects and
    shortcuts

-   ``fetch(url[, redirect=True])`` - fetch a new response from the given URL
    and update all related objects accordingly. You can optionally ask for
    HTTP 3xx redirects not to be followed by passing ``redirect=False``
    (see the example after this list)

-   ``fetch(request)`` - fetch a new response from the given request and update
    all related objects accordingly.

-   ``view(response)`` - open the given response in your local web browser, for
    inspection. This will add a `\<base\> tag`_ to the response body in order
    for external links (such as images and style sheets) to display properly.
    Note, however, that this will create a temporary file on your computer,
    which won't be removed automatically.

.. _<base> tag: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/base
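
For example, here is how you might take a quick look at a redirect response
without following it (a sketch; the URL is hypothetical and the exact status
code depends on the page):

>>> fetch('https://example.com/some/redirecting/page', redirect=False)
>>> response.status
302
>>> view(response)
True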

Available Scrapy objects
------------------------

The Scrapy shell automatically creates some convenient objects from the
downloaded page, like the :class:`~scrapy.http.Response` object and the
:class:`~scrapy.selector.Selector` objects (for both HTML and XML
content).

Those objects are:

-   ``crawler`` - the current :class:`~scrapy.crawler.Crawler` object.

-   ``spider`` - the Spider which is known to handle the URL, or a
    :class:`~scrapy.spiders.Spider` object if there is no spider found for
    the current URL.

-   ``request`` - a :class:`~scrapy.http.Request` object of the last fetched
    page. You can modify this request using
    :meth:`~scrapy.http.Request.replace` or fetch a new request (without
    leaving the shell) using the ``fetch`` shortcut; see the example after
    this list.

-   ``response`` - a :class:`~scrapy.http.Response` object containing the last
    fetched page.

-   ``settings`` - the current :ref:`Scrapy settings <topics-settings>`.
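
For example, after fetching a page you can tweak the last request and inspect
your settings (a sketch; the header and the user agent string shown are
illustrative, and the actual values depend on your project and Scrapy
version):

>>> request = request.replace(headers={"User-Agent": "my-test-agent"})
>>> fetch(request)
>>> settings.get("USER_AGENT")
'Scrapy/2.11.0 (+https://scrapy.org)'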

Example of shell session
========================

Here's an example of a typical shell session where we start by scraping the
https://scrapy.org page, and then proceed to scrape the https://old.reddit.com/
page. Finally, we modify the (Reddit) request method to POST and re-fetch it,
getting an error. We end the session by typing Ctrl-D (on Unix systems) or
Ctrl-Z on Windows.

Keep in mind that the data extracted here may not be the same when you try it,
as those pages are not static and could have changed by the time you test this.
The only purpose of this example is to familiarize you with how the Scrapy
shell works.

First, we launch the shell::

    scrapy shell 'https://scrapy.org' --nolog

.. note::

   Remember to always enclose URLs in quotes when running the Scrapy shell from
   the command line; otherwise URLs containing arguments (e.g. the ``&``
   character) will not work.

   On Windows, use double quotes instead::

       scrapy shell "https://scrapy.org" --nolog


Then, the shell fetches the URL (using the Scrapy downloader) and prints the
list of available objects and useful shortcuts (you'll notice that these lines
all start with the ``[s]`` prefix)::

    [s] Available Scrapy objects:
    [s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
    [s]   crawler    <scrapy.crawler.Crawler object at 0x7f07395dd690>
    [s]   item       {}
    [s]   request    <GET https://scrapy.org>
    [s]   response   <200 https://scrapy.org/>
    [s]   settings   <scrapy.settings.Settings object at 0x7f07395dd710>
    [s]   spider     <DefaultSpider 'default' at 0x7f0735891690>
    [s] Useful shortcuts:
    [s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
    [s]   fetch(req)                  Fetch a scrapy.Request and update local objects
    [s]   shelp()           Shell help (print this help)
    [s]   view(response)    View response in a browser

    >>>


After that, we can start playing with the objects:

>>> response.xpath('//title/text()').get()
'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'

>>> fetch("https://old.reddit.com/")

>>> response.xpath('//title/text()').get()
'reddit: the front page of the internet'

>>> request = request.replace(method="POST")

>>> fetch(request)

>>> response.status
404

>>> from pprint import pprint

>>> pprint(response.headers)
{'Accept-Ranges': ['bytes'],
 'Cache-Control': ['max-age=0, must-revalidate'],
 'Content-Type': ['text/html; charset=UTF-8'],
 'Date': ['Thu, 08 Dec 2016 16:21:19 GMT'],
 'Server': ['snooserv'],
 'Set-Cookie': ['loid=KqNLou0V9SKMX4qb4n; Domain=reddit.com; Max-Age=63071999; Path=/; expires=Sat, 08-Dec-2018 16:21:19 GMT; secure',
                'loidcreated=2016-12-08T16%3A21%3A19.445Z; Domain=reddit.com; Max-Age=63071999; Path=/; expires=Sat, 08-Dec-2018 16:21:19 GMT; secure',
                'loid=vi0ZVe4NkxNWdlH7r7; Domain=reddit.com; Max-Age=63071999; Path=/; expires=Sat, 08-Dec-2018 16:21:19 GMT; secure',
                'loidcreated=2016-12-08T16%3A21%3A19.459Z; Domain=reddit.com; Max-Age=63071999; Path=/; expires=Sat, 08-Dec-2018 16:21:19 GMT; secure'],
 'Vary': ['accept-encoding'],
 'Via': ['1.1 varnish'],
 'X-Cache': ['MISS'],
 'X-Cache-Hits': ['0'],
 'X-Content-Type-Options': ['nosniff'],
 'X-Frame-Options': ['SAMEORIGIN'],
 'X-Moose': ['majestic'],
 'X-Served-By': ['cache-cdg8730-CDG'],
 'X-Timer': ['S1481214079.394283,VS0,VE159'],
 'X-Ua-Compatible': ['IE=edge'],
 'X-Xss-Protection': ['1; mode=block']}


.. _topics-shell-inspect-response:

Invoking the shell from spiders to inspect responses
====================================================

Sometimes you want to inspect the responses that are being processed at a
certain point of your spider, if only to check that the response you expect
is getting there.

This can be achieved by using the ``scrapy.shell.inspect_response`` function.

Here's an example of how you would call it from your spider::

    import scrapy


    class MySpider(scrapy.Spider):
        name = "myspider"
        start_urls = [
            "http://example.com",
            "http://example.org",
            "http://example.net",
        ]

        def parse(self, response):
            # We want to inspect one specific response.
            if ".org" in response.url:
                from scrapy.shell import inspect_response
                inspect_response(response, self)

            # Rest of parsing code.

When you run the spider, you will get something similar to this::

    2014-01-23 17:48:31-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.com> (referer: None)
    2014-01-23 17:48:31-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.org> (referer: None)
    [s] Available Scrapy objects:
    [s]   crawler    <scrapy.crawler.Crawler object at 0x1e16b50>
    ...

    >>> response.url
    'http://example.org'

Then, you can check if the extraction code is working:

>>> response.xpath('//h1[@class="fn"]')
[]

Nope, it doesn't. So you can open the response in your web browser and see if
it's the response you were expecting:

>>> view(response)
True

Finally, you hit Ctrl-D (or Ctrl-Z on Windows) to exit the shell and resume
the crawl::

    >>> ^D
    2014-01-23 17:50:03-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.net> (referer: None)
    ...

Note that you can't use the ``fetch`` shortcut here, since the Scrapy engine
is blocked by the shell. However, after you leave the shell, the spider will
continue crawling where it stopped, as shown above.
