.. _urllib-howto:

***********************************************************
  HOWTO Fetch Internet Resources Using The urllib Package
***********************************************************

:Author: `Michael Foord <http://www.voidspace.org.uk/python/index.shtml>`_

.. note::

    There is a French translation of an earlier revision of this
    HOWTO, available at `urllib2 - Le Manuel manquant
    <http://www.voidspace.org.uk/python/articles/urllib2_francais.shtml>`_.



Introduction
============

.. sidebar:: Related Articles

    You may also find useful the following article on fetching web resources
    with Python:

    * `Basic Authentication <http://www.voidspace.org.uk/python/articles/authentication.shtml>`_

        A tutorial on *Basic Authentication*, with examples in Python.

**urllib.request** is a Python module for fetching URLs
(Uniform Resource Locators). It offers a very simple interface, in the form of
the *urlopen* function. This is capable of fetching URLs using a variety of
different protocols. It also offers a slightly more complex interface for
handling common situations - like basic authentication, cookies, proxies and so
on. These are provided by objects called handlers and openers.

urllib.request supports fetching URLs for many "URL schemes" (identified by the string
before the ``":"`` in the URL - for example ``"ftp"`` is the URL scheme of
``"ftp://python.org/"``) using their associated network protocols (e.g. FTP, HTTP).
This tutorial focuses on the most common case, HTTP.

For straightforward situations *urlopen* is very easy to use. But as soon as you
encounter errors or non-trivial cases when opening HTTP URLs, you will need some
understanding of the HyperText Transfer Protocol. The most comprehensive and
authoritative reference to HTTP is :rfc:`2616`. This is a technical document and
not intended to be easy to read. This HOWTO aims to illustrate using *urllib*,
with enough detail about HTTP to help you through. It is not intended to replace
the :mod:`urllib.request` docs, but is supplementary to them.


Fetching URLs
=============

The simplest way to use urllib.request is as follows::

    import urllib.request
    with urllib.request.urlopen('http://python.org/') as response:
        html = response.read()

If you wish to retrieve a resource via URL and store it in a temporary
location, you can do so via the :func:`shutil.copyfileobj` and
:func:`tempfile.NamedTemporaryFile` functions::

    import shutil
    import tempfile
    import urllib.request

    with urllib.request.urlopen('http://python.org/') as response:
        with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
            shutil.copyfileobj(response, tmp_file)

    with open(tmp_file.name) as html:
        pass

Many uses of urllib will be that simple (note that instead of an 'http:' URL we
could have used a URL starting with 'ftp:', 'file:', etc.).  However, it's the
purpose of this tutorial to explain the more complicated cases, concentrating on
HTTP.
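
Note that ``read()`` returns bytes. To turn the page into text you need to
decode it, and a server usually declares its encoding in the ``Content-Type``
response header. A small sketch (the ``fetch_text`` helper and the UTF-8
fallback are choices of ours, not rules of the module):

```python
import urllib.request

def fetch_text(url):
    """Fetch a URL and decode the body using the server's declared charset."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()  # always bytes
        # get_content_charset() reads the charset from the Content-Type
        # header; falling back to UTF-8 is an assumption, not a rule.
        charset = response.headers.get_content_charset() or 'utf-8'
        return raw.decode(charset)
```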

HTTP is based on requests and responses - the client makes requests and servers
send responses. urllib.request mirrors this with a ``Request`` object which represents
the HTTP request you are making. In its simplest form you create a Request
object that specifies the URL you want to fetch. Calling ``urlopen`` with this
Request object returns a response object for the URL requested. This response is
a file-like object, which means you can for example call ``.read()`` on the
response::

    import urllib.request

    req = urllib.request.Request('http://www.voidspace.org.uk')
    with urllib.request.urlopen(req) as response:
        the_page = response.read()

Note that urllib.request makes use of the same Request interface to handle all URL
schemes.  For example, you can make an FTP request like so::

    req = urllib.request.Request('ftp://example.com/')

In the case of HTTP, there are two extra things that Request objects allow you
to do: First, you can pass data to be sent to the server.  Second, you can pass
extra information ("metadata") *about* the data or about the request itself, to
the server - this information is sent as HTTP "headers".  Let's look at each of
these in turn.

Data
----

Sometimes you want to send data to a URL (often the URL will refer to a CGI
(Common Gateway Interface) script or other web application). With HTTP,
this is often done using what's known as a **POST** request. This is often what
your browser does when you submit an HTML form that you filled in on the web. Not
all POSTs have to come from forms: you can use a POST to transmit arbitrary data
to your own application. In the common case of HTML forms, the data needs to be
encoded in a standard way, and then passed to the Request object as the ``data``
argument. The encoding is done using a function from the :mod:`urllib.parse`
library. ::

    import urllib.parse
    import urllib.request

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    values = {'name': 'Michael Foord',
              'location': 'Northampton',
              'language': 'Python'}

    data = urllib.parse.urlencode(values)
    data = data.encode('ascii')  # data should be bytes
    req = urllib.request.Request(url, data)
    with urllib.request.urlopen(req) as response:
        the_page = response.read()

Note that other encodings are sometimes required (e.g. for file upload from HTML
forms - see `HTML Specification, Form Submission
<https://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13>`_ for more
details).
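
HTML form encoding is not the only possibility. Many web applications accept a
JSON body instead, in which case you serialise the data yourself and declare
the content type in a header. A sketch (the URL below is purely hypothetical):

```python
import json
import urllib.request

url = 'http://www.example.com/api/register'  # hypothetical endpoint
values = {'name': 'Michael Foord', 'language': 'Python'}

json_data = json.dumps(values).encode('utf-8')  # the body must be bytes
req = urllib.request.Request(url, data=json_data,
                             headers={'Content-Type': 'application/json'})

# Supplying data automatically makes this a POST request:
print(req.get_method())
```

As before, passing the ``Request`` to ``urlopen`` would then perform the
request.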

If you do not pass the ``data`` argument, urllib uses a **GET** request. One
way in which GET and POST requests differ is that POST requests often have
"side-effects": they change the state of the system in some way (for example by
placing an order with the website for a hundredweight of tinned spam to be
delivered to your door).  Though the HTTP standard makes it clear that POSTs are
intended to *always* cause side-effects, and GET requests *never* to cause
side-effects, nothing prevents a GET request from having side-effects, nor a
POST request from having no side-effects. Data can also be passed in an HTTP
GET request by encoding it in the URL itself.

This is done as follows::

    >>> import urllib.request
    >>> import urllib.parse
    >>> data = {}
    >>> data['name'] = 'Somebody Here'
    >>> data['location'] = 'Northampton'
    >>> data['language'] = 'Python'
    >>> url_values = urllib.parse.urlencode(data)
    >>> print(url_values)  # The order may differ from below.  #doctest: +SKIP
    name=Somebody+Here&language=Python&location=Northampton
    >>> url = 'http://www.example.com/example.cgi'
    >>> full_url = url + '?' + url_values
    >>> data = urllib.request.urlopen(full_url)

Notice that the full URL is created by adding a ``?`` to the URL, followed by
the encoded values.

Headers
-------

We'll discuss here one particular HTTP header, to illustrate how to add headers
to your HTTP request.

Some websites [#]_ dislike being browsed by programs, or send different versions
to different browsers [#]_. By default urllib identifies itself as
``Python-urllib/x.y`` (where ``x`` and ``y`` are the major and minor version
numbers of the Python release,
e.g. ``Python-urllib/2.5``), which may confuse the site, or just plain
not work. The way a browser identifies itself is through the
``User-Agent`` header [#]_. When you create a Request object you can
pass a dictionary of headers in. The following example makes the same
request as above, but identifies itself as a version of Internet
Explorer [#]_. ::

    import urllib.parse
    import urllib.request

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
    values = {'name': 'Michael Foord',
              'location': 'Northampton',
              'language': 'Python'}
    headers = {'User-Agent': user_agent}

    data = urllib.parse.urlencode(values)
    data = data.encode('ascii')
    req = urllib.request.Request(url, data, headers)
    with urllib.request.urlopen(req) as response:
        the_page = response.read()

The response also has two useful methods. See the section on `info and geturl`_
which comes after we have a look at what happens when things go wrong.
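
Headers can also be added to an existing ``Request`` with its ``add_header``
method, and an opener can carry default headers for every request it makes
through its ``addheaders`` attribute. A short sketch (the user agent strings
here are only examples):

```python
import urllib.request

req = urllib.request.Request('http://www.example.com/')
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)')

# build_opener() returns an OpenerDirector; its addheaders list is
# applied to every request the opener makes.
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'MyScript/1.0')]
```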


Handling Exceptions
===================

*urlopen* raises :exc:`URLError` when it cannot handle a response (though as
usual with Python APIs, built-in exceptions such as :exc:`ValueError`,
:exc:`TypeError` etc. may also be raised).

:exc:`HTTPError` is the subclass of :exc:`URLError` raised in the specific case of
HTTP URLs.

The exception classes are exported from the :mod:`urllib.error` module.

URLError
--------

Often, URLError is raised because there is no network connection (no route to
the specified server), or the specified server doesn't exist.  In this case, the
exception raised will have a 'reason' attribute, which is a tuple containing an
error code and a text error message.

e.g. ::

    >>> req = urllib.request.Request('http://www.pretend_server.org')
    >>> try: urllib.request.urlopen(req)
    ... except urllib.error.URLError as e:
    ...     print(e.reason)      #doctest: +SKIP
    ...
    (4, 'getaddrinfo failed')


HTTPError
---------

Every HTTP response from the server contains a numeric "status code". Sometimes
the status code indicates that the server is unable to fulfil the request. The
default handlers will handle some of these responses for you (for example, if
the response is a "redirection" that requests the client fetch the document from
a different URL, urllib will handle that for you). For those it can't handle,
urlopen will raise an :exc:`HTTPError`. Typical errors include '404' (page not
found), '403' (request forbidden), and '401' (authentication required).

See section 10 of :rfc:`2616` for a reference on all the HTTP error codes.

The :exc:`HTTPError` instance raised will have an integer 'code' attribute, which
corresponds to the error sent by the server.

Error Codes
~~~~~~~~~~~

Because the default handlers handle redirects (codes in the 300 range), and
codes in the 100--299 range indicate success, you will usually only see error
codes in the 400--599 range.

:attr:`http.server.BaseHTTPRequestHandler.responses` is a useful dictionary of
response codes that shows all the response codes used by :rfc:`2616`. The
dictionary is reproduced here for convenience ::

    # Table mapping response codes to messages; entries have the
    # form {code: (shortmessage, longmessage)}.
    responses = {
        100: ('Continue', 'Request received, please continue'),
        101: ('Switching Protocols',
              'Switching to new protocol; obey Upgrade header'),

        200: ('OK', 'Request fulfilled, document follows'),
        201: ('Created', 'Document created, URL follows'),
        202: ('Accepted',
              'Request accepted, processing continues off-line'),
        203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
        204: ('No Content', 'Request fulfilled, nothing follows'),
        205: ('Reset Content', 'Clear input form for further input.'),
        206: ('Partial Content', 'Partial content follows.'),

        300: ('Multiple Choices',
              'Object has several resources -- see URI list'),
        301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
        302: ('Found', 'Object moved temporarily -- see URI list'),
        303: ('See Other', 'Object moved -- see Method and URL list'),
        304: ('Not Modified',
              'Document has not changed since given time'),
        305: ('Use Proxy',
              'You must use proxy specified in Location to access this '
              'resource.'),
        307: ('Temporary Redirect',
              'Object moved temporarily -- see URI list'),

        400: ('Bad Request',
              'Bad request syntax or unsupported method'),
        401: ('Unauthorized',
              'No permission -- see authorization schemes'),
        402: ('Payment Required',
              'No payment -- see charging schemes'),
        403: ('Forbidden',
              'Request forbidden -- authorization will not help'),
        404: ('Not Found', 'Nothing matches the given URI'),
        405: ('Method Not Allowed',
              'Specified method is invalid for this server.'),
        406: ('Not Acceptable', 'URI not available in preferred format.'),
        407: ('Proxy Authentication Required', 'You must authenticate with '
              'this proxy before proceeding.'),
        408: ('Request Timeout', 'Request timed out; try again later.'),
        409: ('Conflict', 'Request conflict.'),
        410: ('Gone',
              'URI no longer exists and has been permanently removed.'),
        411: ('Length Required', 'Client must specify Content-Length.'),
        412: ('Precondition Failed', 'Precondition in headers is false.'),
        413: ('Request Entity Too Large', 'Entity is too large.'),
        414: ('Request-URI Too Long', 'URI is too long.'),
        415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
        416: ('Requested Range Not Satisfiable',
              'Cannot satisfy request range.'),
        417: ('Expectation Failed',
              'Expect condition could not be satisfied.'),

        500: ('Internal Server Error', 'Server got itself in trouble'),
        501: ('Not Implemented',
              'Server does not support this operation'),
        502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
        503: ('Service Unavailable',
              'The server cannot process the request due to a high load'),
        504: ('Gateway Timeout',
              'The gateway server did not receive a timely response'),
        505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
        }
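
If you only need to turn a numeric code into its standard reason phrase, the
standard library also exposes this as a plain dictionary,
``http.client.responses``:

```python
import http.client

# Map a status code to its standard reason phrase.
print(http.client.responses[404])  # Not Found
print(http.client.responses[503])  # Service Unavailable
```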

When an error is raised the server responds by returning an HTTP error code
*and* an error page. You can use the :exc:`HTTPError` instance as a response for
the page returned. This means that as well as the code attribute, it also has
read, geturl, and info methods, as returned by the ``urllib.response`` module::

    >>> req = urllib.request.Request('http://www.python.org/fish.html')
    >>> try:
    ...     urllib.request.urlopen(req)
    ... except urllib.error.HTTPError as e:
    ...     print(e.code)
    ...     print(e.read())  #doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
    ...
    404
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n\n\n<html
      ...
      <title>Page Not Found</title>\n
      ...

Wrapping it Up
--------------

So if you want to be prepared for :exc:`HTTPError` *or* :exc:`URLError` there are two
basic approaches. I prefer the second approach.

Number 1
~~~~~~~~

::


    from urllib.request import Request, urlopen
    from urllib.error import URLError, HTTPError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except HTTPError as e:
        print('The server couldn\'t fulfill the request.')
        print('Error code: ', e.code)
    except URLError as e:
        print('We failed to reach a server.')
        print('Reason: ', e.reason)
    else:
        # everything is fine
        the_page = response.read()


.. note::

    The ``except HTTPError`` *must* come first, otherwise ``except URLError``
    will *also* catch an :exc:`HTTPError`.

Number 2
~~~~~~~~

::

    from urllib.request import Request, urlopen
    from urllib.error import URLError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except URLError as e:
        # An HTTPError has both 'code' and 'reason' attributes,
        # so test for 'code' first.
        if hasattr(e, 'code'):
            print('The server couldn\'t fulfill the request.')
            print('Error code: ', e.code)
        elif hasattr(e, 'reason'):
            print('We failed to reach a server.')
            print('Reason: ', e.reason)
    else:
        # everything is fine
        the_page = response.read()


info and geturl
===============

The response returned by urlopen (or the :exc:`HTTPError` instance) has two
useful methods, :meth:`info` and :meth:`geturl`, and is defined in the
:mod:`urllib.response` module.

**geturl** - this returns the real URL of the page fetched. This is useful
because ``urlopen`` (or the opener object used) may have followed a
redirect. The URL of the page fetched may not be the same as the URL requested.

**info** - this returns a dictionary-like object that describes the page
fetched, particularly the headers sent by the server. It is currently an
:class:`http.client.HTTPMessage` instance.

Typical headers include 'Content-length', 'Content-type', and so on. See the
`Quick Reference to HTTP Headers <http://jkorpela.fi/http.html>`_
for a useful listing of HTTP headers with brief explanations of their meaning
and use.
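
For example, both methods can be used to record where a fetch actually ended up
and what the server said it returned. A sketch (the ``describe`` helper is our
own name, not part of the module):

```python
import urllib.request

def describe(url):
    """Return (final_url, content_type) for a fetched resource."""
    with urllib.request.urlopen(url) as response:
        # geturl() is the URL actually fetched, after any redirects;
        # info() behaves like a dictionary of the response headers.
        return response.geturl(), response.info().get('Content-Type')
```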


Openers and Handlers
====================

When you fetch a URL you use an opener (an instance of the perhaps
confusingly-named :class:`urllib.request.OpenerDirector`). Normally we have been using
the default opener - via ``urlopen`` - but you can create custom
openers. Openers use handlers. All the "heavy lifting" is done by the
handlers. Each handler knows how to open URLs for a particular URL scheme (http,
ftp, etc.), or how to handle an aspect of URL opening, for example HTTP
redirections or HTTP cookies.

You will want to create openers if you want to fetch URLs with specific handlers
installed, for example to get an opener that handles cookies, or to get an
opener that does not handle redirections.

To create an opener, instantiate an ``OpenerDirector``, and then call
``.add_handler(some_handler_instance)`` repeatedly.

Alternatively, you can use ``build_opener``, which is a convenience function for
creating opener objects with a single function call.  ``build_opener`` adds
several handlers by default, but provides a quick way to add more and/or
override the default handlers.
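
For example, a cookie-aware opener can be built by passing an
``HTTPCookieProcessor`` (wrapping an ``http.cookiejar.CookieJar``) to
``build_opener``; the default handlers are still added alongside it:

```python
import http.cookiejar
import urllib.request

cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar))

# The cookie handler sits alongside the handlers build_opener adds by default.
has_cookies = any(isinstance(handler, urllib.request.HTTPCookieProcessor)
                  for handler in opener.handlers)
```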

Other sorts of handlers can handle proxies, authentication, and other common
but slightly specialised situations.

``install_opener`` can be used to make an ``opener`` object the (global) default
opener. This means that calls to ``urlopen`` will use the opener you have
installed.

Opener objects have an ``open`` method, which can be called directly to fetch
urls in the same way as the ``urlopen`` function: there's no need to call
``install_opener``, except as a convenience.


Basic Authentication
====================

To illustrate creating and installing a handler we will use the
``HTTPBasicAuthHandler``. For a more detailed discussion of this subject --
including an explanation of how Basic Authentication works -- see the `Basic
Authentication Tutorial
<http://www.voidspace.org.uk/python/articles/authentication.shtml>`_.

When authentication is required, the server sends a header (as well as the 401
error code) requesting authentication.  This specifies the authentication scheme
and a 'realm'. The header looks like: ``WWW-Authenticate: SCHEME
realm="REALM"``.

e.g.

.. code-block:: none

    WWW-Authenticate: Basic realm="cPanel Users"


The client should then retry the request with the appropriate name and password
for the realm included as a header in the request. This is 'basic
authentication'. In order to simplify this process we can create an instance of
``HTTPBasicAuthHandler`` and an opener to use this handler.

The ``HTTPBasicAuthHandler`` uses an object called a password manager to handle
the mapping of URLs and realms to passwords and usernames. If you know what the
realm is (from the authentication header sent by the server), then you can use
an ``HTTPPasswordMgr``. Frequently one doesn't care what the realm is. In that
case, it is convenient to use ``HTTPPasswordMgrWithDefaultRealm``. This allows
you to specify a default username and password for a URL, which will be supplied
unless you provide an alternative combination for a specific realm. We indicate
this by providing ``None`` as the realm argument to the ``add_password`` method.

The top-level URL is the first URL that requires authentication. URLs "deeper"
than the URL you pass to ``.add_password()`` will also match. ::

    # create a password manager
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()

    # Add the username and password.
    # If we knew the realm, we could use it instead of None.
    top_level_url = "http://example.com/foo/"
    password_mgr.add_password(None, top_level_url, username, password)

    handler = urllib.request.HTTPBasicAuthHandler(password_mgr)

    # create "opener" (OpenerDirector instance)
    opener = urllib.request.build_opener(handler)

    # use the opener to fetch a URL
    opener.open(a_url)

    # Install the opener.
    # Now all calls to urllib.request.urlopen use our opener.
    urllib.request.install_opener(opener)

.. note::

    In the above example we only supplied our ``HTTPBasicAuthHandler`` to
    ``build_opener``. By default openers have the handlers for normal situations
    -- ``ProxyHandler`` (if a proxy setting such as an :envvar:`http_proxy`
    environment variable is set), ``UnknownHandler``, ``HTTPHandler``,
    ``HTTPDefaultErrorHandler``, ``HTTPRedirectHandler``, ``FTPHandler``,
    ``FileHandler``, ``DataHandler``, ``HTTPErrorProcessor``.

``top_level_url`` is in fact *either* a full URL (including the 'http:' scheme
component and the hostname and optionally the port number)
e.g. ``"http://example.com/"`` *or* an "authority" (i.e. the hostname,
optionally including the port number) e.g. ``"example.com"`` or ``"example.com:8080"``
(the latter example includes a port number).  The authority, if present, must
NOT contain the "userinfo" component - for example ``"joe:password@example.com"`` is
not correct.
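
Both forms can be given to the same password manager. A quick sketch
(``'joe'`` and ``'secret'`` are placeholder credentials):

```python
import urllib.request

password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
# A full URL...
password_mgr.add_password(None, 'http://example.com/', 'joe', 'secret')
# ...or a bare authority with an optional port number:
password_mgr.add_password(None, 'example.com:8080', 'joe', 'secret')

# "Deeper" URLs match the entry added for the top-level URL.
user, pw = password_mgr.find_user_password(None, 'http://example.com/foo/')
```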


Proxies
=======

**urllib** will auto-detect your proxy settings and use those. This is through
the ``ProxyHandler``, which is part of the normal handler chain when a proxy
setting is detected.  Normally that's a good thing, but there are occasions
when it may not be helpful [#]_. One way to disable automatic proxy handling is
to set up our own ``ProxyHandler``, with no proxies defined. This is done using
similar steps to setting up a `Basic Authentication`_ handler: ::

    >>> proxy_support = urllib.request.ProxyHandler({})
    >>> opener = urllib.request.build_opener(proxy_support)
    >>> urllib.request.install_opener(opener)

.. note::

    Currently ``urllib.request`` *does not* support fetching of ``https`` locations
    through a proxy.  However, this can be enabled by extending urllib.request as
    shown in the recipe [#]_.

.. note::

    ``HTTP_PROXY`` will be ignored if a variable ``REQUEST_METHOD`` is set; see
    the documentation on :func:`~urllib.request.getproxies`.


Sockets and Layers
==================

The Python support for fetching resources from the web is layered.  urllib uses
the :mod:`http.client` library, which in turn uses the socket library.

As of Python 2.3 you can specify how long a socket should wait for a response
before timing out. This can be useful in applications which have to fetch web
pages. By default the socket module has *no timeout* and can hang. In current
Python versions a timeout can be passed straight to ``urlopen``; you can also
set the default timeout globally for all sockets using ::

    import socket
    import urllib.request

    # timeout in seconds
    timeout = 10
    socket.setdefaulttimeout(timeout)

    # this call to urllib.request.urlopen now uses the default timeout
    # we have set in the socket module
    req = urllib.request.Request('http://www.voidspace.org.uk')
    response = urllib.request.urlopen(req)
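
``urlopen`` also accepts a per-call *timeout* argument, which avoids changing
the global socket default. A sketch (the ``fetch`` helper and the choice to
return ``None`` on failure are ours):

```python
import urllib.request

def fetch(url, timeout=10):
    """Fetch a URL, returning None on network failure or timeout."""
    try:
        # The timeout (in seconds) applies to blocking socket operations;
        # on expiry a timeout error is raised - an OSError subclass, so
        # the except clause below catches it along with URLError.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except OSError:
        return None
```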


-------


Footnotes
=========

This document was reviewed and revised by John Lee.

.. [#] Google for example.
.. [#] Browser sniffing is a very bad practice for website design - building
       sites using web standards is much more sensible. Unfortunately a lot of
       sites still send different versions to different browsers.
.. [#] The user agent for MSIE 6 is
       *'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'*
.. [#] For details of more HTTP request headers, see
       `Quick Reference to HTTP Headers`_.
.. [#] In my case I have to use a proxy to access the internet at work. If you
       attempt to fetch *localhost* URLs through this proxy it blocks them. IE
       is set to use the proxy, which urllib picks up on. In order to test
       scripts with a localhost server, I have to prevent urllib from using
       the proxy.
.. [#] urllib opener for SSL proxy (CONNECT method): `ASPN Cookbook Recipe
       <https://code.activestate.com/recipes/456195/>`_.