:mod:`urllib.robotparser` ---  Parser for robots.txt
====================================================

.. module:: urllib.robotparser
   :synopsis: Load a robots.txt file and answer questions about
              fetchability of other URLs.

.. sectionauthor:: Skip Montanaro <skip@pobox.com>

**Source code:** :source:`Lib/urllib/robotparser.py`

.. index::
   single: WWW
   single: World Wide Web
   single: URL
   single: robots.txt

--------------

This module provides a single class, :class:`RobotFileParser`, which answers
questions about whether a particular user agent can fetch a URL on the
website that published the :file:`robots.txt` file.  For more details on the
structure of :file:`robots.txt` files, see http://www.robotstxt.org/orig.html.


.. class:: RobotFileParser(url='')

   This class provides methods to read, parse and answer questions about the
   :file:`robots.txt` file at *url*.

   .. method:: set_url(url)

      Sets the URL referring to a :file:`robots.txt` file.

   .. method:: read()

      Reads the :file:`robots.txt` URL and feeds it to the parser.

   .. method:: parse(lines)

      Parses *lines*, an iterable of strings holding the lines of a
      :file:`robots.txt` file.

   .. method:: can_fetch(useragent, url)

      Returns ``True`` if the *useragent* is allowed to fetch the *url*
      according to the rules contained in the parsed :file:`robots.txt`
      file.

   .. method:: mtime()

      Returns the time the ``robots.txt`` file was last fetched.  This is
      useful for long-running web spiders that need to check for new
      ``robots.txt`` files periodically.

   .. method:: modified()

      Sets the time the ``robots.txt`` file was last fetched to the current
      time.

   .. method:: crawl_delay(useragent)

      Returns the value of the ``Crawl-delay`` parameter from ``robots.txt``
      for the *useragent* in question.  Returns ``None`` if there is no such
      parameter, if it doesn't apply to the *useragent* specified, or if the
      ``robots.txt`` entry for this parameter has invalid syntax.

      .. versionadded:: 3.6

   .. method:: request_rate(useragent)

      Returns the contents of the ``Request-rate`` parameter from
      ``robots.txt`` as a :term:`named tuple` ``RequestRate(requests, seconds)``.
      Returns ``None`` if there is no such parameter, if it doesn't apply to
      the *useragent* specified, or if the ``robots.txt`` entry for this
      parameter has invalid syntax.

      .. versionadded:: 3.6

   .. method:: site_maps()

      Returns the contents of the ``Sitemap`` parameter from
      ``robots.txt`` in the form of a :func:`list`.  Returns ``None`` if
      there is no such parameter or if the ``robots.txt`` entry for this
      parameter has invalid syntax.

      .. versionadded:: 3.8

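
The methods described above can also be exercised without network access by
passing the lines of a :file:`robots.txt` file directly to
:meth:`RobotFileParser.parse`.  The following is a minimal sketch; the rules,
the ``ExampleBot`` user agent and the ``example.com`` URLs are made up for
illustration:

```python
import urllib.robotparser

# Hypothetical robots.txt content, written inline for illustration only.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 6
Request-rate: 3/20
Disallow: /cgi-bin/

Sitemap: https://www.example.com/sitemap.xml
"""

rp = urllib.robotparser.RobotFileParser()
# parse() accepts the file's lines, so no call to set_url() or read() is needed.
rp.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" has no entry of its own, so the "*" rules apply to it.
allowed_home = rp.can_fetch("ExampleBot", "https://www.example.com/")
allowed_cgi = rp.can_fetch("ExampleBot", "https://www.example.com/cgi-bin/search")
delay = rp.crawl_delay("ExampleBot")
rate = rp.request_rate("ExampleBot")
sitemaps = rp.site_maps()

print(allowed_home, allowed_cgi)    # True False
print(delay)                        # 6
print(rate.requests, rate.seconds)  # 3 20
print(sitemaps)                     # ['https://www.example.com/sitemap.xml']
```

Parsing from memory like this can be convenient in tests, where fetching a
real :file:`robots.txt` file is undesirable.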
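
As a rough illustration of how :meth:`RobotFileParser.mtime` might be used by
a long-running crawler, the sketch below re-reads :file:`robots.txt` once the
cached copy is older than a chosen limit.  The helper name
``refresh_if_stale`` and the ``max_age`` default are assumptions made for
this example and are not part of the module:

```python
import time
import urllib.robotparser


def refresh_if_stale(rp, max_age=3600.0):
    """Re-read robots.txt if the last fetch is older than max_age seconds.

    Hypothetical helper, not part of urllib.robotparser.  Returns True if a
    refresh was attempted, False otherwise.
    """
    # mtime() is 0 until the file has been read or parsed at least once,
    # so a fresh parser always counts as stale.
    if time.time() - rp.mtime() > max_age:
        # read() performs a network request to the URL given via set_url()
        # (or the constructor) and feeds the result to parse().
        rp.read()
        return True
    return False
```

A crawler could call such a helper before each batch of requests, so that
rule changes published by the site are picked up eventually.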
The following example demonstrates basic use of the :class:`RobotFileParser`
class::

   >>> import urllib.robotparser
   >>> rp = urllib.robotparser.RobotFileParser()
   >>> rp.set_url("http://www.musi-cal.com/robots.txt")
   >>> rp.read()
   >>> rrate = rp.request_rate("*")
   >>> rrate.requests
   3
   >>> rrate.seconds
   20
   >>> rp.crawl_delay("*")
   6
   >>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
   False
   >>> rp.can_fetch("*", "http://www.musi-cal.com/")
   True

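
The values returned by :meth:`RobotFileParser.crawl_delay` and
:meth:`RobotFileParser.request_rate` can be combined into a single pause
between requests.  The sketch below shows one possible policy, not something
the module prescribes; the function name ``seconds_between_requests`` and the
``default`` fallback are assumptions for illustration.  It picks the stricter
of the two limits:

```python
import urllib.robotparser


def seconds_between_requests(rp, useragent, default=1.0):
    """Return a pause, in seconds, honouring Crawl-delay and Request-rate.

    Hypothetical helper, not part of urllib.robotparser.
    """
    candidates = []

    delay = rp.crawl_delay(useragent)
    if delay is not None:
        candidates.append(float(delay))

    rate = rp.request_rate(useragent)
    if rate is not None and rate.requests > 0:
        # For example, 3 requests per 20 seconds is one request every 20/3 s.
        candidates.append(rate.seconds / rate.requests)

    # If robots.txt specifies neither limit, use the caller's default.
    return max(candidates) if candidates else default
```

With the values shown in the example above (a crawl delay of 6 and a request
rate of 3 requests per 20 seconds), this policy would yield a pause of about
6.67 seconds.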