:mod:`tokenize` --- Tokenizer for Python source
===============================================

.. module:: tokenize
   :synopsis: Lexical scanner for Python source code.

.. moduleauthor:: Ka-Ping Yee
.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>

**Source code:** :source:`Lib/tokenize.py`

--------------

The :mod:`tokenize` module provides a lexical scanner for Python source code,
implemented in Python.  The scanner in this module returns comments as tokens
as well, making it useful for implementing "pretty-printers", including
colorizers for on-screen displays.

To simplify token stream handling, all :ref:`operator <operators>` and
:ref:`delimiter <delimiters>` tokens and :data:`Ellipsis` are returned using
the generic :data:`~token.OP` token type.  The exact
type can be determined by checking the ``exact_type`` property on the
:term:`named tuple` returned from :func:`tokenize.tokenize`.
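
For example, a ``+`` token is reported with the generic type
:data:`~token.OP`, while its ``exact_type`` is :data:`~token.PLUS`.  A
minimal, illustrative sketch::

    import io
    import tokenize

    # tokenize() expects a readline callable that yields bytes.
    source = b"1 + 2\n"
    for tok in tokenize.tokenize(io.BytesIO(source).readline):
        print(tokenize.tok_name[tok.type], tokenize.tok_name[tok.exact_type])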

Tokenizing Input
----------------

The primary entry point is a :term:`generator`:

.. function:: tokenize(readline)

   The :func:`.tokenize` generator requires one argument, *readline*, which
   must be a callable object that provides the same interface as the
   :meth:`io.IOBase.readline` method of file objects.  Each call to the
   function should return one line of input as bytes.

   The generator produces 5-tuples with these members: the token type; the
   token string; a 2-tuple ``(srow, scol)`` of ints specifying the row and
   column where the token begins in the source; a 2-tuple ``(erow, ecol)`` of
   ints specifying the row and column where the token ends in the source; and
   the line on which the token was found. The line passed (the last tuple item)
   is the *physical* line.  The 5-tuple is returned as a :term:`named tuple`
   with the field names:
   ``type string start end line``.

   The returned :term:`named tuple` has an additional property named
   ``exact_type`` that contains the exact operator type for
   :data:`~token.OP` tokens.  For all other token types ``exact_type``
   equals the named tuple ``type`` field.

   .. versionchanged:: 3.1
      Added support for named tuples.

   .. versionchanged:: 3.3
      Added support for ``exact_type``.

   :func:`.tokenize` determines the source encoding of the file by looking for a
   UTF-8 BOM or encoding cookie, according to :pep:`263`.
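
   The token positions and the physical line are available as named fields.
   For example (an illustrative sketch)::

      import io
      import tokenize

      source = b"spam = 42\n"
      for tok in tokenize.tokenize(io.BytesIO(source).readline):
          # tok.start and tok.end are (row, column) pairs; tok.line is the
          # physical line the token was found on.
          print(tok.start, tok.end, repr(tok.line))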

.. function:: generate_tokens(readline)

   Tokenize a source reading unicode strings instead of bytes.

   Like :func:`.tokenize`, the *readline* argument is a callable returning
   a single line of input. However, :func:`generate_tokens` expects *readline*
   to return a str object rather than bytes.

   The result is an iterator yielding named tuples, exactly like
   :func:`.tokenize`. It does not yield an :data:`~token.ENCODING` token.

All constants from the :mod:`token` module are also exported from
:mod:`tokenize`.

Another function is provided to reverse the tokenization process. This is
useful for creating tools that tokenize a script, modify the token stream, and
write back the modified script.


.. function:: untokenize(iterable)

    Converts tokens back into Python source code.  The *iterable* must return
    sequences with at least two elements, the token type and the token string.
    Any additional sequence elements are ignored.

    The reconstructed script is returned as a single string.  The result is
    guaranteed to tokenize back to match the input so that the conversion is
    lossless and round-trips are assured.  The guarantee applies only to the
    token type and token string as the spacing between tokens (column
    positions) may change.

    It returns bytes, encoded using the :data:`~token.ENCODING` token, which
    is the first token sequence output by :func:`.tokenize`. If there is no
    encoding token in the input, it returns a str instead.
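
    A minimal round-trip sketch (for illustration only)::

        import io
        import tokenize

        source = b"x = 1 + 2\n"
        tokens = list(tokenize.tokenize(io.BytesIO(source).readline))
        # The first token is ENCODING, so untokenize() returns bytes here.
        rebuilt = tokenize.untokenize(tokens)

        # Token types and strings survive the round trip, even though the
        # spacing between tokens is not guaranteed to be preserved.
        original = [(t.type, t.string) for t in tokens]
        recovered = [(t.type, t.string)
                     for t in tokenize.tokenize(io.BytesIO(rebuilt).readline)]
        assert original == recovered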


:func:`.tokenize` needs to detect the encoding of source files it tokenizes. The
function it uses to do this is available:

.. function:: detect_encoding(readline)

    The :func:`detect_encoding` function is used to detect the encoding that
    should be used to decode a Python source file. It requires one argument,
    readline, in the same way as the :func:`.tokenize` generator.

    It will call readline a maximum of twice, and return the encoding used
    (as a string) and a list of any lines (not decoded from bytes) it has read
    in.

    It detects the encoding from the presence of a UTF-8 BOM or an encoding
    cookie as specified in :pep:`263`. If both a BOM and a cookie are present,
    but disagree, a :exc:`SyntaxError` will be raised. Note that if the BOM is found,
    ``'utf-8-sig'`` will be returned as an encoding.

    If no encoding is specified, then the default of ``'utf-8'`` will be
    returned.

    Use :func:`.open` to open Python source files: it uses
    :func:`detect_encoding` to detect the file encoding.
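
    For example (an illustrative sketch)::

        import io
        import tokenize

        buf = io.BytesIO(b"# -*- coding: latin-1 -*-\nprint('hello')\n")
        encoding, lines = tokenize.detect_encoding(buf.readline)
        print(encoding)  # 'iso-8859-1' -- cookie names are normalized
        print(lines)     # the raw byte lines read so far, still undecoded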


.. function:: open(filename)

   Open a file in read only mode using the encoding detected by
   :func:`detect_encoding`.

   .. versionadded:: 3.2

.. exception:: TokenError

   Raised when either a docstring or expression that may be split over several
   lines is not completed anywhere in the file, for example::

      """Beginning of
      docstring

   or::

      [1,
       2,
       3

Note that unclosed single-quoted strings do not cause an error to be
raised. They are tokenized as :data:`~token.ERRORTOKEN`, followed by the
tokenization of their contents.
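
For example, tokenizing an unterminated multi-line expression raises
:exc:`TokenError` while the token stream is being consumed (a minimal sketch;
the exact error message may vary between Python versions)::

    import io
    import tokenize

    source = b"[1,\n 2,\n 3\n"  # the closing bracket is missing
    try:
        list(tokenize.tokenize(io.BytesIO(source).readline))
    except tokenize.TokenError as exc:
        print("unterminated expression:", exc)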


.. _tokenize-cli:

Command-Line Usage
------------------

.. versionadded:: 3.3

The :mod:`tokenize` module can be executed as a script from the command line.
It is as simple as:

.. code-block:: sh

   python -m tokenize [-e] [filename.py]

The following options are accepted:

.. program:: tokenize

.. cmdoption:: -h, --help

   show this help message and exit

.. cmdoption:: -e, --exact

   display token names using the exact type

If :file:`filename.py` is specified, its contents are tokenized to stdout.
Otherwise, tokenization is performed on stdin.
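
For example, a short snippet can be piped to the tokenizer on standard input
(output abridged; the exact listing may vary):

.. code-block:: shell-session

    $ echo "1 + 2" | python -m tokenize
    1,0-1,1:            NUMBER         '1'
    1,2-1,3:            OP             '+'
    1,4-1,5:            NUMBER         '2'
    ...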

Examples
------------------

Example of a script rewriter that transforms float literals into Decimal
objects::

    from tokenize import tokenize, untokenize, NUMBER, STRING, NAME, OP
    from io import BytesIO

    def decistmt(s):
        """Substitute Decimals for floats in a string of statements.

        >>> from decimal import Decimal
        >>> s = 'print(+21.3e-5*-.1234/81.7)'
        >>> decistmt(s)
        "print (+Decimal ('21.3e-5')*-Decimal ('.1234')/Decimal ('81.7'))"

        The format of the exponent is inherited from the platform C library.
        Known cases are "e-007" (Windows) and "e-07" (not Windows).  Since
        we're only showing 12 digits, and the 13th isn't close to 5, the
        rest of the output should be platform-independent.

        >>> exec(s)  #doctest: +ELLIPSIS
        -3.21716034272e-0...7

        Output from calculations with Decimal should be identical across all
        platforms.

        >>> exec(decistmt(s))
        -3.217160342717258261933904529E-7
        """
        result = []
        g = tokenize(BytesIO(s.encode('utf-8')).readline)  # tokenize the string
        for toknum, tokval, _, _, _ in g:
            if toknum == NUMBER and '.' in tokval:  # replace NUMBER tokens
                result.extend([
                    (NAME, 'Decimal'),
                    (OP, '('),
                    (STRING, repr(tokval)),
                    (OP, ')')
                ])
            else:
                result.append((toknum, tokval))
        return untokenize(result).decode('utf-8')

Example of tokenizing from the command line.  The script::

    def say_hello():
        print("Hello, World!")

    say_hello()

will be tokenized to the following output, where the first column is the range
of the line/column coordinates where the token is found, the second column is
the name of the token, and the final column is the value of the token (if any):

.. code-block:: shell-session

    $ python -m tokenize hello.py
    0,0-0,0:            ENCODING       'utf-8'
    1,0-1,3:            NAME           'def'
    1,4-1,13:           NAME           'say_hello'
    1,13-1,14:          OP             '('
    1,14-1,15:          OP             ')'
    1,15-1,16:          OP             ':'
    1,16-1,17:          NEWLINE        '\n'
    2,0-2,4:            INDENT         '    '
    2,4-2,9:            NAME           'print'
    2,9-2,10:           OP             '('
    2,10-2,25:          STRING         '"Hello, World!"'
    2,25-2,26:          OP             ')'
    2,26-2,27:          NEWLINE        '\n'
    3,0-3,1:            NL             '\n'
    4,0-4,0:            DEDENT         ''
    4,0-4,9:            NAME           'say_hello'
    4,9-4,10:           OP             '('
    4,10-4,11:          OP             ')'
    4,11-4,12:          NEWLINE        '\n'
    5,0-5,0:            ENDMARKER      ''

The exact token type names can be displayed using the :option:`-e` option:

.. code-block:: shell-session

    $ python -m tokenize -e hello.py
    0,0-0,0:            ENCODING       'utf-8'
    1,0-1,3:            NAME           'def'
    1,4-1,13:           NAME           'say_hello'
    1,13-1,14:          LPAR           '('
    1,14-1,15:          RPAR           ')'
    1,15-1,16:          COLON          ':'
    1,16-1,17:          NEWLINE        '\n'
    2,0-2,4:            INDENT         '    '
    2,4-2,9:            NAME           'print'
    2,9-2,10:           LPAR           '('
    2,10-2,25:          STRING         '"Hello, World!"'
    2,25-2,26:          RPAR           ')'
    2,26-2,27:          NEWLINE        '\n'
    3,0-3,1:            NL             '\n'
    4,0-4,0:            DEDENT         ''
    4,0-4,9:            NAME           'say_hello'
    4,9-4,10:           LPAR           '('
    4,10-4,11:          RPAR           ')'
    4,11-4,12:          NEWLINE        '\n'
    5,0-5,0:            ENDMARKER      ''

Example of tokenizing a file programmatically, reading unicode
strings instead of bytes with :func:`generate_tokens`::

    import tokenize

    with tokenize.open('hello.py') as f:
        tokens = tokenize.generate_tokens(f.readline)
        for token in tokens:
            print(token)

Or reading bytes directly with :func:`.tokenize`::

    import tokenize

    with open('hello.py', 'rb') as f:
        tokens = tokenize.tokenize(f.readline)
        for token in tokens:
            print(token)