1:mod:`re` --- Regular expression operations
2===========================================
3
4.. module:: re
5   :synopsis: Regular expression operations.
6
7.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
8.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>
9
10**Source code:** :source:`Lib/re.py`
11
12--------------
13
14This module provides regular expression matching operations similar to
15those found in Perl.
16
17Both patterns and strings to be searched can be Unicode strings (:class:`str`)
18as well as 8-bit strings (:class:`bytes`).
19However, Unicode strings and 8-bit strings cannot be mixed:
20that is, you cannot match a Unicode string with a byte pattern or
21vice-versa; similarly, when asking for a substitution, the replacement
22string must be of the same type as both the pattern and the search string.
23
24Regular expressions use the backslash character (``'\'``) to indicate
25special forms or to allow special characters to be used without invoking
26their special meaning.  This collides with Python's usage of the same
27character for the same purpose in string literals; for example, to match
28a literal backslash, one might have to write ``'\\\\'`` as the pattern
29string, because the regular expression must be ``\\``, and each
30backslash must be expressed as ``\\`` inside a regular Python string
31literal. Also, please note that any invalid escape sequences in Python's
32usage of the backslash in string literals now generate a :exc:`DeprecationWarning`
33and in the future this will become a :exc:`SyntaxError`. This behaviour
34will happen even if it is a valid escape sequence for a regular expression.
35
36The solution is to use Python's raw string notation for regular expression
37patterns; backslashes are not handled in any special way in a string literal
38prefixed with ``'r'``.  So ``r"\n"`` is a two-character string containing
39``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
40newline.  Usually patterns will be expressed in Python code using this raw
41string notation.
42
43It is important to note that most regular expression operations are available as
44module-level functions and methods on
45:ref:`compiled regular expressions <re-objects>`.  The functions are shortcuts
46that don't require you to compile a regex object first, but miss some
47fine-tuning parameters.
48
49.. seealso::
50
51   The third-party `regex <https://pypi.org/project/regex/>`_ module,
52   which has an API compatible with the standard library :mod:`re` module,
53   but offers additional functionality and a more thorough Unicode support.
54
55
56.. _re-syntax:
57
58Regular Expression Syntax
59-------------------------
60
61A regular expression (or RE) specifies a set of strings that matches it; the
62functions in this module let you check if a particular string matches a given
63regular expression (or if a given regular expression matches a particular
64string, which comes down to the same thing).
65
66Regular expressions can be concatenated to form new regular expressions; if *A*
67and *B* are both regular expressions, then *AB* is also a regular expression.
68In general, if a string *p* matches *A* and another string *q* matches *B*, the
69string *pq* will match AB.  This holds unless *A* or *B* contain low precedence
70operations; boundary conditions between *A* and *B*; or have numbered group
71references.  Thus, complex expressions can easily be constructed from simpler
72primitive expressions like the ones described here.  For details of the theory
73and implementation of regular expressions, consult the Friedl book [Frie09]_,
74or almost any textbook about compiler construction.
75
76A brief explanation of the format of regular expressions follows.  For further
77information and a gentler presentation, consult the :ref:`regex-howto`.
78
79Regular expressions can contain both special and ordinary characters. Most
80ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
81expressions; they simply match themselves.  You can concatenate ordinary
82characters, so ``last`` matches the string ``'last'``.  (In the rest of this
83section, we'll write RE's in ``this special style``, usually without quotes, and
84strings to be matched ``'in single quotes'``.)
85
86Some characters, like ``'|'`` or ``'('``, are special. Special
87characters either stand for classes of ordinary characters, or affect
88how the regular expressions around them are interpreted.
89
90Repetition qualifiers (``*``, ``+``, ``?``, ``{m,n}``, etc) cannot be
91directly nested. This avoids ambiguity with the non-greedy modifier suffix
92``?``, and with other modifiers in other implementations. To apply a second
93repetition to an inner repetition, parentheses may be used. For example,
94the expression ``(?:a{6})*`` matches any multiple of six ``'a'`` characters.
95
96
97The special characters are:
98
99.. index:: single: . (dot); in regular expressions
100
101``.``
102   (Dot.)  In the default mode, this matches any character except a newline.  If
103   the :const:`DOTALL` flag has been specified, this matches any character
104   including a newline.
105
106.. index:: single: ^ (caret); in regular expressions
107
108``^``
109   (Caret.)  Matches the start of the string, and in :const:`MULTILINE` mode also
110   matches immediately after each newline.
111
112.. index:: single: $ (dollar); in regular expressions
113
114``$``
115   Matches the end of the string or just before the newline at the end of the
116   string, and in :const:`MULTILINE` mode also matches before a newline.  ``foo``
117   matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
118   only 'foo'.  More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
119   matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
120   a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
121   the newline, and one at the end of the string.
122
123.. index:: single: * (asterisk); in regular expressions
124
125``*``
126   Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
127   many repetitions as are possible.  ``ab*`` will match 'a', 'ab', or 'a' followed
128   by any number of 'b's.
129
130.. index:: single: + (plus); in regular expressions
131
132``+``
133   Causes the resulting RE to match 1 or more repetitions of the preceding RE.
134   ``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
135   match just 'a'.
136
137.. index:: single: ? (question mark); in regular expressions
138
139``?``
140   Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
141   ``ab?`` will match either 'a' or 'ab'.
142
143.. index::
144   single: *?; in regular expressions
145   single: +?; in regular expressions
146   single: ??; in regular expressions
147
148``*?``, ``+?``, ``??``
149   The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
150   as much text as possible.  Sometimes this behaviour isn't desired; if the RE
151   ``<.*>`` is matched against ``'<a> b <c>'``, it will match the entire
152   string, and not just ``'<a>'``.  Adding ``?`` after the qualifier makes it
153   perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as *few*
154   characters as possible will be matched.  Using the RE ``<.*?>`` will match
155   only ``'<a>'``.
156
157.. index::
158   single: {} (curly brackets); in regular expressions
159
160``{m}``
161   Specifies that exactly *m* copies of the previous RE should be matched; fewer
162   matches cause the entire RE not to match.  For example, ``a{6}`` will match
163   exactly six ``'a'`` characters, but not five.
164
165``{m,n}``
166   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
167   RE, attempting to match as many repetitions as possible.  For example,
168   ``a{3,5}`` will match from 3 to 5 ``'a'`` characters.  Omitting *m* specifies a
169   lower bound of zero,  and omitting *n* specifies an infinite upper bound.  As an
170   example, ``a{4,}b`` will match ``'aaaab'`` or a thousand ``'a'`` characters
171   followed by a ``'b'``, but not ``'aaab'``. The comma may not be omitted or the
172   modifier would be confused with the previously described form.
173
174``{m,n}?``
175   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
176   RE, attempting to match as *few* repetitions as possible.  This is the
177   non-greedy version of the previous qualifier.  For example, on the
178   6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
179   while ``a{3,5}?`` will only match 3 characters.
180
181.. index:: single: \ (backslash); in regular expressions
182
183``\``
184   Either escapes special characters (permitting you to match characters like
185   ``'*'``, ``'?'``, and so forth), or signals a special sequence; special
186   sequences are discussed below.
187
188   If you're not using a raw string to express the pattern, remember that Python
189   also uses the backslash as an escape sequence in string literals; if the escape
190   sequence isn't recognized by Python's parser, the backslash and subsequent
191   character are included in the resulting string.  However, if Python would
192   recognize the resulting sequence, the backslash should be repeated twice.  This
193   is complicated and hard to understand, so it's highly recommended that you use
194   raw strings for all but the simplest expressions.
195
196.. index::
197   single: [] (square brackets); in regular expressions
198
199``[]``
200   Used to indicate a set of characters.  In a set:
201
202   * Characters can be listed individually, e.g. ``[amk]`` will match ``'a'``,
203     ``'m'``, or ``'k'``.
204
205   .. index:: single: - (minus); in regular expressions
206
207   * Ranges of characters can be indicated by giving two characters and separating
208     them by a ``'-'``, for example ``[a-z]`` will match any lowercase ASCII letter,
209     ``[0-5][0-9]`` will match all the two-digits numbers from ``00`` to ``59``, and
210     ``[0-9A-Fa-f]`` will match any hexadecimal digit.  If ``-`` is escaped (e.g.
211     ``[a\-z]``) or if it's placed as the first or last character
212     (e.g. ``[-a]`` or ``[a-]``), it will match a literal ``'-'``.
213
214   * Special characters lose their special meaning inside sets.  For example,
215     ``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
216     ``'*'``, or ``')'``.
217
218   .. index:: single: \ (backslash); in regular expressions
219
220   * Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
221     inside a set, although the characters they match depends on whether
222     :const:`ASCII` or :const:`LOCALE` mode is in force.
223
224   .. index:: single: ^ (caret); in regular expressions
225
226   * Characters that are not within a range can be matched by :dfn:`complementing`
227     the set.  If the first character of the set is ``'^'``, all the characters
228     that are *not* in the set will be matched.  For example, ``[^5]`` will match
229     any character except ``'5'``, and ``[^^]`` will match any character except
230     ``'^'``.  ``^`` has no special meaning if it's not the first character in
231     the set.
232
233   * To match a literal ``']'`` inside a set, precede it with a backslash, or
234     place it at the beginning of the set.  For example, both ``[()[\]{}]`` and
235     ``[]()[{}]`` will both match a parenthesis.
236
237   .. .. index:: single: --; in regular expressions
238   .. .. index:: single: &&; in regular expressions
239   .. .. index:: single: ~~; in regular expressions
240   .. .. index:: single: ||; in regular expressions
241
242   * Support of nested sets and set operations as in `Unicode Technical
243     Standard #18`_ might be added in the future.  This would change the
244     syntax, so to facilitate this change a :exc:`FutureWarning` will be raised
245     in ambiguous cases for the time being.
246     That includes sets starting with a literal ``'['`` or containing literal
247     character sequences ``'--'``, ``'&&'``, ``'~~'``, and ``'||'``.  To
248     avoid a warning escape them with a backslash.
249
250   .. _Unicode Technical Standard #18: https://unicode.org/reports/tr18/
251
252   .. versionchanged:: 3.7
253      :exc:`FutureWarning` is raised if a character set contains constructs
254      that will change semantically in the future.
255
256.. index:: single: | (vertical bar); in regular expressions
257
258``|``
259   ``A|B``, where *A* and *B* can be arbitrary REs, creates a regular expression that
260   will match either *A* or *B*.  An arbitrary number of REs can be separated by the
261   ``'|'`` in this way.  This can be used inside groups (see below) as well.  As
262   the target string is scanned, REs separated by ``'|'`` are tried from left to
263   right. When one pattern completely matches, that branch is accepted. This means
264   that once *A* matches, *B* will not be tested further, even if it would
265   produce a longer overall match.  In other words, the ``'|'`` operator is never
266   greedy.  To match a literal ``'|'``, use ``\|``, or enclose it inside a
267   character class, as in ``[|]``.
268
269.. index::
270   single: () (parentheses); in regular expressions
271
272``(...)``
273   Matches whatever regular expression is inside the parentheses, and indicates the
274   start and end of a group; the contents of a group can be retrieved after a match
275   has been performed, and can be matched later in the string with the ``\number``
276   special sequence, described below.  To match the literals ``'('`` or ``')'``,
277   use ``\(`` or ``\)``, or enclose them inside a character class: ``[(]``, ``[)]``.
278
279.. index:: single: (?; in regular expressions
280
281``(?...)``
282   This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
283   otherwise).  The first character after the ``'?'`` determines what the meaning
284   and further syntax of the construct is. Extensions usually do not create a new
285   group; ``(?P<name>...)`` is the only exception to this rule. Following are the
286   currently supported extensions.
287
288``(?aiLmsux)``
289   (One or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
290   ``'s'``, ``'u'``, ``'x'``.)  The group matches the empty string; the
291   letters set the corresponding flags: :const:`re.A` (ASCII-only matching),
292   :const:`re.I` (ignore case), :const:`re.L` (locale dependent),
293   :const:`re.M` (multi-line), :const:`re.S` (dot matches all),
294   :const:`re.U` (Unicode matching), and :const:`re.X` (verbose),
295   for the entire regular expression.
296   (The flags are described in :ref:`contents-of-module-re`.)
297   This is useful if you wish to include the flags as part of the
298   regular expression, instead of passing a *flag* argument to the
299   :func:`re.compile` function.  Flags should be used first in the
300   expression string.
301
302.. index:: single: (?:; in regular expressions
303
304``(?:...)``
305   A non-capturing version of regular parentheses.  Matches whatever regular
306   expression is inside the parentheses, but the substring matched by the group
307   *cannot* be retrieved after performing a match or referenced later in the
308   pattern.
309
310``(?aiLmsux-imsx:...)``
311   (Zero or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
312   ``'s'``, ``'u'``, ``'x'``, optionally followed by ``'-'`` followed by
313   one or more letters from the ``'i'``, ``'m'``, ``'s'``, ``'x'``.)
314   The letters set or remove the corresponding flags:
315   :const:`re.A` (ASCII-only matching), :const:`re.I` (ignore case),
316   :const:`re.L` (locale dependent), :const:`re.M` (multi-line),
317   :const:`re.S` (dot matches all), :const:`re.U` (Unicode matching),
318   and :const:`re.X` (verbose), for the part of the expression.
319   (The flags are described in :ref:`contents-of-module-re`.)
320
321   The letters ``'a'``, ``'L'`` and ``'u'`` are mutually exclusive when used
322   as inline flags, so they can't be combined or follow ``'-'``.  Instead,
323   when one of them appears in an inline group, it overrides the matching mode
324   in the enclosing group.  In Unicode patterns ``(?a:...)`` switches to
325   ASCII-only matching, and ``(?u:...)`` switches to Unicode matching
326   (default).  In byte pattern ``(?L:...)`` switches to locale depending
327   matching, and ``(?a:...)`` switches to ASCII-only matching (default).
328   This override is only in effect for the narrow inline group, and the
329   original matching mode is restored outside of the group.
330
331   .. versionadded:: 3.6
332
333   .. versionchanged:: 3.7
334      The letters ``'a'``, ``'L'`` and ``'u'`` also can be used in a group.
335
336.. index:: single: (?P<; in regular expressions
337
338``(?P<name>...)``
339   Similar to regular parentheses, but the substring matched by the group is
340   accessible via the symbolic group name *name*.  Group names must be valid
341   Python identifiers, and each group name must be defined only once within a
342   regular expression.  A symbolic group is also a numbered group, just as if
343   the group were not named.
344
345   Named groups can be referenced in three contexts.  If the pattern is
346   ``(?P<quote>['"]).*?(?P=quote)`` (i.e. matching a string quoted with either
347   single or double quotes):
348
349   +---------------------------------------+----------------------------------+
350   | Context of reference to group "quote" | Ways to reference it             |
351   +=======================================+==================================+
352   | in the same pattern itself            | * ``(?P=quote)`` (as shown)      |
353   |                                       | * ``\1``                         |
354   +---------------------------------------+----------------------------------+
355   | when processing match object *m*      | * ``m.group('quote')``           |
356   |                                       | * ``m.end('quote')`` (etc.)      |
357   +---------------------------------------+----------------------------------+
358   | in a string passed to the *repl*      | * ``\g<quote>``                  |
359   | argument of ``re.sub()``              | * ``\g<1>``                      |
360   |                                       | * ``\1``                         |
361   +---------------------------------------+----------------------------------+
362
363.. index:: single: (?P=; in regular expressions
364
365``(?P=name)``
366   A backreference to a named group; it matches whatever text was matched by the
367   earlier group named *name*.
368
369.. index:: single: (?#; in regular expressions
370
371``(?#...)``
372   A comment; the contents of the parentheses are simply ignored.
373
374.. index:: single: (?=; in regular expressions
375
376``(?=...)``
377   Matches if ``...`` matches next, but doesn't consume any of the string.  This is
378   called a :dfn:`lookahead assertion`.  For example, ``Isaac (?=Asimov)`` will match
379   ``'Isaac '`` only if it's followed by ``'Asimov'``.
380
381.. index:: single: (?!; in regular expressions
382
383``(?!...)``
384   Matches if ``...`` doesn't match next.  This is a :dfn:`negative lookahead assertion`.
385   For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's *not*
386   followed by ``'Asimov'``.
387
388.. index:: single: (?<=; in regular expressions
389
390``(?<=...)``
391   Matches if the current position in the string is preceded by a match for ``...``
392   that ends at the current position.  This is called a :dfn:`positive lookbehind
393   assertion`. ``(?<=abc)def`` will find a match in ``'abcdef'``, since the
394   lookbehind will back up 3 characters and check if the contained pattern matches.
395   The contained pattern must only match strings of some fixed length, meaning that
396   ``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not.  Note that
397   patterns which start with positive lookbehind assertions will not match at the
398   beginning of the string being searched; you will most likely want to use the
399   :func:`search` function rather than the :func:`match` function:
400
401      >>> import re
402      >>> m = re.search('(?<=abc)def', 'abcdef')
403      >>> m.group(0)
404      'def'
405
406   This example looks for a word following a hyphen:
407
408      >>> m = re.search(r'(?<=-)\w+', 'spam-egg')
409      >>> m.group(0)
410      'egg'
411
412   .. versionchanged:: 3.5
413      Added support for group references of fixed length.
414
415.. index:: single: (?<!; in regular expressions
416
417``(?<!...)``
418   Matches if the current position in the string is not preceded by a match for
419   ``...``.  This is called a :dfn:`negative lookbehind assertion`.  Similar to
420   positive lookbehind assertions, the contained pattern must only match strings of
421   some fixed length.  Patterns which start with negative lookbehind assertions may
422   match at the beginning of the string being searched.
423
424``(?(id/name)yes-pattern|no-pattern)``
425   Will try to match with ``yes-pattern`` if the group with given *id* or
426   *name* exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is
427   optional and can be omitted. For example,
428   ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)`` is a poor email matching pattern, which
429   will match with ``'<user@host.com>'`` as well as ``'user@host.com'``, but
430   not with ``'<user@host.com'`` nor ``'user@host.com>'``.
431
432
433The special sequences consist of ``'\'`` and a character from the list below.
434If the ordinary character is not an ASCII digit or an ASCII letter, then the
435resulting RE will match the second character.  For example, ``\$`` matches the
436character ``'$'``.
437
438.. index:: single: \ (backslash); in regular expressions
439
440``\number``
441   Matches the contents of the group of the same number.  Groups are numbered
442   starting from 1.  For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
443   but not ``'thethe'`` (note the space after the group).  This special sequence
444   can only be used to match one of the first 99 groups.  If the first digit of
445   *number* is 0, or *number* is 3 octal digits long, it will not be interpreted as
446   a group match, but as the character with octal value *number*. Inside the
447   ``'['`` and ``']'`` of a character class, all numeric escapes are treated as
448   characters.
449
450.. index:: single: \A; in regular expressions
451
452``\A``
453   Matches only at the start of the string.
454
455.. index:: single: \b; in regular expressions
456
457``\b``
458   Matches the empty string, but only at the beginning or end of a word.
459   A word is defined as a sequence of word characters.  Note that formally,
460   ``\b`` is defined as the boundary between a ``\w`` and a ``\W`` character
461   (or vice versa), or between ``\w`` and the beginning/end of the string.
462   This means that ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
463   ``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.
464
465   By default Unicode alphanumerics are the ones used in Unicode patterns, but
466   this can be changed by using the :const:`ASCII` flag.  Word boundaries are
467   determined by the current locale if the :const:`LOCALE` flag is used.
468   Inside a character range, ``\b`` represents the backspace character, for
469   compatibility with Python's string literals.
470
471.. index:: single: \B; in regular expressions
472
473``\B``
474   Matches the empty string, but only when it is *not* at the beginning or end
475   of a word.  This means that ``r'py\B'`` matches ``'python'``, ``'py3'``,
476   ``'py2'``, but not ``'py'``, ``'py.'``, or ``'py!'``.
477   ``\B`` is just the opposite of ``\b``, so word characters in Unicode
478   patterns are Unicode alphanumerics or the underscore, although this can
479   be changed by using the :const:`ASCII` flag.  Word boundaries are
480   determined by the current locale if the :const:`LOCALE` flag is used.
481
482.. index:: single: \d; in regular expressions
483
484``\d``
485   For Unicode (str) patterns:
486      Matches any Unicode decimal digit (that is, any character in
487      Unicode character category [Nd]).  This includes ``[0-9]``, and
488      also many other digit characters.  If the :const:`ASCII` flag is
489      used only ``[0-9]`` is matched.
490
491   For 8-bit (bytes) patterns:
492      Matches any decimal digit; this is equivalent to ``[0-9]``.
493
494.. index:: single: \D; in regular expressions
495
496``\D``
497   Matches any character which is not a decimal digit. This is
498   the opposite of ``\d``. If the :const:`ASCII` flag is used this
499   becomes the equivalent of ``[^0-9]``.
500
501.. index:: single: \s; in regular expressions
502
503``\s``
504   For Unicode (str) patterns:
505      Matches Unicode whitespace characters (which includes
506      ``[ \t\n\r\f\v]``, and also many other characters, for example the
507      non-breaking spaces mandated by typography rules in many
508      languages). If the :const:`ASCII` flag is used, only
509      ``[ \t\n\r\f\v]`` is matched.
510
511   For 8-bit (bytes) patterns:
512      Matches characters considered whitespace in the ASCII character set;
513      this is equivalent to ``[ \t\n\r\f\v]``.
514
515.. index:: single: \S; in regular expressions
516
517``\S``
518   Matches any character which is not a whitespace character. This is
519   the opposite of ``\s``. If the :const:`ASCII` flag is used this
520   becomes the equivalent of ``[^ \t\n\r\f\v]``.
521
522.. index:: single: \w; in regular expressions
523
524``\w``
525   For Unicode (str) patterns:
526      Matches Unicode word characters; this includes most characters
527      that can be part of a word in any language, as well as numbers and
528      the underscore. If the :const:`ASCII` flag is used, only
529      ``[a-zA-Z0-9_]`` is matched.
530
531   For 8-bit (bytes) patterns:
532      Matches characters considered alphanumeric in the ASCII character set;
533      this is equivalent to ``[a-zA-Z0-9_]``.  If the :const:`LOCALE` flag is
534      used, matches characters considered alphanumeric in the current locale
535      and the underscore.
536
537.. index:: single: \W; in regular expressions
538
539``\W``
540   Matches any character which is not a word character. This is
541   the opposite of ``\w``. If the :const:`ASCII` flag is used this
542   becomes the equivalent of ``[^a-zA-Z0-9_]``.  If the :const:`LOCALE` flag is
543   used, matches characters which are neither alphanumeric in the current locale
544   nor the underscore.
545
546.. index:: single: \Z; in regular expressions
547
548``\Z``
549   Matches only at the end of the string.
550
551.. index::
552   single: \a; in regular expressions
553   single: \b; in regular expressions
554   single: \f; in regular expressions
555   single: \n; in regular expressions
556   single: \N; in regular expressions
557   single: \r; in regular expressions
558   single: \t; in regular expressions
559   single: \u; in regular expressions
560   single: \U; in regular expressions
561   single: \v; in regular expressions
562   single: \x; in regular expressions
563   single: \\; in regular expressions
564
565Most of the standard escapes supported by Python string literals are also
566accepted by the regular expression parser::
567
568   \a      \b      \f      \n
569   \N      \r      \t      \u
570   \U      \v      \x      \\
571
572(Note that ``\b`` is used to represent word boundaries, and means "backspace"
573only inside character classes.)
574
575``'\u'``, ``'\U'``, and ``'\N'`` escape sequences are only recognized in Unicode
576patterns.  In bytes patterns they are errors.  Unknown escapes of ASCII
577letters are reserved for future use and treated as errors.
578
579Octal escapes are included in a limited form.  If the first digit is a 0, or if
580there are three octal digits, it is considered an octal escape. Otherwise, it is
581a group reference.  As for string literals, octal escapes are always at most
582three digits in length.
583
584.. versionchanged:: 3.3
585   The ``'\u'`` and ``'\U'`` escape sequences have been added.
586
587.. versionchanged:: 3.6
588   Unknown escapes consisting of ``'\'`` and an ASCII letter now are errors.
589
590.. versionchanged:: 3.8
591   The ``'\N{name}'`` escape sequence has been added. As in string literals,
592   it expands to the named Unicode character (e.g. ``'\N{EM DASH}'``).
593
594
595.. _contents-of-module-re:
596
597Module Contents
598---------------
599
600The module defines several functions, constants, and an exception. Some of the
601functions are simplified versions of the full featured methods for compiled
602regular expressions.  Most non-trivial applications always use the compiled
603form.
604
605.. versionchanged:: 3.6
606   Flag constants are now instances of :class:`RegexFlag`, which is a subclass of
607   :class:`enum.IntFlag`.
608
609.. function:: compile(pattern, flags=0)
610
611   Compile a regular expression pattern into a :ref:`regular expression object
612   <re-objects>`, which can be used for matching using its
613   :func:`~Pattern.match`, :func:`~Pattern.search` and other methods, described
614   below.
615
616   The expression's behaviour can be modified by specifying a *flags* value.
617   Values can be any of the following variables, combined using bitwise OR (the
618   ``|`` operator).
619
620   The sequence ::
621
622      prog = re.compile(pattern)
623      result = prog.match(string)
624
625   is equivalent to ::
626
627      result = re.match(pattern, string)
628
629   but using :func:`re.compile` and saving the resulting regular expression
630   object for reuse is more efficient when the expression will be used several
631   times in a single program.
632
633   .. note::
634
635      The compiled versions of the most recent patterns passed to
636      :func:`re.compile` and the module-level matching functions are cached, so
637      programs that use only a few regular expressions at a time needn't worry
638      about compiling regular expressions.
639
640
641.. data:: A
642          ASCII
643
644   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S``
645   perform ASCII-only matching instead of full Unicode matching.  This is only
646   meaningful for Unicode patterns, and is ignored for byte patterns.
647   Corresponds to the inline flag ``(?a)``.
648
649   Note that for backward compatibility, the :const:`re.U` flag still
650   exists (as well as its synonym :const:`re.UNICODE` and its embedded
651   counterpart ``(?u)``), but these are redundant in Python 3 since
652   matches are Unicode by default for strings (and Unicode matching
653   isn't allowed for bytes).
654
655
656.. data:: DEBUG
657
658   Display debug information about compiled expression.
659   No corresponding inline flag.
660
661
662.. data:: I
663          IGNORECASE
664
665   Perform case-insensitive matching; expressions like ``[A-Z]`` will also
666   match lowercase letters.  Full Unicode matching (such as ``Ü`` matching
667   ``ü``) also works unless the :const:`re.ASCII` flag is used to disable
668   non-ASCII matches.  The current locale does not change the effect of this
669   flag unless the :const:`re.LOCALE` flag is also used.
670   Corresponds to the inline flag ``(?i)``.
671
672   Note that when the Unicode patterns ``[a-z]`` or ``[A-Z]`` are used in
673   combination with the :const:`IGNORECASE` flag, they will match the 52 ASCII
674   letters and 4 additional non-ASCII letters: 'İ' (U+0130, Latin capital
675   letter I with dot above), 'ı' (U+0131, Latin small letter dotless i),
676   'ſ' (U+017F, Latin small letter long s) and 'K' (U+212A, Kelvin sign).
677   If the :const:`ASCII` flag is used, only letters 'a' to 'z'
678   and 'A' to 'Z' are matched.
679
680.. data:: L
681          LOCALE
682
683   Make ``\w``, ``\W``, ``\b``, ``\B`` and case-insensitive matching
684   dependent on the current locale.  This flag can be used only with bytes
685   patterns.  The use of this flag is discouraged as the locale mechanism
686   is very unreliable, it only handles one "culture" at a time, and it only
687   works with 8-bit locales.  Unicode matching is already enabled by default
688   in Python 3 for Unicode (str) patterns, and it is able to handle different
689   locales/languages.
690   Corresponds to the inline flag ``(?L)``.
691
692   .. versionchanged:: 3.6
693      :const:`re.LOCALE` can be used only with bytes patterns and is
694      not compatible with :const:`re.ASCII`.
695
696   .. versionchanged:: 3.7
697      Compiled regular expression objects with the :const:`re.LOCALE` flag no
698      longer depend on the locale at compile time.  Only the locale at
699      matching time affects the result of matching.
700
701
702.. data:: M
703          MULTILINE
704
705   When specified, the pattern character ``'^'`` matches at the beginning of the
706   string and at the beginning of each line (immediately following each newline);
707   and the pattern character ``'$'`` matches at the end of the string and at the
708   end of each line (immediately preceding each newline).  By default, ``'^'``
709   matches only at the beginning of the string, and ``'$'`` only at the end of the
710   string and immediately before the newline (if any) at the end of the string.
711   Corresponds to the inline flag ``(?m)``.
712
713
714.. data:: S
715          DOTALL
716
717   Make the ``'.'`` special character match any character at all, including a
718   newline; without this flag, ``'.'`` will match anything *except* a newline.
719   Corresponds to the inline flag ``(?s)``.
720
721
722.. data:: X
723          VERBOSE
724
725   .. index:: single: # (hash); in regular expressions
726
727   This flag allows you to write regular expressions that look nicer and are
728   more readable by allowing you to visually separate logical sections of the
729   pattern and add comments. Whitespace within the pattern is ignored, except
730   when in a character class, or when preceded by an unescaped backslash,
731   or within tokens like ``*?``, ``(?:`` or ``(?P<...>``.
732   When a line contains a ``#`` that is not in a character class and is not
733   preceded by an unescaped backslash, all characters from the leftmost such
734   ``#`` through the end of the line are ignored.
735
736   This means that the two following regular expression objects that match a
737   decimal number are functionally equal::
738
739      a = re.compile(r"""\d +  # the integral part
740                         \.    # the decimal point
741                         \d *  # some fractional digits""", re.X)
742      b = re.compile(r"\d+\.\d*")
743
744   Corresponds to the inline flag ``(?x)``.
745
746
747.. function:: search(pattern, string, flags=0)
748
749   Scan through *string* looking for the first location where the regular expression
750   *pattern* produces a match, and return a corresponding :ref:`match object
751   <match-objects>`.  Return ``None`` if no position in the string matches the
752   pattern; note that this is different from finding a zero-length match at some
753   point in the string.
754
755
756.. function:: match(pattern, string, flags=0)
757
758   If zero or more characters at the beginning of *string* match the regular
759   expression *pattern*, return a corresponding :ref:`match object
760   <match-objects>`.  Return ``None`` if the string does not match the pattern;
761   note that this is different from a zero-length match.
762
763   Note that even in :const:`MULTILINE` mode, :func:`re.match` will only match
764   at the beginning of the string and not at the beginning of each line.
765
766   If you want to locate a match anywhere in *string*, use :func:`search`
767   instead (see also :ref:`search-vs-match`).
768
769
770.. function:: fullmatch(pattern, string, flags=0)
771
772   If the whole *string* matches the regular expression *pattern*, return a
773   corresponding :ref:`match object <match-objects>`.  Return ``None`` if the
774   string does not match the pattern; note that this is different from a
775   zero-length match.
776
777   .. versionadded:: 3.4
778
779
780.. function:: split(pattern, string, maxsplit=0, flags=0)
781
782   Split *string* by the occurrences of *pattern*.  If capturing parentheses are
783   used in *pattern*, then the text of all groups in the pattern are also returned
784   as part of the resulting list. If *maxsplit* is nonzero, at most *maxsplit*
785   splits occur, and the remainder of the string is returned as the final element
786   of the list. ::
787
788      >>> re.split(r'\W+', 'Words, words, words.')
789      ['Words', 'words', 'words', '']
790      >>> re.split(r'(\W+)', 'Words, words, words.')
791      ['Words', ', ', 'words', ', ', 'words', '.', '']
792      >>> re.split(r'\W+', 'Words, words, words.', 1)
793      ['Words', 'words, words.']
794      >>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
795      ['0', '3', '9']
796
797   If there are capturing groups in the separator and it matches at the start of
798   the string, the result will start with an empty string.  The same holds for
799   the end of the string::
800
801      >>> re.split(r'(\W+)', '...words, words...')
802      ['', '...', 'words', ', ', 'words', '...', '']
803
804   That way, separator components are always found at the same relative
805   indices within the result list.
806
807   Empty matches for the pattern split the string only when not adjacent
808   to a previous empty match.
809
810      >>> re.split(r'\b', 'Words, words, words.')
811      ['', 'Words', ', ', 'words', ', ', 'words', '.']
812      >>> re.split(r'\W*', '...words...')
813      ['', '', 'w', 'o', 'r', 'd', 's', '', '']
814      >>> re.split(r'(\W*)', '...words...')
815      ['', '...', '', '', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '', '', '']
816
817   .. versionchanged:: 3.1
818      Added the optional flags argument.
819
820   .. versionchanged:: 3.7
821      Added support of splitting on a pattern that could match an empty string.
822
823
824.. function:: findall(pattern, string, flags=0)
825
826   Return all non-overlapping matches of *pattern* in *string*, as a list of
827   strings.  The *string* is scanned left-to-right, and matches are returned in
828   the order found.  If one or more groups are present in the pattern, return a
829   list of groups; this will be a list of tuples if the pattern has more than
830   one group.  Empty matches are included in the result.
831
832   .. versionchanged:: 3.7
833      Non-empty matches can now start just after a previous empty match.
834
835
836.. function:: finditer(pattern, string, flags=0)
837
838   Return an :term:`iterator` yielding :ref:`match objects <match-objects>` over
839   all non-overlapping matches for the RE *pattern* in *string*.  The *string*
840   is scanned left-to-right, and matches are returned in the order found.  Empty
841   matches are included in the result.
842
843   .. versionchanged:: 3.7
844      Non-empty matches can now start just after a previous empty match.
845
846
847.. function:: sub(pattern, repl, string, count=0, flags=0)
848
849   Return the string obtained by replacing the leftmost non-overlapping occurrences
850   of *pattern* in *string* by the replacement *repl*.  If the pattern isn't found,
851   *string* is returned unchanged.  *repl* can be a string or a function; if it is
852   a string, any backslash escapes in it are processed.  That is, ``\n`` is
853   converted to a single newline character, ``\r`` is converted to a carriage return, and
854   so forth.  Unknown escapes of ASCII letters are reserved for future use and
855   treated as errors.  Other unknown escapes such as ``\&`` are left alone.
856   Backreferences, such
857   as ``\6``, are replaced with the substring matched by group 6 in the pattern.
858   For example::
859
860      >>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
861      ...        r'static PyObject*\npy_\1(void)\n{',
862      ...        'def myfunc():')
863      'static PyObject*\npy_myfunc(void)\n{'
864
865   If *repl* is a function, it is called for every non-overlapping occurrence of
866   *pattern*.  The function takes a single :ref:`match object <match-objects>`
867   argument, and returns the replacement string.  For example::
868
869      >>> def dashrepl(matchobj):
870      ...     if matchobj.group(0) == '-': return ' '
871      ...     else: return '-'
872      >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
873      'pro--gram files'
874      >>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
875      'Baked Beans & Spam'
876
877   The pattern may be a string or a :ref:`pattern object <re-objects>`.
878
879   The optional argument *count* is the maximum number of pattern occurrences to be
880   replaced; *count* must be a non-negative integer.  If omitted or zero, all
881   occurrences will be replaced. Empty matches for the pattern are replaced only
882   when not adjacent to a previous empty match, so ``sub('x*', '-', 'abxd')`` returns
883   ``'-a-b--d-'``.
884
885   .. index:: single: \g; in regular expressions
886
887   In string-type *repl* arguments, in addition to the character escapes and
888   backreferences described above,
889   ``\g<name>`` will use the substring matched by the group named ``name``, as
890   defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
891   group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
892   in a replacement such as ``\g<2>0``.  ``\20`` would be interpreted as a
893   reference to group 20, not a reference to group 2 followed by the literal
894   character ``'0'``.  The backreference ``\g<0>`` substitutes in the entire
895   substring matched by the RE.
896
897   .. versionchanged:: 3.1
898      Added the optional flags argument.
899
900   .. versionchanged:: 3.5
901      Unmatched groups are replaced with an empty string.
902
903   .. versionchanged:: 3.6
904      Unknown escapes in *pattern* consisting of ``'\'`` and an ASCII letter
905      now are errors.
906
907   .. versionchanged:: 3.7
908      Unknown escapes in *repl* consisting of ``'\'`` and an ASCII letter
909      now are errors.
910
911   .. versionchanged:: 3.7
912      Empty matches for the pattern are replaced when adjacent to a previous
913      non-empty match.
914
915
916.. function:: subn(pattern, repl, string, count=0, flags=0)
917
918   Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
919   number_of_subs_made)``.
920
921   .. versionchanged:: 3.1
922      Added the optional flags argument.
923
924   .. versionchanged:: 3.5
925      Unmatched groups are replaced with an empty string.
926
927
928.. function:: escape(pattern)
929
930   Escape special characters in *pattern*.
931   This is useful if you want to match an arbitrary literal string that may
932   have regular expression metacharacters in it.  For example::
933
934      >>> print(re.escape('http://www.python.org'))
935      http://www\.python\.org
936
937      >>> legal_chars = string.ascii_lowercase + string.digits + "!#$%&'*+-.^_`|~:"
938      >>> print('[%s]+' % re.escape(legal_chars))
939      [abcdefghijklmnopqrstuvwxyz0123456789!\#\$%\&'\*\+\-\.\^_`\|\~:]+
940
941      >>> operators = ['+', '-', '*', '/', '**']
942      >>> print('|'.join(map(re.escape, sorted(operators, reverse=True))))
943      /|\-|\+|\*\*|\*
944
945   This function must not be used for the replacement string in :func:`sub`
946   and :func:`subn`, only backslashes should be escaped.  For example::
947
948      >>> digits_re = r'\d+'
949      >>> sample = '/usr/sbin/sendmail - 0 errors, 12 warnings'
950      >>> print(re.sub(digits_re, digits_re.replace('\\', r'\\'), sample))
951      /usr/sbin/sendmail - \d+ errors, \d+ warnings
952
953   .. versionchanged:: 3.3
954      The ``'_'`` character is no longer escaped.
955
956   .. versionchanged:: 3.7
957      Only characters that can have special meaning in a regular expression
958      are escaped. As a result, ``'!'``, ``'"'``, ``'%'``, ``"'"``, ``','``,
959      ``'/'``, ``':'``, ``';'``, ``'<'``, ``'='``, ``'>'``, ``'@'``, and
960      ``"`"`` are no longer escaped.
961
962
963.. function:: purge()
964
965   Clear the regular expression cache.
966
967
968.. exception:: error(msg, pattern=None, pos=None)
969
970   Exception raised when a string passed to one of the functions here is not a
971   valid regular expression (for example, it might contain unmatched parentheses)
972   or when some other error occurs during compilation or matching.  It is never an
973   error if a string contains no match for a pattern.  The error instance has
974   the following additional attributes:
975
976   .. attribute:: msg
977
978      The unformatted error message.
979
980   .. attribute:: pattern
981
982      The regular expression pattern.
983
984   .. attribute:: pos
985
986      The index in *pattern* where compilation failed (may be ``None``).
987
988   .. attribute:: lineno
989
990      The line corresponding to *pos* (may be ``None``).
991
992   .. attribute:: colno
993
994      The column corresponding to *pos* (may be ``None``).
995
996   .. versionchanged:: 3.5
997      Added additional attributes.
998
999.. _re-objects:
1000
1001Regular Expression Objects
1002--------------------------
1003
1004Compiled regular expression objects support the following methods and
1005attributes:
1006
1007.. method:: Pattern.search(string[, pos[, endpos]])
1008
1009   Scan through *string* looking for the first location where this regular
1010   expression produces a match, and return a corresponding :ref:`match object
1011   <match-objects>`.  Return ``None`` if no position in the string matches the
1012   pattern; note that this is different from finding a zero-length match at some
1013   point in the string.
1014
1015   The optional second parameter *pos* gives an index in the string where the
1016   search is to start; it defaults to ``0``.  This is not completely equivalent to
1017   slicing the string; the ``'^'`` pattern character matches at the real beginning
1018   of the string and at positions just after a newline, but not necessarily at the
1019   index where the search is to start.
1020
1021   The optional parameter *endpos* limits how far the string will be searched; it
1022   will be as if the string is *endpos* characters long, so only the characters
1023   from *pos* to ``endpos - 1`` will be searched for a match.  If *endpos* is less
1024   than *pos*, no match will be found; otherwise, if *rx* is a compiled regular
1025   expression object, ``rx.search(string, 0, 50)`` is equivalent to
1026   ``rx.search(string[:50], 0)``. ::
1027
1028      >>> pattern = re.compile("d")
1029      >>> pattern.search("dog")     # Match at index 0
1030      <re.Match object; span=(0, 1), match='d'>
1031      >>> pattern.search("dog", 1)  # No match; search doesn't include the "d"
1032
1033
1034.. method:: Pattern.match(string[, pos[, endpos]])
1035
1036   If zero or more characters at the *beginning* of *string* match this regular
1037   expression, return a corresponding :ref:`match object <match-objects>`.
1038   Return ``None`` if the string does not match the pattern; note that this is
1039   different from a zero-length match.
1040
1041   The optional *pos* and *endpos* parameters have the same meaning as for the
1042   :meth:`~Pattern.search` method. ::
1043
1044      >>> pattern = re.compile("o")
1045      >>> pattern.match("dog")      # No match as "o" is not at the start of "dog".
1046      >>> pattern.match("dog", 1)   # Match as "o" is the 2nd character of "dog".
1047      <re.Match object; span=(1, 2), match='o'>
1048
1049   If you want to locate a match anywhere in *string*, use
1050   :meth:`~Pattern.search` instead (see also :ref:`search-vs-match`).
1051
1052
1053.. method:: Pattern.fullmatch(string[, pos[, endpos]])
1054
1055   If the whole *string* matches this regular expression, return a corresponding
1056   :ref:`match object <match-objects>`.  Return ``None`` if the string does not
1057   match the pattern; note that this is different from a zero-length match.
1058
1059   The optional *pos* and *endpos* parameters have the same meaning as for the
1060   :meth:`~Pattern.search` method. ::
1061
1062      >>> pattern = re.compile("o[gh]")
1063      >>> pattern.fullmatch("dog")      # No match as "o" is not at the start of "dog".
1064      >>> pattern.fullmatch("ogre")     # No match as not the full string matches.
1065      >>> pattern.fullmatch("doggie", 1, 3)   # Matches within given limits.
1066      <re.Match object; span=(1, 3), match='og'>
1067
1068   .. versionadded:: 3.4
1069
1070
1071.. method:: Pattern.split(string, maxsplit=0)
1072
1073   Identical to the :func:`split` function, using the compiled pattern.
1074
1075
1076.. method:: Pattern.findall(string[, pos[, endpos]])
1077
1078   Similar to the :func:`findall` function, using the compiled pattern, but
1079   also accepts optional *pos* and *endpos* parameters that limit the search
1080   region like for :meth:`search`.
1081
1082
1083.. method:: Pattern.finditer(string[, pos[, endpos]])
1084
1085   Similar to the :func:`finditer` function, using the compiled pattern, but
1086   also accepts optional *pos* and *endpos* parameters that limit the search
1087   region like for :meth:`search`.
1088
1089
1090.. method:: Pattern.sub(repl, string, count=0)
1091
1092   Identical to the :func:`sub` function, using the compiled pattern.
1093
1094
1095.. method:: Pattern.subn(repl, string, count=0)
1096
1097   Identical to the :func:`subn` function, using the compiled pattern.
1098
1099
1100.. attribute:: Pattern.flags
1101
1102   The regex matching flags.  This is a combination of the flags given to
1103   :func:`.compile`, any ``(?...)`` inline flags in the pattern, and implicit
1104   flags such as :data:`UNICODE` if the pattern is a Unicode string.
1105
1106
1107.. attribute:: Pattern.groups
1108
1109   The number of capturing groups in the pattern.
1110
1111
1112.. attribute:: Pattern.groupindex
1113
1114   A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
1115   numbers.  The dictionary is empty if no symbolic groups were used in the
1116   pattern.
1117
1118
1119.. attribute:: Pattern.pattern
1120
1121   The pattern string from which the pattern object was compiled.
1122
1123
1124.. versionchanged:: 3.7
1125   Added support of :func:`copy.copy` and :func:`copy.deepcopy`.  Compiled
1126   regular expression objects are considered atomic.
1127
1128
1129.. _match-objects:
1130
1131Match Objects
1132-------------
1133
1134Match objects always have a boolean value of ``True``.
1135Since :meth:`~Pattern.match` and :meth:`~Pattern.search` return ``None``
1136when there is no match, you can test whether there was a match with a simple
1137``if`` statement::
1138
1139   match = re.search(pattern, string)
1140   if match:
1141       process(match)
1142
1143Match objects support the following methods and attributes:
1144
1145
1146.. method:: Match.expand(template)
1147
1148   Return the string obtained by doing backslash substitution on the template
1149   string *template*, as done by the :meth:`~Pattern.sub` method.
1150   Escapes such as ``\n`` are converted to the appropriate characters,
1151   and numeric backreferences (``\1``, ``\2``) and named backreferences
1152   (``\g<1>``, ``\g<name>``) are replaced by the contents of the
1153   corresponding group.
1154
1155   .. versionchanged:: 3.5
1156      Unmatched groups are replaced with an empty string.
1157
1158.. method:: Match.group([group1, ...])
1159
1160   Returns one or more subgroups of the match.  If there is a single argument, the
1161   result is a single string; if there are multiple arguments, the result is a
1162   tuple with one item per argument. Without arguments, *group1* defaults to zero
1163   (the whole match is returned). If a *groupN* argument is zero, the corresponding
1164   return value is the entire matching string; if it is in the inclusive range
1165   [1..99], it is the string matching the corresponding parenthesized group.  If a
1166   group number is negative or larger than the number of groups defined in the
1167   pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
1168   part of the pattern that did not match, the corresponding result is ``None``.
1169   If a group is contained in a part of the pattern that matched multiple times,
1170   the last match is returned. ::
1171
1172      >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
1173      >>> m.group(0)       # The entire match
1174      'Isaac Newton'
1175      >>> m.group(1)       # The first parenthesized subgroup.
1176      'Isaac'
1177      >>> m.group(2)       # The second parenthesized subgroup.
1178      'Newton'
1179      >>> m.group(1, 2)    # Multiple arguments give us a tuple.
1180      ('Isaac', 'Newton')
1181
1182   If the regular expression uses the ``(?P<name>...)`` syntax, the *groupN*
1183   arguments may also be strings identifying groups by their group name.  If a
1184   string argument is not used as a group name in the pattern, an :exc:`IndexError`
1185   exception is raised.
1186
1187   A moderately complicated example::
1188
1189      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
1190      >>> m.group('first_name')
1191      'Malcolm'
1192      >>> m.group('last_name')
1193      'Reynolds'
1194
1195   Named groups can also be referred to by their index::
1196
1197      >>> m.group(1)
1198      'Malcolm'
1199      >>> m.group(2)
1200      'Reynolds'
1201
1202   If a group matches multiple times, only the last match is accessible::
1203
1204      >>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
1205      >>> m.group(1)                        # Returns only the last match.
1206      'c3'
1207
1208
1209.. method:: Match.__getitem__(g)
1210
1211   This is identical to ``m.group(g)``.  This allows easier access to
1212   an individual group from a match::
1213
1214      >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
1215      >>> m[0]       # The entire match
1216      'Isaac Newton'
1217      >>> m[1]       # The first parenthesized subgroup.
1218      'Isaac'
1219      >>> m[2]       # The second parenthesized subgroup.
1220      'Newton'
1221
1222   .. versionadded:: 3.6
1223
1224
1225.. method:: Match.groups(default=None)
1226
1227   Return a tuple containing all the subgroups of the match, from 1 up to however
1228   many groups are in the pattern.  The *default* argument is used for groups that
1229   did not participate in the match; it defaults to ``None``.
1230
1231   For example::
1232
1233      >>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
1234      >>> m.groups()
1235      ('24', '1632')
1236
1237   If we make the decimal place and everything after it optional, not all groups
1238   might participate in the match.  These groups will default to ``None`` unless
1239   the *default* argument is given::
1240
1241      >>> m = re.match(r"(\d+)\.?(\d+)?", "24")
1242      >>> m.groups()      # Second group defaults to None.
1243      ('24', None)
1244      >>> m.groups('0')   # Now, the second group defaults to '0'.
1245      ('24', '0')
1246
1247
1248.. method:: Match.groupdict(default=None)
1249
1250   Return a dictionary containing all the *named* subgroups of the match, keyed by
1251   the subgroup name.  The *default* argument is used for groups that did not
1252   participate in the match; it defaults to ``None``.  For example::
1253
1254      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
1255      >>> m.groupdict()
1256      {'first_name': 'Malcolm', 'last_name': 'Reynolds'}
1257
1258
1259.. method:: Match.start([group])
1260            Match.end([group])
1261
1262   Return the indices of the start and end of the substring matched by *group*;
1263   *group* defaults to zero (meaning the whole matched substring). Return ``-1`` if
1264   *group* exists but did not contribute to the match.  For a match object *m*, and
1265   a group *g* that did contribute to the match, the substring matched by group *g*
1266   (equivalent to ``m.group(g)``) is ::
1267
1268      m.string[m.start(g):m.end(g)]
1269
1270   Note that ``m.start(group)`` will equal ``m.end(group)`` if *group* matched a
1271   null string.  For example, after ``m = re.search('b(c?)', 'cba')``,
1272   ``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
1273   2, and ``m.start(2)`` raises an :exc:`IndexError` exception.
1274
1275   An example that will remove *remove_this* from email addresses::
1276
1277      >>> email = "tony@tiremove_thisger.net"
1278      >>> m = re.search("remove_this", email)
1279      >>> email[:m.start()] + email[m.end():]
1280      'tony@tiger.net'
1281
1282
1283.. method:: Match.span([group])
1284
1285   For a match *m*, return the 2-tuple ``(m.start(group), m.end(group))``. Note
1286   that if *group* did not contribute to the match, this is ``(-1, -1)``.
1287   *group* defaults to zero, the entire match.
1288
1289
1290.. attribute:: Match.pos
1291
1292   The value of *pos* which was passed to the :meth:`~Pattern.search` or
1293   :meth:`~Pattern.match` method of a :ref:`regex object <re-objects>`.  This is
1294   the index into the string at which the RE engine started looking for a match.
1295
1296
1297.. attribute:: Match.endpos
1298
1299   The value of *endpos* which was passed to the :meth:`~Pattern.search` or
1300   :meth:`~Pattern.match` method of a :ref:`regex object <re-objects>`.  This is
1301   the index into the string beyond which the RE engine will not go.
1302
1303
1304.. attribute:: Match.lastindex
1305
1306   The integer index of the last matched capturing group, or ``None`` if no group
1307   was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
1308   ``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
1309   the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
1310   string.
1311
1312
1313.. attribute:: Match.lastgroup
1314
1315   The name of the last matched capturing group, or ``None`` if the group didn't
1316   have a name, or if no group was matched at all.
1317
1318
1319.. attribute:: Match.re
1320
1321   The :ref:`regular expression object <re-objects>` whose :meth:`~Pattern.match` or
1322   :meth:`~Pattern.search` method produced this match instance.
1323
1324
1325.. attribute:: Match.string
1326
1327   The string passed to :meth:`~Pattern.match` or :meth:`~Pattern.search`.
1328
1329
1330.. versionchanged:: 3.7
1331   Added support of :func:`copy.copy` and :func:`copy.deepcopy`.  Match objects
1332   are considered atomic.
1333
1334
1335.. _re-examples:
1336
1337Regular Expression Examples
1338---------------------------
1339
1340
1341Checking for a Pair
1342^^^^^^^^^^^^^^^^^^^
1343
1344In this example, we'll use the following helper function to display match
1345objects a little more gracefully::
1346
1347   def displaymatch(match):
1348       if match is None:
1349           return None
1350       return '<Match: %r, groups=%r>' % (match.group(), match.groups())
1351
1352Suppose you are writing a poker program where a player's hand is represented as
1353a 5-character string with each character representing a card, "a" for ace, "k"
1354for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9"
1355representing the card with that value.
1356
1357To see if a given string is a valid hand, one could do the following::
1358
1359   >>> valid = re.compile(r"^[a2-9tjqk]{5}$")
1360   >>> displaymatch(valid.match("akt5q"))  # Valid.
1361   "<Match: 'akt5q', groups=()>"
1362   >>> displaymatch(valid.match("akt5e"))  # Invalid.
1363   >>> displaymatch(valid.match("akt"))    # Invalid.
1364   >>> displaymatch(valid.match("727ak"))  # Valid.
1365   "<Match: '727ak', groups=()>"
1366
1367That last hand, ``"727ak"``, contained a pair, or two of the same valued cards.
1368To match this with a regular expression, one could use backreferences as such::
1369
1370   >>> pair = re.compile(r".*(.).*\1")
1371   >>> displaymatch(pair.match("717ak"))     # Pair of 7s.
1372   "<Match: '717', groups=('7',)>"
1373   >>> displaymatch(pair.match("718ak"))     # No pairs.
1374   >>> displaymatch(pair.match("354aa"))     # Pair of aces.
1375   "<Match: '354aa', groups=('a',)>"
1376
1377To find out what card the pair consists of, one could use the
1378:meth:`~Match.group` method of the match object in the following manner::
1379
1380   >>> pair = re.compile(r".*(.).*\1")
1381   >>> pair.match("717ak").group(1)
1382   '7'
1383
1384   # Error because re.match() returns None, which doesn't have a group() method:
1385   >>> pair.match("718ak").group(1)
1386   Traceback (most recent call last):
1387     File "<pyshell#23>", line 1, in <module>
1388       re.match(r".*(.).*\1", "718ak").group(1)
1389   AttributeError: 'NoneType' object has no attribute 'group'
1390
1391   >>> pair.match("354aa").group(1)
1392   'a'
1393
1394
1395Simulating scanf()
1396^^^^^^^^^^^^^^^^^^
1397
1398.. index:: single: scanf()
1399
1400Python does not currently have an equivalent to :c:func:`scanf`.  Regular
1401expressions are generally more powerful, though also more verbose, than
1402:c:func:`scanf` format strings.  The table below offers some more-or-less
1403equivalent mappings between :c:func:`scanf` format tokens and regular
1404expressions.
1405
1406+--------------------------------+---------------------------------------------+
1407| :c:func:`scanf` Token          | Regular Expression                          |
1408+================================+=============================================+
1409| ``%c``                         | ``.``                                       |
1410+--------------------------------+---------------------------------------------+
1411| ``%5c``                        | ``.{5}``                                    |
1412+--------------------------------+---------------------------------------------+
1413| ``%d``                         | ``[-+]?\d+``                                |
1414+--------------------------------+---------------------------------------------+
1415| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
1416+--------------------------------+---------------------------------------------+
1417| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
1418+--------------------------------+---------------------------------------------+
1419| ``%o``                         | ``[-+]?[0-7]+``                             |
1420+--------------------------------+---------------------------------------------+
1421| ``%s``                         | ``\S+``                                     |
1422+--------------------------------+---------------------------------------------+
1423| ``%u``                         | ``\d+``                                     |
1424+--------------------------------+---------------------------------------------+
1425| ``%x``, ``%X``                 | ``[-+]?(0[xX])?[\dA-Fa-f]+``                |
1426+--------------------------------+---------------------------------------------+
1427
1428To extract the filename and numbers from a string like ::
1429
1430   /usr/sbin/sendmail - 0 errors, 4 warnings
1431
1432you would use a :c:func:`scanf` format like ::
1433
1434   %s - %d errors, %d warnings
1435
1436The equivalent regular expression would be ::
1437
1438   (\S+) - (\d+) errors, (\d+) warnings
1439
1440
1441.. _search-vs-match:
1442
1443search() vs. match()
1444^^^^^^^^^^^^^^^^^^^^
1445
1446.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>
1447
1448Python offers two different primitive operations based on regular expressions:
1449:func:`re.match` checks for a match only at the beginning of the string, while
1450:func:`re.search` checks for a match anywhere in the string (this is what Perl
1451does by default).
1452
1453For example::
1454
1455   >>> re.match("c", "abcdef")    # No match
1456   >>> re.search("c", "abcdef")   # Match
1457   <re.Match object; span=(2, 3), match='c'>
1458
1459Regular expressions beginning with ``'^'`` can be used with :func:`search` to
1460restrict the match at the beginning of the string::
1461
1462   >>> re.match("c", "abcdef")    # No match
1463   >>> re.search("^c", "abcdef")  # No match
1464   >>> re.search("^a", "abcdef")  # Match
1465   <re.Match object; span=(0, 1), match='a'>
1466
1467Note however that in :const:`MULTILINE` mode :func:`match` only matches at the
1468beginning of the string, whereas using :func:`search` with a regular expression
1469beginning with ``'^'`` will match at the beginning of each line. ::
1470
1471   >>> re.match('X', 'A\nB\nX', re.MULTILINE)  # No match
1472   >>> re.search('^X', 'A\nB\nX', re.MULTILINE)  # Match
1473   <re.Match object; span=(4, 5), match='X'>
1474
1475
1476Making a Phonebook
1477^^^^^^^^^^^^^^^^^^
1478
1479:func:`split` splits a string into a list delimited by the passed pattern.  The
1480method is invaluable for converting textual data into data structures that can be
1481easily read and modified by Python as demonstrated in the following example that
1482creates a phonebook.
1483
1484First, here is the input.  Normally it may come from a file, here we are using
1485triple-quoted string syntax
1486
1487.. doctest::
1488
1489   >>> text = """Ross McFluff: 834.345.1254 155 Elm Street
1490   ...
1491   ... Ronald Heathmore: 892.345.3428 436 Finley Avenue
1492   ... Frank Burger: 925.541.7625 662 South Dogwood Way
1493   ...
1494   ...
1495   ... Heather Albrecht: 548.326.4584 919 Park Place"""
1496
1497The entries are separated by one or more newlines. Now we convert the string
1498into a list with each nonempty line having its own entry:
1499
1500.. doctest::
1501   :options: +NORMALIZE_WHITESPACE
1502
1503   >>> entries = re.split("\n+", text)
1504   >>> entries
1505   ['Ross McFluff: 834.345.1254 155 Elm Street',
1506   'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
1507   'Frank Burger: 925.541.7625 662 South Dogwood Way',
1508   'Heather Albrecht: 548.326.4584 919 Park Place']
1509
1510Finally, split each entry into a list with first name, last name, telephone
1511number, and address.  We use the ``maxsplit`` parameter of :func:`split`
1512because the address has spaces, our splitting pattern, in it:
1513
1514.. doctest::
1515   :options: +NORMALIZE_WHITESPACE
1516
1517   >>> [re.split(":? ", entry, 3) for entry in entries]
1518   [['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
1519   ['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
1520   ['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
1521   ['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]
1522
1523The ``:?`` pattern matches the colon after the last name, so that it does not
1524occur in the result list.  With a ``maxsplit`` of ``4``, we could separate the
1525house number from the street name:
1526
1527.. doctest::
1528   :options: +NORMALIZE_WHITESPACE
1529
1530   >>> [re.split(":? ", entry, 4) for entry in entries]
1531   [['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
1532   ['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
1533   ['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
1534   ['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]
1535
1536
1537Text Munging
1538^^^^^^^^^^^^
1539
1540:func:`sub` replaces every occurrence of a pattern with a string or the
1541result of a function.  This example demonstrates using :func:`sub` with
1542a function to "munge" text, or randomize the order of all the characters
1543in each word of a sentence except for the first and last characters::
1544
1545   >>> def repl(m):
1546   ...     inner_word = list(m.group(2))
1547   ...     random.shuffle(inner_word)
1548   ...     return m.group(1) + "".join(inner_word) + m.group(3)
1549   >>> text = "Professor Abdolmalek, please report your absences promptly."
1550   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
1551   'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
1552   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
1553   'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'
1554
1555
1556Finding all Adverbs
1557^^^^^^^^^^^^^^^^^^^
1558
1559:func:`findall` matches *all* occurrences of a pattern, not just the first
1560one as :func:`search` does.  For example, if a writer wanted to
1561find all of the adverbs in some text, they might use :func:`findall` in
1562the following manner::
1563
1564   >>> text = "He was carefully disguised but captured quickly by police."
1565   >>> re.findall(r"\w+ly", text)
1566   ['carefully', 'quickly']
1567
1568
1569Finding all Adverbs and their Positions
1570^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1571
1572If one wants more information about all matches of a pattern than the matched
1573text, :func:`finditer` is useful as it provides :ref:`match objects
1574<match-objects>` instead of strings.  Continuing with the previous example, if
1575a writer wanted to find all of the adverbs *and their positions* in
1576some text, they would use :func:`finditer` in the following manner::
1577
1578   >>> text = "He was carefully disguised but captured quickly by police."
1579   >>> for m in re.finditer(r"\w+ly", text):
1580   ...     print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
1581   07-16: carefully
1582   40-47: quickly
1583
1584
1585Raw String Notation
1586^^^^^^^^^^^^^^^^^^^
1587
1588Raw string notation (``r"text"``) keeps regular expressions sane.  Without it,
1589every backslash (``'\'``) in a regular expression would have to be prefixed with
1590another one to escape it.  For example, the two following lines of code are
1591functionally identical::
1592
1593   >>> re.match(r"\W(.)\1\W", " ff ")
1594   <re.Match object; span=(0, 4), match=' ff '>
1595   >>> re.match("\\W(.)\\1\\W", " ff ")
1596   <re.Match object; span=(0, 4), match=' ff '>
1597
1598When one wants to match a literal backslash, it must be escaped in the regular
1599expression.  With raw string notation, this means ``r"\\"``.  Without raw string
1600notation, one must use ``"\\\\"``, making the following lines of code
1601functionally identical::
1602
1603   >>> re.match(r"\\", r"\\")
1604   <re.Match object; span=(0, 1), match='\\'>
1605   >>> re.match("\\\\", r"\\")
1606   <re.Match object; span=(0, 1), match='\\'>
1607
1608
1609Writing a Tokenizer
1610^^^^^^^^^^^^^^^^^^^
1611
1612A `tokenizer or scanner <https://en.wikipedia.org/wiki/Lexical_analysis>`_
1613analyzes a string to categorize groups of characters.  This is a useful first
1614step in writing a compiler or interpreter.
1615
1616The text categories are specified with regular expressions.  The technique is
1617to combine those into a single master regular expression and to loop over
1618successive matches::
1619
1620    from typing import NamedTuple
1621    import re
1622
1623    class Token(NamedTuple):
1624        type: str
1625        value: str
1626        line: int
1627        column: int
1628
1629    def tokenize(code):
1630        keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
1631        token_specification = [
1632            ('NUMBER',   r'\d+(\.\d*)?'),  # Integer or decimal number
1633            ('ASSIGN',   r':='),           # Assignment operator
1634            ('END',      r';'),            # Statement terminator
1635            ('ID',       r'[A-Za-z]+'),    # Identifiers
1636            ('OP',       r'[+\-*/]'),      # Arithmetic operators
1637            ('NEWLINE',  r'\n'),           # Line endings
1638            ('SKIP',     r'[ \t]+'),       # Skip over spaces and tabs
1639            ('MISMATCH', r'.'),            # Any other character
1640        ]
1641        tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
1642        line_num = 1
1643        line_start = 0
1644        for mo in re.finditer(tok_regex, code):
1645            kind = mo.lastgroup
1646            value = mo.group()
1647            column = mo.start() - line_start
1648            if kind == 'NUMBER':
1649                value = float(value) if '.' in value else int(value)
1650            elif kind == 'ID' and value in keywords:
1651                kind = value
1652            elif kind == 'NEWLINE':
1653                line_start = mo.end()
1654                line_num += 1
1655                continue
1656            elif kind == 'SKIP':
1657                continue
1658            elif kind == 'MISMATCH':
1659                raise RuntimeError(f'{value!r} unexpected on line {line_num}')
1660            yield Token(kind, value, line_num, column)
1661
1662    statements = '''
1663        IF quantity THEN
1664            total := total + price * quantity;
1665            tax := price * 0.05;
1666        ENDIF;
1667    '''
1668
1669    for token in tokenize(statements):
1670        print(token)
1671
1672The tokenizer produces the following output::
1673
1674    Token(type='IF', value='IF', line=2, column=4)
1675    Token(type='ID', value='quantity', line=2, column=7)
1676    Token(type='THEN', value='THEN', line=2, column=16)
1677    Token(type='ID', value='total', line=3, column=8)
1678    Token(type='ASSIGN', value=':=', line=3, column=14)
1679    Token(type='ID', value='total', line=3, column=17)
1680    Token(type='OP', value='+', line=3, column=23)
1681    Token(type='ID', value='price', line=3, column=25)
1682    Token(type='OP', value='*', line=3, column=31)
1683    Token(type='ID', value='quantity', line=3, column=33)
1684    Token(type='END', value=';', line=3, column=41)
1685    Token(type='ID', value='tax', line=4, column=8)
1686    Token(type='ASSIGN', value=':=', line=4, column=12)
1687    Token(type='ID', value='price', line=4, column=15)
1688    Token(type='OP', value='*', line=4, column=21)
1689    Token(type='NUMBER', value=0.05, line=4, column=23)
1690    Token(type='END', value=';', line=4, column=27)
1691    Token(type='ENDIF', value='ENDIF', line=5, column=4)
1692    Token(type='END', value=';', line=5, column=9)
1693
1694
1695.. [Frie09] Friedl, Jeffrey. Mastering Regular Expressions. 3rd ed., O'Reilly
1696   Media, 2009. The third edition of the book no longer covers Python at all,
1697   but the first edition covered writing good regular expression patterns in
1698   great detail.
1699