=head1 NAME

perlreref - Perl Regular Expressions Reference

=head1 DESCRIPTION

This is a quick reference to Perl's regular expressions.
For full information see L<perlre> and L<perlop>, as well
as the L</"SEE ALSO"> section in this document.

=head2 OPERATORS

C<=~> determines to which variable the regex is applied.
In its absence, $_ is used.

    $var =~ /foo/;

C<!~> determines to which variable the regex is applied,
and negates the result of the match; it returns
false if the match succeeds, and true if it fails.

    $var !~ /foo/;
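
As a minimal sketch of the two binding operators (the variable names and
sample strings are made up for illustration), the following exercises
C<=~>, C<!~>, and the implicit match against C<$_>:

```perl
use strict;
use warnings;

my $var = "seafood";              # hypothetical sample data

my $bound   = $var =~ /foo/;      # true: "seafood" contains "foo"
my $negated = $var !~ /foo/;      # false: the match succeeded

my $implicit;
for my $item ("barfoo") {
    local $_ = $item;             # make $item the default variable
    $implicit = /foo/;            # no =~, so $_ is matched
}

print "bound\n"    if $bound;
print "negated\n"  if $negated;
print "implicit\n" if $implicit;
```

Only the first and third messages are printed, since C<!~> inverts the
result of a successful match.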

C<m/pattern/msixpogcdualn> searches a string for a pattern match,
applying the given options.

    m  Multiline mode - ^ and $ match internal lines
    s  match as a Single line - . matches \n
    i  case-Insensitive
    x  eXtended legibility - free whitespace and comments
    p  Preserve a copy of the matched string -
       ${^PREMATCH}, ${^MATCH}, ${^POSTMATCH} will be defined.
    o  compile pattern Once
    g  Global - all occurrences
    c  don't reset pos on failed matches when using /g
    a  restrict \d, \s, \w and [:posix:] to match ASCII only
    aa (two a's) as /a, and /i never matches ASCII with non-ASCII
    l  match according to current locale
    u  match according to Unicode rules
    d  match according to native rules unless something indicates
       Unicode
    n  Non-capture mode. Don't let () fill in $1, $2, etc...

If 'pattern' is an empty string, the last I<successfully> matched
regex is used. Delimiters other than '/' may be used for both this
operator and the following ones. The leading C<m> can be omitted
if the delimiter is '/'.
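
A short sketch of several modifiers working together may help here. The
sample text and variable names are assumptions chosen for illustration:

```perl
use strict;
use warnings;

my $text = "Alpha beta\nGAMMA delta\n";   # hypothetical sample data

# i ignores case, m lets ^ match after each internal newline, g in list
# context returns every match, and x permits the layout used below.
my @first_words = $text =~ /
    ^ ( [a-z]+ )    # a run of letters at the start of a line
/gimx;

# In scalar context, g walks the string one match at a time.
my $count = 0;
$count++ while $text =~ /a/gi;

print "@first_words\n";   # Alpha GAMMA
print "$count\n";         # 6
```

In list context C</g> returns all captures at once; in scalar context it
advances C<pos> on each call, which is what drives the C<while> loop.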

C<qr/pattern/msixpodualn> lets you store a regex in a variable,
or pass one around. The modifiers are as for C<m//> (except C</g>
and C</c>) and are stored within the regex.
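
For instance, a compiled regex can be kept in a scalar and reused or
embedded in a larger pattern. This is only a sketch; C<$word> and the
sample strings are hypothetical:

```perl
use strict;
use warnings;

my $word = qr/colou?r/i;            # the i modifier travels with the regex

# Use the stored regex directly on the right-hand side of =~.
my @hits = grep { $_ =~ $word } ("Color", "COLOUR", "collar");

# Interpolate it into a larger pattern; its own modifiers keep applying
# to its part of the pattern only.
my $is_pair = "colour Color" =~ /^$word $word\z/;

print "@hits\n";                    # Color COLOUR
print "pair\n" if $is_pair;
```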

C<s/pattern/replacement/msixpodualngcer> substitutes matches of
'pattern' with 'replacement'. The modifiers are as for C<m//>,
with two additions:

    e  Evaluate 'replacement' as an expression
    r  Return substitution and leave the original string untouched.

'e' may be specified multiple times. 'replacement' is interpreted
as a double quoted string unless a single-quote (C<'>) is the delimiter.
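
The two extra modifiers and the single-quote delimiter can be sketched
as follows, using made-up data and variable names:

```perl
use strict;
use warnings;

my $order = "3 apples and 12 pears";            # hypothetical sample data

# e evaluates the replacement as code; g applies it to every match.
(my $doubled = $order) =~ s/(\d+)/$1 * 2/ge;

# r returns the changed copy and leaves $order alone.
my $shouted = $order =~ s/apples/APPLES/r;

# With ' as the delimiter the replacement is not interpolated.
(my $label = "cost") =~ s'cost'$1.00';

print "$doubled\n";    # 6 apples and 24 pears
print "$shouted\n";    # 3 APPLES and 12 pears
print "$label\n";      # $1.00
```

The C<(my $copy = $original) =~ s///> idiom predates C</r> and is shown
only to contrast the two ways of keeping the original intact.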

C<m?pattern?> is like C<m/pattern/> but matches only once. No alternate
delimiters can be used. Once it has matched, it will not match again
until reset() is called.
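
A small sketch of the match-once behaviour, with hypothetical loop data:
the same C<m??> operator succeeds on the first suitable string of each
pass and stays silent until C<reset> re-arms it.

```perl
use strict;
use warnings;

my @seen;
for my $pass (1, 2) {
    for my $line ("alpha", "beta", "gamma") {
        # All three strings end in "a", but only the first one matches.
        push @seen, "$pass:$line" if $line =~ m?a$?;
    }
    reset;    # re-arm every m?? in the current package
}

print "@seen\n";    # 1:alpha 2:alpha
```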

=head2 SYNTAX

 \       Escapes the character immediately following it
 .       Matches any single character except a newline (unless /s is
           used)
 ^       Matches at the beginning of the string (or line, if /m is used)
 $       Matches at the end of the string (or line, if /m is used)
 *       Matches the preceding element 0 or more times
 +       Matches the preceding element 1 or more times
 ?       Matches the preceding element 0 or 1 times
 {...}   Specifies a range of occurrences for the element preceding it
 [...]   Matches any one of the characters contained within the brackets
 (...)   Groups subexpressions for capturing to $1, $2...
 (?:...) Groups subexpressions without capturing (cluster)
 |       Matches either the subexpression preceding or following it
 \g1 or \g{1}, \g2 ...    Matches the text from the Nth group
 \1, \2, \3 ...           Matches the text from the Nth group
 \g-1 or \g{-1}, \g-2 ... Matches the text from the Nth previous group
 \g{name}     Named backreference
 \k<name>     Named backreference
 \k'name'     Named backreference
 (?P=name)    Named backreference (python syntax)
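
The backreference forms listed above can be sketched like this. The
strings and group names are invented for the example:

```perl
use strict;
use warnings;

# \1 repeats whatever group 1 matched: here, the same tag name.
my ($tag) = "<b>bold</b>" =~ m{<(\w+)>.*</\1>};

# \g{-1} is relative: the group that closed most recently before it.
my ($twice) = "this is is a test" =~ /\b(\w+) \g{-1}\b/;

# \k<name> refers to a named group, so the closing quote must be the
# same character as the opening one.
my @strings;
my $source = q{say "hello" or 'bye'};          # hypothetical sample data
while ($source =~ /(?<quote>["'])(?<body>.*?)\k<quote>/g) {
    push @strings, $+{body};
}

print "$tag\n";        # b
print "$twice\n";      # is
print "@strings\n";    # hello bye
```

Relative and named references keep working when groups are added in
front of them, which is why they are often preferred in longer patterns.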

=head2 ESCAPE SEQUENCES

These work as in normal strings.

   \a       Alarm (beep)
   \e       Escape
   \f       Formfeed
   \n       Newline
   \r       Carriage return
   \t       Tab
   \037     Char whose ordinal is the 3 octal digits, max \777
   \o{2307} Char whose ordinal is the octal number, unrestricted
   \x7f     Char whose ordinal is the 2 hex digits, max \xFF
   \x{263a} Char whose ordinal is the hex number, unrestricted
   \cx      Control-x
   \N{name} A named Unicode character or character sequence
   \N{U+263D} A Unicode character by hex ordinal

   \l  Lowercase next character
   \u  Titlecase next character
   \L  Lowercase until \E
   \U  Uppercase until \E
   \F  Foldcase until \E
   \Q  Disable pattern metacharacters until \E
   \E  End modification

For Titlecase, see L</Titlecase>.

This one works differently from normal strings:

   \b  An assertion, not backspace, except in a character class
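
A brief sketch of C<\Q...\E>, the case escapes, and the two meanings of
C<\b>. The input string stands in for untrusted user data and is purely
illustrative:

```perl
use strict;
use warnings;

my $input = "a.b*c";                          # hypothetical user input

# Interpolated as-is, . and * keep their regex meaning.
my $loose  = "aXbbc" =~ /^$input\z/;          # true
# Between \Q and \E every metacharacter is taken literally.
my $strict = "aXbbc" =~ /^\Q$input\E\z/;      # false
my $exact  = "a.b*c" =~ /^\Q$input\E\z/;      # true

# Case escapes behave as they do in double-quoted strings.
my $name  = "perl";
my $shout = "\U$name\E rocks";                # "PERL rocks"
my $title = "\u$name";                        # "Perl"

# \b is backspace only inside a character class.
my $in_class  = "a\bz" =~ /a[\b]z/;           # true: matches the backspace
my $assertion = "a\bz" =~ /a\bz/;             # false: \b is a word boundary
```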

=head2 CHARACTER CLASSES

   [amy]    Match 'a', 'm' or 'y'
   [f-j]    Dash specifies "range"
   [f-j-]   Dash escaped or at start or end means 'dash'
   [^f-j]   Caret indicates "match any character _except_ these"

The following sequences (except C<\N>) work both inside and outside a
character class. The first six are locale aware, all are Unicode aware.
See L<perllocale> and L<perlunicode> for details.

   \d      A digit
   \D      A nondigit
   \w      A word character
   \W      A non-word character
   \s      A whitespace character
   \S      A non-whitespace character
   \h      A horizontal whitespace
   \H      A non horizontal whitespace
   \N      A non newline (when not followed by '{NAME}';
           not valid in a character class; equivalent to [^\n]; it's
           like '.' without /s modifier)
   \v      A vertical whitespace
   \V      A non vertical whitespace
   \R      A generic newline           (?>\v|\x0D\x0A)

   \pP     Match P-named (Unicode) property
   \p{...} Match Unicode property with name longer than 1 character
   \PP     Match non-P
   \P{...} Match lack of Unicode property with name longer than 1 char
   \X      Match Unicode extended grapheme cluster
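
Some of these shorthands in action, as a sketch over invented sample
data:

```perl
use strict;
use warnings;

my $line = "id=42\ttotal= 7\r\n";            # hypothetical sample data

my @numbers = $line =~ /\d+/g;               # ("42", "7")
my @columns = split /\h+/, "a \t b";         # ("a", "b")
my ($first) = "ab\ncd" =~ /(\N+)/;           # "ab": \N stops at the newline

# \R treats \r\n as a single line ending.
(my $chomped = $line) =~ s/\R\z//;

# \p{Lu} matches uppercase letters in any script; U+0394 is the Greek
# capital delta.
my @capitals = "\x{394}elta and Beta" =~ /\p{Lu}/g;
```

After this runs, C<$chomped> no longer ends in a line terminator and
C<@capitals> holds two characters, the delta and the C<B>.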

POSIX character classes and their Unicode and Perl equivalents:

            ASCII-         Full-
   POSIX    range          range    backslash
 [[:...:]]  \p{...}        \p{...}   sequence    Description

 -----------------------------------------------------------------------
 alnum   PosixAlnum       XPosixAlnum            'alpha' plus 'digit'
 alpha   PosixAlpha       XPosixAlpha            Alphabetic characters
 ascii   ASCII                                   Any ASCII character
 blank   PosixBlank       XPosixBlank   \h       Horizontal whitespace;
                                                   full-range also
                                                   written as
                                                   \p{HorizSpace}
 cntrl   PosixCntrl       XPosixCntrl            Control characters
 digit   PosixDigit       XPosixDigit   \d       Decimal digits
 graph   PosixGraph       XPosixGraph            'alnum' plus 'punct'
 lower   PosixLower       XPosixLower            Lowercase characters
 print   PosixPrint       XPosixPrint            'graph' plus 'space',
                                                   but not any Controls
 punct   PosixPunct       XPosixPunct            Punctuation and Symbols
                                                   in ASCII-range; just
                                                   punct outside it
 space   PosixSpace       XPosixSpace   \s       Whitespace
 upper   PosixUpper       XPosixUpper            Uppercase characters
 word    PosixWord        XPosixWord    \w       'alnum' + Unicode marks
                                                    + connectors, like
                                                    '_' (Perl extension)
 xdigit  ASCII_Hex_Digit  XPosixXDigit           Hexadecimal digit,
                                                    ASCII-range is
                                                    [0-9A-Fa-f]

Also, various synonyms like C<\p{Alpha}> for C<\p{XPosixAlpha}>; all listed
in L<perluniprops/Properties accessible through \p{} and \P{}>.

Within a character class:

    POSIX      traditional   Unicode
  [:digit:]       \d        \p{Digit}
  [:^digit:]      \D        \P{Digit}
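
The difference between the ASCII-range and full-range columns can be
sketched with a string that mixes an ASCII digit and a non-ASCII one.
The sample string is an assumption made for the example:

```perl
use strict;
use warnings;

# "x", ASCII "1", and U+0663 ARABIC-INDIC DIGIT THREE
my $s = "x1\x{663}";

my @full   = $s =~ /[[:digit:]]/g;      # 2 hits under Unicode rules
my @ascii  = $s =~ /[[:digit:]]/ga;     # 1 hit: the a modifier restricts it
my @posix  = $s =~ /\p{PosixDigit}/g;   # 1 hit: ASCII-range property
my @others = $s =~ /[[:^digit:]]/g;     # 1 hit: the "x"

printf "%d %d %d %d\n",
    scalar @full, scalar @ascii, scalar @posix, scalar @others;   # 2 1 1 1
```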

=head2 ANCHORS

All are zero-width assertions.

   ^  Match string start (or line, if /m is used)
   $  Match string end (or line, if /m is used) or before newline
   \b{} Match boundary of type specified within the braces
   \B{} Match wherever \b{} doesn't match
   \b Match word boundary (between \w and \W)
   \B Match except at word boundary (between \w and \w or \W and \W)
   \A Match string start (regardless of /m)
   \Z Match string end (before optional newline)
   \z Match absolute string end
   \G Match where previous m//g left off
   \K Keep the stuff left of the \K, don't include it in $&
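
A sketch contrasting several anchors. The text, path, and token labels
are hypothetical:

```perl
use strict;
use warnings;

my $text = "one\ntwo\n";                     # hypothetical sample data

my @string_start = $text =~ /^(\w+)/g;       # ("one"): ^ is string start
my @line_starts  = $text =~ /^(\w+)$/mg;     # ("one", "two") under /m
my $before_nl    = $text =~ /two\Z/;         # true: \Z allows a final \n
my $very_end     = $text =~ /two\z/;         # false: \z does not

# \K drops everything matched so far from what gets replaced.
(my $path = "/usr/local/bin") =~ s{.*/\Kbin\z}{sbin};

# \G with /gc resumes where the last match on $_ ended: a tiny tokenizer.
my @tokens;
my $src = "12+3";
for ($src) {
    while (1) {
        if    (/\G(\d+)/gc)   { push @tokens, "NUM:$1" }
        elsif (/\G([-+*])/gc) { push @tokens, "OP:$1" }
        else                  { last }
    }
}

print "$path\n";      # /usr/local/sbin
print "@tokens\n";    # NUM:12 OP:+ NUM:3
```

The C</c> modifier matters in the tokenizer: without it, the first
failed alternative would reset C<pos> and the scan would start over.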

=head2 QUANTIFIERS

Quantifiers are greedy by default: starting from the leftmost position
that can match, each one takes as B<many> repetitions as the rest of the
pattern allows.

   Maximal Minimal Possessive Allowed range
   ------- ------- ---------- -------------
   {n,m}   {n,m}?  {n,m}+     Must occur at least n times
                              but no more than m times
   {n,}    {n,}?   {n,}+      Must occur at least n times
   {,n}    {,n}?   {,n}+      Must occur at most n times
   {n}     {n}?    {n}+       Must occur exactly n times
   *       *?      *+         0 or more times (same as {0,})
   +       +?      ++         1 or more times (same as {1,})
   ?       ??      ?+         0 or 1 time (same as {0,1})

The possessive forms (new in Perl 5.10) prevent backtracking: what gets
matched by a pattern with a possessive quantifier will not be backtracked
into, even if that causes the whole match to fail.
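
The three columns of the table can be compared side by side. This is a
sketch over made-up strings:

```perl
use strict;
use warnings;

my $html = "<b>one</b><b>two</b>";           # hypothetical sample data

my ($greedy) = $html =~ m{<b>(.+)</b>};      # runs to the last </b>
my ($lazy)   = $html =~ m{<b>(.+?)</b>};     # stops at the first </b>

# a+ can give one "a" back so the final "a" matches; a++ cannot.
my $backtracks = "aaa" =~ /^a+a\z/;          # true
my $possessive = "aaa" =~ /^a++a\z/;         # false

my $bounded = "2024-1-05" =~ /^\d{4}-\d{1,2}-\d{1,2}\z/;   # true

print "$greedy\n";    # one</b><b>two
print "$lazy\n";      # one
```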

=head2 EXTENDED CONSTRUCTS

   (?#text)          A comment
   (?:...)           Groups subexpressions without capturing (cluster)
   (?pimsx-imsx:...) Enable/disable option (as per m// modifiers)
   (?=...)           Zero-width positive lookahead assertion
   (*pla:...)        Same, starting in 5.32; experimentally in 5.28
   (*positive_lookahead:...) Same, same versions as *pla
   (?!...)           Zero-width negative lookahead assertion
   (*nla:...)        Same, starting in 5.32; experimentally in 5.28
   (*negative_lookahead:...) Same, same versions as *nla
   (?<=...)          Zero-width positive lookbehind assertion
   (*plb:...)        Same, starting in 5.32; experimentally in 5.28
   (*positive_lookbehind:...) Same, same versions as *plb
   (?<!...)          Zero-width negative lookbehind assertion
   (*nlb:...)        Same, starting in 5.32; experimentally in 5.28
   (*negative_lookbehind:...) Same, same versions as *nlb
   (?>...)           Grab what we can, prohibit backtracking
   (*atomic:...)     Same, starting in 5.32; experimentally in 5.28
   (?|...)           Branch reset
   (?<name>...)      Named capture
   (?'name'...)      Named capture
   (?P<name>...)     Named capture (python syntax)
   (?[...])          Extended bracketed character class
   (?{ code })       Embedded code, return value becomes $^R
   (??{ code })      Dynamic regex, return value used as regex
   (?N)              Recurse into subpattern number N
   (?-N), (?+N)      Recurse into Nth previous/next subpattern
   (?R), (?0)        Recurse at the beginning of the whole pattern
   (?&name)          Recurse into a named subpattern
   (?P>name)         Recurse into a named subpattern (python syntax)
   (?(cond)yes|no)
   (?(cond)yes)      Conditional expression, where "(cond)" can be:
                     (?=pat)   lookahead; also (*pla:pat)
                               (*positive_lookahead:pat)
                     (?!pat)   negative lookahead; also (*nla:pat)
                               (*negative_lookahead:pat)
                     (?<=pat)  lookbehind; also (*plb:pat)
                               (*positive_lookbehind:pat)
                     (?<!pat)  negative lookbehind; also (*nlb:pat)
                               (*negative_lookbehind:pat)
                     (N)       subpattern N has matched something
                     (<name>)  named subpattern has matched something
                     ('name')  named subpattern has matched something
                     (?{code}) code condition
                     (R)       true if recursing
                     (RN)      true if recursing into Nth subpattern
                     (R&name)  true if recursing into named subpattern
                     (DEFINE)  always false, no no-pattern allowed
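
A few of the constructs above, sketched with invented data. The patterns
are illustrations rather than recommended production regexes:

```perl
use strict;
use warnings;

# Lookbehind plus lookahead: insert a comma wherever a digit precedes
# and a multiple of three digits follows, up to the end of the string.
(my $grouped = "1234567") =~ s/(?<=\d)(?=(?:\d{3})+\z)/,/g;

# Named captures land in %+ as well as in $1, $2, ...
my %date;
%date = %+ if "2024-01-05" =~ /(?<y>\d{4})-(?<m>\d\d)-(?<d>\d\d)/;

# Branch reset: both alternatives number their group as $1.
my @hours = map { /^(?|(\d+)h|(\d+) hours)\z/ ? $1 : () }
            ("12h", "7 hours");

# Recursion: (?1) re-enters group 1 to match nested parentheses.
my $balanced = qr/^(\((?:[^()]++|(?1))*\))\z/;

# Conditional: the closing quote is required only if group 1 matched.
my $word = qr/^(")?\w+(?(1)")\z/;

print "$grouped\n";                               # 1,234,567
print "$date{y}/$date{m}/$date{d}\n";             # 2024/01/05
print "@hours\n";                                 # 12 7
print "balanced\n" if "(a(b)c)" =~ $balanced;
print "quoted\n"   if '"abc"'   =~ $word;
```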

=head2 VARIABLES

   $_    Default variable for operators to use

   $`    Everything prior to matched string
   $&    Entire matched string
   $'    Everything after matched string

   ${^PREMATCH}   Everything prior to matched string
   ${^MATCH}      Entire matched string
   ${^POSTMATCH}  Everything after matched string

Note to those still using Perl 5.18 or earlier:
The use of C<$`>, C<$&> or C<$'> will slow down B<all> regex use
within your program. Consult L<perlvar> for C<@->
to see equivalent expressions that won't cause a slowdown.
See also L<Devel::SawAmpersand>. Starting with Perl 5.10, you
can also use the equivalent variables C<${^PREMATCH}>, C<${^MATCH}>
and C<${^POSTMATCH}>, but for them to be defined, you have to
specify the C</p> (preserve) modifier on your regular expression.
In Perl 5.20, the use of C<$`>, C<$&> and C<$'> makes no speed difference.

   $1, $2 ...  hold the Xth captured expr
   $+    Last parenthesized pattern match
   $^N   Holds the most recently closed capture
   $^R   Holds the result of the last (?{...}) expr
   @-    Offsets of starts of groups. $-[0] holds start of whole match
   @+    Offsets of ends of groups. $+[0] holds end of whole match
   %+    Named capture groups
   %-    Named capture groups, as array refs

Captured groups are numbered according to their I<opening> paren.
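
The match variables can be inspected together after a single match. The
string and the hash keys below are hypothetical:

```perl
use strict;
use warnings;

my $str = "key=value; other";      # hypothetical sample data
my %seen;

if ($str =~ /(?<k>\w+)=(?<v>\w+)/p) {
    %seen = (
        pre    => ${^PREMATCH},     # "" (the match starts at offset 0)
        match  => ${^MATCH},        # "key=value"
        post   => ${^POSTMATCH},    # "; other"
        first  => $1,               # "key"
        named  => $+{v},            # "value"
        final  => $+,               # "value", the last group that matched
        start2 => $-[2],            # 4, where group 2 begins
        end2   => $+[2],            # 9, just past where group 2 ends
        # The whole match rebuilt from offsets, without touching $&
        whole  => substr($str, $-[0], $+[0] - $-[0]),
    );
}

print "$seen{match} [$seen{start2},$seen{end2})\n";   # key=value [4,9)
```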

=head2 FUNCTIONS

   lc          Lowercase a string
   lcfirst     Lowercase first char of a string
   uc          Uppercase a string
   ucfirst     Titlecase first char of a string
   fc          Foldcase a string

   pos         Return or set current match position
   quotemeta   Quote metacharacters
   reset       Reset m?pattern? status
   study       Analyze string for optimizing matching

   split       Use a regex to split a string into parts

The first five of these are like the escape sequences C<\L>, C<\l>,
C<\U>, C<\u>, and C<\F>.  For Titlecase, see L</Titlecase>; for
Foldcase, see L</Foldcase>.
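
A sketch of C<pos>, C<split>, and C<quotemeta>, using an invented
string:

```perl
use strict;
use warnings;

my $s = "aXbXc";                     # hypothetical sample data

my $after_first;
$after_first = pos($s) if $s =~ /X/g;   # 2: just past the first X
pos($s) = 0;                            # rewind the /g search

my @parts = split /X/, $s;           # separators dropped
my @kept  = split /(X)/, $s;         # captured separators are returned too

# quotemeta backslashes every non-word character, like \Q...\E does.
my $dot = quotemeta("a.b");
my $literal_hit = "a.b" =~ /^$dot\z/;   # true
my $wild_hit    = "axb" =~ /^$dot\z/;   # false: the dot is now literal

print "@parts\n";    # a b c
print "@kept\n";     # a X b X c
```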

=head2 TERMINOLOGY

=head3 Titlecase

Unicode concept which most often is equal to uppercase, but for
certain characters like the German "sharp s" there is a difference.

=head3 Foldcase

Unicode form that is useful when comparing strings regardless of case,
as certain characters have complex one-to-many case mappings. Primarily a
variant of lowercase.
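
Both terms can be illustrated briefly. This sketch assumes Perl 5.16 or
later, where C<use v5.16> enables C<fc> and Unicode string semantics;
the sample words are made up:

```perl
use v5.16;          # enables fc and the unicode_strings feature
use warnings;

my $dz = "\x{1C6}";                  # LATIN SMALL LETTER DZ WITH CARON
my $upper = uc $dz;                  # U+01C4, both letters capital
my $title = ucfirst $dz;             # U+01C5, only the D capital

my $sharp = ucfirst "\x{DF}";        # sharp s titlecases to "Ss"

# Foldcase makes the caseless comparison that lc alone gets wrong here.
my $same_fc = fc("Stra\x{DF}e") eq fc("STRASSE");   # true
my $same_lc = lc("Stra\x{DF}e") eq lc("STRASSE");   # false
```

Comparing C<fc> results is the usual way to test two strings for
caseless equality, since a one-to-many mapping such as sharp s to "ss"
is lost with C<lc> or C<uc>.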

=head1 AUTHOR

Iain Truskett. Updated by the Perl 5 Porters.

This document may be distributed under the same terms as Perl itself.

=head1 SEE ALSO

=over 4

=item *

L<perlretut> for a tutorial on regular expressions.

=item *

L<perlrequick> for a rapid tutorial.

=item *

L<perlre> for more details.

=item *

L<perlvar> for details on the variables.

=item *

L<perlop> for details on the operators.

=item *

L<perlfunc> for details on the functions.

=item *

L<perlfaq6> for FAQs on regular expressions.

=item *

L<perlrebackslash> for a reference on backslash sequences.

=item *

L<perlrecharclass> for a reference on character classes.

=item *

The L<re> module to alter behaviour and aid
debugging.

=item *

L<perldebug/"Debugging Regular Expressions">

=item *

L<perluniintro>, L<perlunicode>, L<charnames> and L<perllocale>
for details on regexes and internationalisation.

=item *

I<Mastering Regular Expressions> by Jeffrey Friedl
(L<https://www.oreilly.com/library/view/-/0596528124/>) for a thorough grounding and
reference on the topic.

=back

=head1 THANKS

David P.C. Wollmann,
Richard Soderberg,
Sean M. Burke,
Tom Christiansen,
Jim Cromie,
and
Jeffrey Goff
for useful advice.

=cut