1=head1 NAME
2
3perlfaq6 - Regular Expressions
4
5=head1 VERSION
6
7version 5.20240218
8
9=head1 DESCRIPTION
10
11This section is surprisingly small because the rest of the FAQ is
12littered with answers involving regular expressions. For example,
13decoding a URL and checking whether something is a number can be handled
14with regular expressions, but those answers are found elsewhere in
15this document (in L<perlfaq9>: "How do I decode or create those %-encodings
16on the web" and L<perlfaq4>: "How do I determine whether a scalar is
17a number/whole/integer/float", to be precise).
18
19=head2 How can I hope to use regular expressions without creating illegible and unmaintainable code?
20X<regex, legibility> X<regexp, legibility>
21X<regular expression, legibility> X</x>
22
23Three techniques can make regular expressions maintainable and
24understandable.
25
26=over 4
27
28=item Comments Outside the Regex
29
30Describe what you're doing and how you're doing it, using normal Perl
31comments.
32
33    # turn the line into the first word, a colon, and the
34    # number of characters on the rest of the line
35    s/^(\w+)(.*)/ lc($1) . ":" . length($2) /meg;
36
37=item Comments Inside the Regex
38
The C</x> modifier causes whitespace to be ignored in a regex pattern
(except in a character class and a few other places), and also allows you
to use normal comments there. As you can imagine, whitespace and comments
help a lot.
43
44C</x> lets you turn this:
45
46    s{<(?:[^>'"]*|".*?"|'.*?')+>}{}gs;
47
48into this:
49
50    s{ <                    # opening angle bracket
        (?:                 # non-capturing group
52            [^>'"] *        # 0 or more things that are neither > nor ' nor "
53                |           #    or else
54            ".*?"           # a section between double quotes (stingy match)
55                |           #    or else
56            '.*?'           # a section between single quotes (stingy match)
57        ) +                 #   all occurring one or more times
58        >                   # closing angle bracket
59    }{}gsx;                 # replace with nothing, i.e. delete
60
61It's still not quite so clear as prose, but it is very useful for
62describing the meaning of each part of the pattern.
63
64=item Different Delimiters
65
66While we normally think of patterns as being delimited with C</>
67characters, they can be delimited by almost any character. L<perlre>
68describes this. For example, the C<s///> above uses braces as
69delimiters. Selecting another delimiter can avoid quoting the
70delimiter within the pattern:
71
72    s/\/usr\/local/\/usr\/share/g;    # bad delimiter choice
73    s#/usr/local#/usr/share#g;        # better
74
75Using logically paired delimiters can be even more readable:
76
    s{/usr/local}{/usr/share}g;       # better still
78
79=back
80
81=head2 I'm having trouble matching over more than one line. What's wrong?
82X<regex, multiline> X<regexp, multiline> X<regular expression, multiline>
83
84Either you don't have more than one line in the string you're looking
85at (probably), or else you aren't using the correct modifier(s) on
86your pattern (possibly).
87
88There are many ways to get multiline data into a string. If you want
89it to happen automatically while reading input, you'll want to set $/
90(probably to '' for paragraphs or C<undef> for the whole file) to
91allow you to read more than one line at a time.
92
93Read L<perlre> to help you decide which of C</s> and C</m> (or both)
94you might want to use: C</s> allows dot to include newline, and C</m>
95allows caret and dollar to match next to a newline, not just at the
96end of the string. You do need to make sure that you've actually
97got a multiline string in there.
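
As a minimal sketch of the difference (the variable names are only
illustrative), the following applies the same patterns with and without
each modifier to a two-line string:

    my $text = "first line\nsecond line\n";

    # Without /s the dot stops at the newline; with /s it crosses it.
    my $dot_plain = $text =~ /first.*second/  ? 1 : 0;    # 0
    my $dot_s     = $text =~ /first.*second/s ? 1 : 0;    # 1

    # Without /m the caret matches only at the start of the string;
    # with /m it also matches just after the embedded newline.
    my @plain = $text =~ /^(\w+)/g;     # ("first")
    my @multi = $text =~ /^(\w+)/gm;    # ("first", "second")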
98
99For example, this program detects duplicate words, even when they span
100line breaks (but not paragraph ones). For this example, we don't need
101C</s> because we aren't using dot in a regular expression that we want
102to cross line boundaries. Neither do we need C</m> because we don't
103want caret or dollar to match at any point inside the record next
104to newlines. But it's imperative that $/ be set to something other
105than the default, or else we won't actually ever have a multiline
106record read in.
107
108    $/ = '';          # read in whole paragraph, not just one line
109    while ( <> ) {
        while ( /\b([\w'-]+)(\s+\g1)+\b/gi ) {     # a word, then repeats of it
111            print "Duplicate $1 at paragraph $.\n";
112        }
113    }
114
Here's some code that finds lines that begin with "From " (which would
be mangled by many mailers):
117
118    $/ = '';          # read in whole paragraph, not just one line
119    while ( <> ) {
120        while ( /^From /gm ) { # /m makes ^ match next to \n
            print "leading From in paragraph $.\n";
122        }
123    }
124
Here's code that finds everything between START and END in a file:
126
127    undef $/;          # read in whole file, not just one line or paragraph
128    while ( <> ) {
129        while ( /START(.*?)END/sgm ) { # /s makes . cross line boundaries
130            print "$1\n";
131        }
132    }
133
134=head2 How can I pull out lines between two patterns that are themselves on different lines?
135X<..>
136
137You can use Perl's somewhat exotic C<..> operator (documented in
138L<perlop>):
139
140    perl -ne 'print if /START/ .. /END/' file1 file2 ...
141
142If you wanted text and not lines, you would use
143
144    perl -0777 -ne 'print "$1\n" while /START(.*?)END/gs' file1 file2 ...
145
146But if you want nested occurrences of C<START> through C<END>, you'll
147run up against the problem described in the question in this section
148on matching balanced text.
149
150Here's another example of using C<..>:
151
152    while (<>) {
153        my $in_header =   1  .. /^$/;
154        my $in_body   = /^$/ .. eof;
        # now choose between them
156    } continue {
157        $. = 0 if eof;    # fix $.
158    }
159
160=head2 How do I match XML, HTML, or other nasty, ugly things with a regex?
161X<regex, XML> X<regex, HTML> X<XML> X<HTML> X<pain> X<frustration>
162X<sucking out, will to live>
163
164Do not use regexes. Use a module and forget about the
165regular expressions. The L<XML::LibXML>, L<HTML::TokeParser> and
166L<HTML::TreeBuilder> modules are good starts, although each namespace
167has other parsing modules specialized for certain tasks and different
168ways of doing it. Start at CPAN Search ( L<http://metacpan.org/> )
169and wonder at all the work people have done for you already! :)
170
171=head2 I put a regular expression into $/ but it didn't work. What's wrong?
172X<$/, regexes in> X<$INPUT_RECORD_SEPARATOR, regexes in>
173X<$RS, regexes in>
174
175$/ has to be a string. You can use these examples if you really need to
176do this.
177
178If you have L<File::Stream>, this is easy.
179
180    use File::Stream;
181
182    my $stream = File::Stream->new(
183        $filehandle,
184        separator => qr/\s*,\s*/,
185        );
186
187    print "$_\n" while <$stream>;
188
189If you don't have File::Stream, you have to do a little more work.
190
191You can use the four-argument form of sysread to continually add to
192a buffer. After you add to the buffer, you check if you have a
193complete line (using your regular expression).
194
195    local $_ = "";
196    while( sysread FH, $_, 8192, length ) {
197        while( s/^((?s).*?)your_pattern// ) {
198            my $record = $1;
199            # do stuff here.
200        }
201    }
202
203You can do the same thing with foreach and a match using the
204c flag and the \G anchor, if you do not mind your entire file
205being in memory at the end.
206
207    local $_ = "";
208    while( sysread FH, $_, 8192, length ) {
209        foreach my $record ( m/\G((?s).*?)your_pattern/gc ) {
210            # do stuff here.
211        }
212        substr( $_, 0, pos ) = "" if pos;
213    }
214
215
216=head2 How do I substitute case-insensitively on the LHS while preserving case on the RHS?
217X<replace, case preserving> X<substitute, case preserving>
218X<substitution, case preserving> X<s, case preserving>
219
220Here's a lovely Perlish solution by Larry Rosler. It exploits
221properties of bitwise xor on ASCII strings.
222
223    $_= "this is a TEsT case";
224
225    $old = 'test';
226    $new = 'success';
227
228    s{(\Q$old\E)}
229    { uc $new | (uc $1 ^ $1) .
230        (uc(substr $1, -1) ^ substr $1, -1) x
231        (length($new) - length $1)
232    }egi;
233
234    print;
235
236And here it is as a subroutine, modeled after the above:
237
238    sub preserve_case {
239        my ($old, $new) = @_;
240        my $mask = uc $old ^ $old;
241
242        uc $new | $mask .
243            substr($mask, -1) x (length($new) - length($old))
244    }
245
246    $string = "this is a TEsT case";
247    $string =~ s/(test)/preserve_case($1, "success")/egi;
248    print "$string\n";
249
250This prints:
251
252    this is a SUcCESS case
253
254As an alternative, to keep the case of the replacement word if it is
255longer than the original, you can use this code, by Jeff Pinyan:
256
257    sub preserve_case {
258        my ($from, $to) = @_;
259        my ($lf, $lt) = map length, @_;
260
261        if ($lt < $lf) { $from = substr $from, 0, $lt }
262        else { $from .= substr $to, $lf }
263
264        return uc $to | ($from ^ uc $from);
265    }
266
267This changes the sentence to "this is a SUcCess case."
268
269Just to show that C programmers can write C in any programming language,
270if you prefer a more C-like solution, the following script makes the
271substitution have the same case, letter by letter, as the original.
272(It also happens to run about 240% slower than the Perlish solution runs.)
273If the substitution has more characters than the string being substituted,
274the case of the last character is used for the rest of the substitution.
275
276    # Original by Nathan Torkington, massaged by Jeffrey Friedl
277    #
278    sub preserve_case
279    {
280        my ($old, $new) = @_;
281        my $state = 0; # 0 = no change; 1 = lc; 2 = uc
282        my ($i, $oldlen, $newlen, $c) = (0, length($old), length($new));
283        my $len = $oldlen < $newlen ? $oldlen : $newlen;
284
285        for ($i = 0; $i < $len; $i++) {
286            if ($c = substr($old, $i, 1), $c =~ /[\W\d_]/) {
287                $state = 0;
288            } elsif (lc $c eq $c) {
289                substr($new, $i, 1) = lc(substr($new, $i, 1));
290                $state = 1;
291            } else {
292                substr($new, $i, 1) = uc(substr($new, $i, 1));
293                $state = 2;
294            }
295        }
296        # finish up with any remaining new (for when new is longer than old)
297        if ($newlen > $oldlen) {
298            if ($state == 1) {
299                substr($new, $oldlen) = lc(substr($new, $oldlen));
300            } elsif ($state == 2) {
301                substr($new, $oldlen) = uc(substr($new, $oldlen));
302            }
303        }
304        return $new;
305    }
306
307=head2 How can I make C<\w> match national character sets?
308X<\w>
309
310Put C<use locale;> in your script. The \w character class is taken
311from the current locale.
312
313See L<perllocale> for details.
314
315=head2 How can I match a locale-smart version of C</[a-zA-Z]/>?
316X<alpha>
317
318You can use the POSIX character class syntax C</[[:alpha:]]/>
319documented in L<perlre>.
320
321No matter which locale you are in, the alphabetic characters are
322the characters in \w without the digits and the underscore.
323As a regex, that looks like C</[^\W\d_]/>. Its complement,
324the non-alphabetics, is then everything in \W along with
325the digits and the underscore, or C</[\W\d_]/>.
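
A small sketch of both classes at work, using plain ASCII sample text so
that the result does not depend on the locale (the variable names are
illustrative):

    my $text = "abc_123 x9 Perl";

    # runs of alphabetic characters
    my @alpha_runs = $text =~ /([^\W\d_]+)/g;    # ("abc", "x", "Perl")

    # runs of everything else
    my @other_runs = $text =~ /([\W\d_]+)/g;     # ("_123 ", "9 ")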
326
327=head2 How can I quote a variable to use in a regex?
328X<regex, escaping> X<regexp, escaping> X<regular expression, escaping>
329
330The Perl parser will expand $variable and @variable references in
331regular expressions unless the delimiter is a single quote. Remember,
332too, that the right-hand side of a C<s///> substitution is considered
a double-quoted string (see L<perlop> for more details). Remember
also that any regex special characters in an interpolated variable will
be acted on unless you precede the variable with C<\Q>. Here's an example:
336
337    $string = "Placido P. Octopus";
338    $regex  = "P.";
339
340    $string =~ s/$regex/Polyp/;
341    # $string is now "Polypacido P. Octopus"
342
Because C<.> is special in regular expressions, and can match any
single character, the regex C<P.> here has matched the C<Pl> in the
original string.
346
347To escape the special meaning of C<.>, we use C<\Q>:
348
349    $string = "Placido P. Octopus";
350    $regex  = "P.";
351
352    $string =~ s/\Q$regex/Polyp/;
353    # $string is now "Placido Polyp Octopus"
354
355The use of C<\Q> causes the C<.> in the regex to be treated as a
356regular character, so that C<P.> matches a C<P> followed by a dot.
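
The C<\Q> escape is implemented by the C<quotemeta> function (see
L<perlfunc>), so you can also quote the string ahead of time. A short
sketch, in which C<$quoted> and C<$next_word> are illustrative names:

    my $regex  = "P.";
    my $quoted = quotemeta $regex;      # the string 'P\.'

    my $string = "Placido P. Octopus";
    $string =~ s/$quoted/Polyp/;        # "Placido Polyp Octopus"

    # \E ends the quoting, so the rest of the pattern keeps its
    # special meaning
    my ($next_word) = "Placido P. Octopus" =~ /\Q$regex\E\s+(\w+)/;
    # $next_word is "Octopus"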
357
358=head2 What is C</o> really for?
359X</o, regular expressions> X<compile, regular expressions>
360
361(contributed by brian d foy)
362
363The C</o> option for regular expressions (documented in L<perlop> and
364L<perlreref>) tells Perl to compile the regular expression only once.
365This is only useful when the pattern contains a variable. Perls 5.6
366and later handle this automatically if the pattern does not change.
367
368Since the match operator C<m//>, the substitution operator C<s///>,
369and the regular expression quoting operator C<qr//> are double-quotish
370constructs, you can interpolate variables into the pattern. See the
371answer to "How can I quote a variable to use in a regex?" for more
372details.
373
374This example takes a regular expression from the argument list and
375prints the lines of input that match it:
376
377    my $pattern = shift @ARGV;
378
379    while( <> ) {
380        print if m/$pattern/;
381    }
382
383Versions of Perl prior to 5.6 would recompile the regular expression
384for each iteration, even if C<$pattern> had not changed. The C</o>
385would prevent this by telling Perl to compile the pattern the first
386time, then reuse that for subsequent iterations:
387
388    my $pattern = shift @ARGV;
389
390    while( <> ) {
391        print if m/$pattern/o; # useful for Perl < 5.6
392    }
393
394In versions 5.6 and later, Perl won't recompile the regular expression
395if the variable hasn't changed, so you probably don't need the C</o>
396option. It doesn't hurt, but it doesn't help either. If you want any
397version of Perl to compile the regular expression only once even if
398the variable changes (thus, only using its initial value), you still
399need the C</o>.
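
To see that last point in action, here is a small sketch (the array
names are illustrative) that matches the same string against a changing
variable with and without C</o>:

    my ( @with_o, @without_o );

    foreach my $pattern ( qw(a b) ) {
        push @with_o,    ( "b" =~ m/$pattern/o ? 1 : 0 );
        push @without_o, ( "b" =~ m/$pattern/  ? 1 : 0 );
    }

    # @with_o    is (0, 0): the pattern stays fixed at "a"
    # @without_o is (0, 1): the pattern follows $pattern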
400
401You can watch Perl's regular expression engine at work to verify for
402yourself if Perl is recompiling a regular expression. The C<use re
403'debug'> pragma (comes with Perl 5.005 and later) shows the details.
With Perls before 5.6, you should see C<re> reporting that it's
compiling the regular expression on each iteration. With Perl 5.6 or
later, you should only see C<re> report that for the first iteration.
407
408    use re 'debug';
409
410    my $regex = 'Perl';
411    foreach ( qw(Perl Java Ruby Python) ) {
412        print STDERR "-" x 73, "\n";
413        print STDERR "Trying $_...\n";
414        print STDERR "\t$_ is good!\n" if m/$regex/;
415    }
416
417=head2 How do I use a regular expression to strip C-style comments from a file?
418
419While this actually can be done, it's much harder than you'd think.
420For example, this one-liner
421
422    perl -0777 -pe 's{/\*.*?\*/}{}gs' foo.c
423
424will work in many but not all cases. You see, it's too simple-minded for
425certain kinds of C programs, in particular, those with what appear to be
426comments in quoted strings. For that, you'd need something like this,
427created by Jeffrey Friedl and later modified by Fred Curtis.
428
429    $/ = undef;
430    $_ = <>;
431    s#/\*[^*]*\*+([^/*][^*]*\*+)*/|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $2 ? $2 : ""#gse;
432    print;
433
434This could, of course, be more legibly written with the C</x> modifier, adding
435whitespace and comments. Here it is expanded, courtesy of Fred Curtis.
436
437    s{
438       /\*         ##  Start of /* ... */ comment
439       [^*]*\*+    ##  Non-* followed by 1-or-more *'s
440       (
441         [^/*][^*]*\*+
442       )*          ##  0-or-more things which don't start with /
443                   ##    but do end with '*'
444       /           ##  End of /* ... */ comment
445
446     |         ##     OR  various things which aren't comments:
447
448       (
449         "           ##  Start of " ... " string
450         (
451           \\.           ##  Escaped char
452         |               ##    OR
453           [^"\\]        ##  Non "\
454         )*
455         "           ##  End of " ... " string
456
457       |         ##     OR
458
459         '           ##  Start of ' ... ' string
460         (
461           \\.           ##  Escaped char
462         |               ##    OR
463           [^'\\]        ##  Non '\
464         )*
465         '           ##  End of ' ... ' string
466
467       |         ##     OR
468
         .           ##  Any other char
         [^/"'\\]*   ##  Chars which don't start a comment, string or escape
471       )
472     }{defined $2 ? $2 : ""}gxse;
473
474A slight modification also removes C++ comments, possibly spanning multiple lines
475using a continuation character:
476
477 s#/\*[^*]*\*+([^/*][^*]*\*+)*/|//([^\\]|[^\n][\n]?)*?\n|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $3 ? $3 : ""#gse;
478
479=head2 Can I use Perl regular expressions to match balanced text?
X<regex, matching balanced text> X<regexp, matching balanced text>
X<regular expression, matching balanced text> X<possessive> X<PARNO>
482X<Text::Balanced> X<Regexp::Common> X<backtracking> X<recursion>
483
484(contributed by brian d foy)
485
Your first try should probably be the L<Text::Balanced> module, which
has been in the Perl standard library since Perl 5.8. It has a variety of
488functions to deal with tricky text. The L<Regexp::Common> module can
489also help by providing canned patterns you can use.
490
491As of Perl 5.10, you can match balanced text with regular expressions
492using recursive patterns. Before Perl 5.10, you had to resort to
493various tricks such as using Perl code in C<(??{})> sequences.
494
495Here's an example using a recursive regular expression. The goal is to
496capture all of the text within angle brackets, including the text in
497nested angle brackets. This sample text has two "major" groups: a
498group with one level of nesting and a group with two levels of
499nesting. There are five total groups in angle brackets:
500
501    I have some <brackets in <nested brackets> > and
502    <another group <nested once <nested twice> > >
503    and that's it.
504
505The regular expression to match the balanced text uses two new (to
506Perl 5.10) regular expression features. These are covered in L<perlre>
507and this example is a modified version of one in that documentation.
508
509First, adding the new possessive C<+> to any quantifier finds the
510longest match and does not backtrack. That's important since you want
511to handle any angle brackets through the recursion, not backtracking.
512The group C<< [^<>]++ >> finds one or more non-angle brackets without
513backtracking.
514
515Second, the new C<(?PARNO)> refers to the sub-pattern in the
516particular capture group given by C<PARNO>. In the following regex,
517the first capture group finds (and remembers) the balanced text, and
518you need that same pattern within the first buffer to get past the
519nested text. That's the recursive part. The C<(?1)> uses the pattern
520in the outer capture group as an independent part of the regex.
521
522Putting it all together, you have:
523
524    #!/usr/local/bin/perl5.10.0
525
526    my $string =<<"HERE";
527    I have some <brackets in <nested brackets> > and
528    <another group <nested once <nested twice> > >
529    and that's it.
530    HERE
531
532    my @groups = $string =~ m/
533            (                   # start of capture group 1
534            <                   # match an opening angle bracket
535                (?:
536                    [^<>]++     # one or more non angle brackets, non backtracking
537                      |
538                    (?1)        # found < or >, so recurse to capture group 1
539                )*
540            >                   # match a closing angle bracket
541            )                   # end of capture group 1
542            /xg;
543
544    $" = "\n\t";
545    print "Found:\n\t@groups\n";
546
547The output shows that Perl found the two major groups:
548
549    Found:
550        <brackets in <nested brackets> >
551        <another group <nested once <nested twice> > >
552
553With a little extra work, you can get all of the groups in angle
554brackets even if they are in other angle brackets too. Each time you
555get a balanced match, remove its outer delimiter (that's the one you
556just matched so don't match it again) and add it to a queue of strings
557to process. Keep doing that until you get no matches:
558
559    #!/usr/local/bin/perl5.10.0
560
561    my @queue =<<"HERE";
562    I have some <brackets in <nested brackets> > and
563    <another group <nested once <nested twice> > >
564    and that's it.
565    HERE
566
567    my $regex = qr/
568            (                   # start of bracket 1
569            <                   # match an opening angle bracket
570                (?:
571                    [^<>]++     # one or more non angle brackets, non backtracking
572                      |
573                    (?1)        # recurse to bracket 1
574                )*
575            >                   # match a closing angle bracket
576            )                   # end of bracket 1
577            /x;
578
579    $" = "\n\t";
580
581    while( @queue ) {
582        my $string = shift @queue;
583
584        my @groups = $string =~ m/$regex/g;
585        print "Found:\n\t@groups\n\n" if @groups;
586
587        unshift @queue, map { s/^<//; s/>$//; $_ } @groups;
588    }
589
590The output shows all of the groups. The outermost matches show up
591first and the nested matches show up later:
592
593    Found:
594        <brackets in <nested brackets> >
595        <another group <nested once <nested twice> > >
596
597    Found:
598        <nested brackets>
599
600    Found:
601        <nested once <nested twice> >
602
603    Found:
604        <nested twice>
605
606=head2 What does it mean that regexes are greedy? How can I get around it?
607X<greedy> X<greediness>
608
609Most people mean that greedy regexes match as much as they can.
610Technically speaking, it's actually the quantifiers (C<?>, C<*>, C<+>,
611C<{}>) that are greedy rather than the whole pattern; Perl prefers local
612greed and immediate gratification to overall greed. To get non-greedy
613versions of the same quantifiers, use (C<??>, C<*?>, C<+?>, C<{}?>).
614
615An example:
616
617    my $s1 = my $s2 = "I am very very cold";
618    $s1 =~ s/ve.*y //;      # I am cold
619    $s2 =~ s/ve.*?y //;     # I am very cold
620
621Notice how the second substitution stopped matching as soon as it
622encountered "y ". The C<*?> quantifier effectively tells the regular
623expression engine to find a match as quickly as possible and pass
624control on to whatever is next in line, as you would if you were
625playing hot potato.
626
627=head2 How do I process each word on each line?
628X<word>
629
630Use the split function:
631
632    while (<>) {
633        foreach my $word ( split ) {
634            # do something with $word here
635        }
636    }
637
638Note that this isn't really a word in the English sense; it's just
639chunks of consecutive non-whitespace characters.
640
641To work with only alphanumeric sequences (including underscores), you
642might consider
643
644    while (<>) {
        foreach my $word (m/(\w+)/g) {
646            # do something with $word here
647        }
648    }
649
650=head2 How can I print out a word-frequency or line-frequency summary?
651
652To do this, you have to parse out each word in the input stream. We'll
653pretend that by word you mean chunk of alphabetics, hyphens, or
654apostrophes, rather than the non-whitespace chunk idea of a word given
655in the previous question:
656
657    my (%seen);
658    while (<>) {
        while ( /(\b[^\W_\d][\w'-]*\b)/g ) {   # misses "`sheep'"
660            $seen{$1}++;
661        }
662    }
663
664    while ( my ($word, $count) = each %seen ) {
665        print "$count $word\n";
666    }
667
668If you wanted to do the same thing for lines, you wouldn't need a
669regular expression:
670
671    my (%seen);
672
673    while (<>) {
674        $seen{$_}++;
675    }
676
677    while ( my ($line, $count) = each %seen ) {
678        print "$count $line";
679    }
680
681If you want these output in a sorted order, see L<perlfaq4>: "How do I
682sort a hash (optionally by value instead of key)?".
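
As a quick sketch of one such ordering, assuming a C<%seen> hash like
the ones built above (the sample counts here are made up), you could
print the most frequent entries first:

    my %seen = ( the => 3, cat => 1, sat => 2 );    # made-up sample counts

    # highest count first; ties broken alphabetically
    my @ordered = sort { $seen{$b} <=> $seen{$a} or $a cmp $b } keys %seen;

    foreach my $word ( @ordered ) {
        print "$seen{$word} $word\n";
    }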
683
684=head2 How can I do approximate matching?
685X<match, approximate> X<matching, approximate>
686
687See the module L<String::Approx> available from CPAN.
688
689=head2 How do I efficiently match many regular expressions at once?
690X<regex, efficiency> X<regexp, efficiency>
691X<regular expression, efficiency>
692
693(contributed by brian d foy)
694
You want to avoid compiling a regular expression every time you want to
match it. In this example, perl must recompile the regular expression for
every iteration of the C<foreach> loop since C<$pattern> can change:
699
700    my @patterns = qw( fo+ ba[rz] );
701
702    LINE: while( my $line = <> ) {
703        foreach my $pattern ( @patterns ) {
704            if( $line =~ m/\b$pattern\b/i ) {
705                print $line;
706                next LINE;
707            }
708        }
709    }
710
711The C<qr//> operator compiles a regular
712expression, but doesn't apply it. When you use the pre-compiled
713version of the regex, perl does less work. In this example, I inserted
714a C<map> to turn each pattern into its pre-compiled form. The rest of
715the script is the same, but faster:
716
717    my @patterns = map { qr/\b$_\b/i } qw( fo+ ba[rz] );
718
719    LINE: while( my $line = <> ) {
720        foreach my $pattern ( @patterns ) {
721            if( $line =~ m/$pattern/ ) {
722                print $line;
723                next LINE;
724            }
725        }
726    }
727
728In some cases, you may be able to make several patterns into a single
729regular expression. Beware of situations that require backtracking
730though. In this example, the regex is only compiled once because
731C<$regex> doesn't change between iterations:
732
733    my $regex = join '|', qw( fo+ ba[rz] );
734
735    while( my $line = <> ) {
        print $line if $line =~ m/\b(?:$regex)\b/i;
737    }
738
739The function L<Data::Munge/list2re> on CPAN can also be used to form
740a single regex that matches a list of literal strings (not regexes).
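
If you only have core Perl, a rough equivalent for literal strings can
be sketched with C<quotemeta>. This is not how that module works
internally, only a hand-rolled approximation with illustrative names.
Longer strings go first so that a shorter alternative cannot win over a
longer one that starts at the same position:

    my @literals = ( 'a.b', 'a', 'c++' );

    my $alternation = join '|',
        map  { quotemeta($_) }                    # no metacharacters survive
        sort { length($b) <=> length($a) }        # longest first
        @literals;

    my $regex = qr/(?:$alternation)/;

    my @found = 'x a.b y c++ z axb' =~ /($regex)/g;
    # @found is ("a.b", "c++", "a")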
741
742For more details on regular expression efficiency, see I<Mastering
743Regular Expressions> by Jeffrey Friedl. He explains how the regular
744expressions engine works and why some patterns are surprisingly
745inefficient. Once you understand how perl applies regular expressions,
746you can tune them for individual situations.
747
748=head2 Why don't word-boundary searches with C<\b> work for me?
749X<\b>
750
751(contributed by brian d foy)
752
753Ensure that you know what \b really does: it's the boundary between a
754word character, \w, and something that isn't a word character. That
755thing that isn't a word character might be \W, but it can also be the
756start or end of the string.
757
758It's not (not!) the boundary between whitespace and non-whitespace,
759and it's not the stuff between words we use to create sentences.
760
761In regex speak, a word boundary (\b) is a "zero width assertion",
762meaning that it doesn't represent a character in the string, but a
763condition at a certain position.
764
For the regular expression, /\bPerl\b/, there has to be a word
boundary before the "P" and after the "l". As long as something other
than a word character precedes the "P" and follows the "l", the
pattern will match. These strings match /\bPerl\b/.
769
770    "Perl"    # no word char before "P" or after "l"
771    "Perl "   # same as previous (space is not a word char)
772    "'Perl'"  # the "'" char is not a word char
773    "Perl's"  # no word char before "P", non-word char after "l"
774
775These strings do not match /\bPerl\b/.
776
777    "Perl_"   # "_" is a word char!
778    "Perler"  # no word char before "P", but one after "l"
779
780You don't have to use \b to match words though. You can look for
781non-word characters surrounded by word characters. These strings
782match the pattern /\b'\b/.
783
784    "don't"   # the "'" char is surrounded by "n" and "t"
785    "qep'a'"  # the "'" char is surrounded by "p" and "a"
786
787These strings do not match /\b'\b/.
788
789    "foo'"    # there is no word char after non-word "'"
790
791You can also use the complement of \b, \B, to specify that there
792should not be a word boundary.
793
In the pattern /\Bam\B/, there must be a word character before the "a"
and after the "m". These strings match /\Bam\B/:

    "llama"   # "am" surrounded by word chars
    "Samuel"  # same

These strings do not match /\Bam\B/:
801
802    "Sam"      # no word boundary before "a", but one after "m"
803    "I am Sam" # "am" surrounded by non-word chars
804
805
806=head2 Why does using $&, $`, or $' slow my program down?
807X<$MATCH> X<$&> X<$POSTMATCH> X<$'> X<$PREMATCH> X<$`>
808
809(contributed by Anno Siegel)
810
811Once Perl sees that you need one of these variables anywhere in the
812program, it provides them on each and every pattern match. That means
813that on every pattern match the entire string will be copied, part of it
814to $`, part to $&, and part to $'. Thus the penalty is most severe with
815long strings and patterns that match often. Avoid $&, $', and $` if you
816can, but if you can't, once you've used them at all, use them at will
817because you've already paid the price. Remember that some algorithms
818really appreciate them. As of the 5.005 release, the $& variable is no
819longer "expensive" the way the other two are.
820
Since Perl 5.6.1 the special variables C<@-> and C<@+> can functionally
replace $`, $& and $'. These arrays contain the offsets of the beginning
and end of each match (see L<perlvar> for the full story), so they give you
essentially the same information, but without the risk of excessive
string copying.
826
827Perl 5.10 added three specials, C<${^MATCH}>, C<${^PREMATCH}>, and
828C<${^POSTMATCH}> to do the same job but without the global performance
829penalty. Perl 5.10 only sets these variables if you compile or execute the
830regular expression with the C</p> modifier.
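
Both alternatives can be sketched side by side. The sample string and
variable names are illustrative; the C<substr> expressions follow the
equivalences given in L<perlvar>:

    my $string = "The quick brown fox";

    # using the match offsets
    my ( $pre, $match, $post );
    if ( $string =~ /quick/ ) {
        $pre   = substr( $string, 0, $-[0] );                # like $`
        $match = substr( $string, $-[0], $+[0] - $-[0] );    # like $&
        $post  = substr( $string, $+[0] );                   # like $'
    }

    # using /p and the Perl 5.10 variables
    my ( $p_pre, $p_match, $p_post );
    if ( $string =~ /quick/p ) {
        ( $p_pre, $p_match, $p_post )
            = ( ${^PREMATCH}, ${^MATCH}, ${^POSTMATCH} );
    }

    # both sets are "The ", "quick", and " brown fox"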
831
832=head2 What good is C<\G> in a regular expression?
833X<\G>
834
835You use the C<\G> anchor to start the next match on the same
836string where the last match left off. The regular
837expression engine cannot skip over any characters to find
838the next match with this anchor, so C<\G> is similar to the
839beginning of string anchor, C<^>. The C<\G> anchor is typically
840used with the C<g> modifier. It uses the value of C<pos()>
841as the position to start the next match. As the match
842operator makes successive matches, it updates C<pos()> with the
843position of the next character past the last match (or the
844first character of the next match, depending on how you like
845to look at it). Each string has its own C<pos()> value.
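
To see that bookkeeping at work, here is a small sketch (the strings
and variable names are invented for the example) that matches against
two strings and inspects C<pos()> for each:

    use strict;
    use warnings;

    my $first  = "aaaa";
    my $second = "bbbbbb";

    $first  =~ m/aa/g;     # leaves pos($first)  at 2
    $second =~ m/bbb/g;    # leaves pos($second) at 3
    my @after_first = ( pos($first), pos($second) );

    $first =~ m/\Gaa/g;    # picks up at offset 2 in $first only
    my @after_second = ( pos($first), pos($second) );

    print "@after_first then @after_second\n";

The second match on C<$first> moves only that string's position; the
position stored for C<$second> is untouched.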
846
Suppose you want to match all consecutive pairs of digits
848in a string like "1122a44" and stop matching when you
849encounter non-digits. You want to match C<11> and C<22> but
850the letter C<a> shows up between C<22> and C<44> and you want
851to stop at C<a>. Simply matching pairs of digits skips over
852the C<a> and still matches C<44>.
853
854    $_ = "1122a44";
855    my @pairs = m/(\d\d)/g;   # qw( 11 22 44 )
856
857If you use the C<\G> anchor, you force the match after C<22> to
858start with the C<a>. The regular expression cannot match
859there since it does not find a digit, so the next match
860fails and the match operator returns the pairs it already
861found.
862
863    $_ = "1122a44";
864    my @pairs = m/\G(\d\d)/g; # qw( 11 22 )
865
866You can also use the C<\G> anchor in scalar context. You
867still need the C<g> modifier.
868
869    $_ = "1122a44";
870    while( m/\G(\d\d)/g ) {
871        print "Found $1\n";
872    }
873
874After the match fails at the letter C<a>, perl resets C<pos()>
875and the next match on the same string starts at the beginning.
876
877    $_ = "1122a44";
878    while( m/\G(\d\d)/g ) {
879        print "Found $1\n";
880    }
881
882    print "Found $1 after while" if m/(\d\d)/g; # finds "11"
883
884You can disable C<pos()> resets on fail with the C<c> modifier, documented
885in L<perlop> and L<perlreref>. Subsequent matches start where the last
886successful match ended (the value of C<pos()>) even if a match on the
887same string has failed in the meantime. In this case, the match after
888the C<while()> loop starts at the C<a> (where the last match stopped),
889and since it does not use any anchor it can skip over the C<a> to find
890C<44>.
891
892    $_ = "1122a44";
893    while( m/\G(\d\d)/gc ) {
894        print "Found $1\n";
895    }
896
897    print "Found $1 after while" if m/(\d\d)/g; # finds "44"
898
899Typically you use the C<\G> anchor with the C<c> modifier
900when you want to try a different match if one fails,
901such as in a tokenizer. Jeffrey Friedl offers this example
902which works in 5.004 or later.
903
904    while (<>) {
905        chomp;
906        PARSER: {
907            m/ \G( \d+\b    )/gcx   && do { print "number: $1\n";  redo; };
908            m/ \G( \w+      )/gcx   && do { print "word:   $1\n";  redo; };
909            m/ \G( \s+      )/gcx   && do { print "space:  $1\n";  redo; };
910            m/ \G( [^\w\d]+ )/gcx   && do { print "other:  $1\n";  redo; };
911        }
912    }
913
914For each line, the C<PARSER> loop first tries to match a series
915of digits followed by a word boundary. This match has to
916start at the place the last match left off (or the beginning
of the string on the first match). Since C<m/ \G( \d+\b )/gcx> uses
the C<c> modifier, if the string does not match that regular expression,
perl does not reset C<pos()> and the next match starts at the same
position to try a different pattern.
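
The same idea can collect tokens instead of printing them. This is
only a sketch, not Friedl's code: the input line, the C<TOKEN> label,
the C<@tokens> array, and the choice to skip whitespace and to use
C<[^\w\s]+> for "other" are all assumptions made for the example.

    use strict;
    use warnings;

    my $line = "width 42 +";
    my @tokens;

    for( $line ) {   # alias $_ so every match shares one pos()
        TOKEN: {
            m/ \G( \d+\b    )/gcx
                && do { push @tokens, "number:$1"; redo TOKEN; };
            m/ \G( \w+      )/gcx
                && do { push @tokens, "word:$1";   redo TOKEN; };
            m/ \G  \s+       /gcx
                && do { redo TOKEN; };             # skip whitespace
            m/ \G( [^\w\s]+ )/gcx
                && do { push @tokens, "other:$1";  redo TOKEN; };
        }
    }

    print join( ", ", @tokens ), "\n";

When none of the patterns match at the current position, control
falls out of the bare block and the loop ends.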
922
923=head2 Are Perl regexes DFAs or NFAs? Are they POSIX compliant?
924X<DFA> X<NFA> X<POSIX>
925
926While it's true that Perl's regular expressions resemble the DFAs
927(deterministic finite automata) of the egrep(1) program, they are in
928fact implemented as NFAs (non-deterministic finite automata) to allow
backtracking and backreferencing. And they aren't POSIX-style either,
because POSIX matching must always find the longest leftmost match, which
guarantees worst-case behavior for all cases. (It seems
931that some people prefer guarantees of consistency, even when what's
932guaranteed is slowness.) See the book "Mastering Regular Expressions"
933(from O'Reilly) by Jeffrey Friedl for all the details you could ever
934hope to know on these matters (a full citation appears in
935L<perlfaq2>).
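
One visible consequence of the backtracking design is that
alternation is ordered: Perl takes the first alternative that lets
the overall match succeed, where a POSIX-style engine would report
the longest one. A small sketch (the sample string is made up):

    use strict;
    use warnings;

    my $string = "foobar";

    # alternatives are tried left to right at each position
    my( $first_wins ) = $string =~ m/(foo|foobar)/;   # "foo"
    my( $reordered )  = $string =~ m/(foobar|foo)/;   # "foobar"

    print "$first_wins, $reordered\n";

If you want the longest alternative in Perl, put it first.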
936
937=head2 What's wrong with using grep in a void context?
938X<grep>
939
940The problem is that grep builds a return list, regardless of the context.
941This means you're making Perl go to the trouble of building a list that
942you then just throw away. If the list is large, you waste both time and space.
943If your intent is to iterate over the list, then use a for loop for this
944purpose.
945
946In perls older than 5.8.1, map suffers from this problem as well.
But since 5.8.1, this has been fixed, and map is context aware: in void
context, no lists are constructed.
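
A brief sketch of the difference (the word list and the C<%count>
hash are invented for the example):

    use strict;
    use warnings;

    my @words = qw( apple banana apple cherry );
    my %count;

    # Discouraged: grep used only for its side effect
    #     grep { $count{$_}++ } @words;

    # Preferred: a loop says what you mean
    $count{$_}++ for @words;

    # Use grep when you actually want the list it returns
    my @repeated = grep { $count{$_} > 1 } sort keys %count;

    print "repeated: @repeated\n";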
949
950=head2 How can I match strings with multibyte characters?
951X<regex, and multibyte characters> X<regexp, and multibyte characters>
952X<regular expression, and multibyte characters> X<martian> X<encoding, Martian>
953
Starting with Perl 5.6, Perl has had some level of multibyte character
955support. Perl 5.8 or later is recommended. Supported multibyte
956character repertoires include Unicode, and legacy encodings
957through the Encode module. See L<perluniintro>, L<perlunicode>,
958and L<Encode>.
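
As a rough sketch of the usual approach, assuming the input arrives
as UTF-8 encoded bytes (the sample data is made up), decode the bytes
into characters first and match afterwards:

    use strict;
    use warnings;
    use Encode qw(decode);

    # "cafe" with an e-acute, as UTF-8 encoded bytes
    my $bytes = "caf\xC3\xA9";
    my $chars = decode( 'UTF-8', $bytes );

    printf "%d bytes, %d characters\n", length $bytes, length $chars;

    # "." matches one byte in $bytes but one character in $chars
    my $bytes_match = $bytes =~ m/\Acaf.\z/ ? 1 : 0;
    my $chars_match = $chars =~ m/\Acaf.\z/ ? 1 : 0;

    print "bytes: $bytes_match, characters: $chars_match\n";

The pattern fails against the undecoded bytes because the accented
letter still occupies two positions there.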
959
960If you are stuck with older Perls, you can do Unicode with the
961L<Unicode::String> module, and character conversions using the
962L<Unicode::Map8> and L<Unicode::Map> modules. If you are using
Japanese encodings, you might try jperl 5.005_03.
964
965Finally, the following set of approaches was offered by Jeffrey
966Friedl, whose article in issue #5 of The Perl Journal talks about
967this very matter.
968
969Let's suppose you have some weird Martian encoding where pairs of
ASCII uppercase letters encode single Martian letters (e.g. the two
971bytes "CV" make a single Martian letter, as do the two bytes "SG",
972"VS", "XX", etc.). Other bytes represent single characters, just like
973ASCII.
974
975So, the string of Martian "I am CVSGXX!" uses 12 bytes to encode the
976nine characters 'I', ' ', 'a', 'm', ' ', 'CV', 'SG', 'XX', '!'.
977
978Now, say you want to search for the single character C</GX/>. Perl
979doesn't know about Martian, so it'll find the two bytes "GX" in the "I
980am CVSGXX!" string, even though that character isn't there: it just
981looks like it is because "SG" is next to "XX", but there's no real
982"GX". This is a big problem.
983
984Here are a few ways, all painful, to deal with it:
985
986    # Make sure adjacent "martian" bytes are no longer adjacent.
987    $martian =~ s/([A-Z][A-Z])/ $1 /g;
988
989    print "found GX!\n" if $martian =~ /GX/;
990
991Or like this:
992
    my @chars = $martian =~ m/([A-Z][A-Z]|[^A-Z])/g;
    # above is conceptually similar to:     my @chars = $text =~ m/(.)/g;
    #
    foreach my $char (@chars) {
        if ($char eq 'GX') {
            print "found GX!\n";
            last;   # as a print argument, last would exit before printing
        }
    }
999
1000Or like this:
1001
1002    while ($martian =~ m/\G([A-Z][A-Z]|.)/gs) {  # \G probably unneeded
1003        if ($1 eq 'GX') {
1004            print "found GX!\n";
1005            last;
1006        }
1007    }
1008
1009Here's another, slightly less painful, way to do it from Benjamin
1010Goldberg, who uses a zero-width negative look-behind assertion.
1011
1012    print "found GX!\n" if    $martian =~ m/
1013        (?<![A-Z])
1014        (?:[A-Z][A-Z])*?
1015        GX
1016        /x;
1017
This succeeds if the "martian" character GX is in the string, and fails
otherwise. If you don't like using C<< (?<!) >>, a zero-width negative
look-behind assertion, you can replace C<< (?<![A-Z]) >> with C<(?:^|[^A-Z])>.
1021
1022It does have the drawback of putting the wrong thing in $-[0] and $+[0],
1023but this usually can be worked around.
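
To try the look-behind version against both a hit and a miss, you
could wrap it in a subroutine. The name C<has_martian_gx> and the
second test string are hypothetical, chosen only for this sketch:

    use strict;
    use warnings;

    # hypothetical helper around the look-behind pattern
    sub has_martian_gx {
        my( $martian ) = @_;
        return $martian =~ m/
            (?<![A-Z])          # start where a Martian run begins
            (?:[A-Z][A-Z])*?    # skip whole two-byte characters
            GX                  # then the character we want
            /x ? 1 : 0;
    }

    my $miss = has_martian_gx( "I am CVSGXX!" );   # SG next to XX
    my $hit  = has_martian_gx( "I am CVGXSG!" );   # a real GX

    print "miss: $miss, hit: $hit\n";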
1024
1025=head2 How do I match a regular expression that's in a variable?
1026X<regex, in variable> X<eval> X<regex> X<quotemeta> X<\Q, regex>
1027X<\E, regex> X<qr//>
1028
1029(contributed by brian d foy)
1030
1031We don't have to hard-code patterns into the match operator (or
1032anything else that works with regular expressions). We can put the
1033pattern in a variable for later use.
1034
1035The match operator is a double quote context, so you can interpolate
1036your variable just like a double quoted string. In this case, you
1037read the regular expression as user input and store it in C<$regex>.
1038Once you have the pattern in C<$regex>, you use that variable in the
1039match operator.
1040
1041    chomp( my $regex = <STDIN> );
1042
1043    if( $string =~ m/$regex/ ) { ... }
1044
1045Any regular expression special characters in C<$regex> are still
1046special, and the pattern still has to be valid or Perl will complain.
1047For instance, in this pattern there is an unpaired parenthesis.
1048
1049    my $regex = "Unmatched ( paren";
1050
1051    "Two parens to bind them all" =~ m/$regex/;
1052
When Perl compiles the regular expression, it treats the parenthesis
as the start of a capture group. When it doesn't find the closing
parenthesis, it complains:
1056
1057    Unmatched ( in regex; marked by <-- HERE in m/Unmatched ( <-- HERE  paren/ at script line 3.
1058
You can get around this in several ways depending on your situation.
1060First, if you don't want any of the characters in the string to be
1061special, you can escape them with C<quotemeta> before you use the string.
1062
1063    chomp( my $regex = <STDIN> );
1064    $regex = quotemeta( $regex );
1065
1066    if( $string =~ m/$regex/ ) { ... }
1067
1068You can also do this directly in the match operator using the C<\Q>
1069and C<\E> sequences. The C<\Q> tells Perl where to start escaping
1070special characters, and the C<\E> tells it where to stop (see L<perlop>
1071for more details).
1072
1073    chomp( my $regex = <STDIN> );
1074
1075    if( $string =~ m/\Q$regex\E/ ) { ... }
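
A quick sketch of what the quoting changes, with a fixed string
standing in for the user input (the values are made up):

    use strict;
    use warnings;

    my $input  = "a.c";    # stands in for what the user typed
    my $string = "abc";

    # unquoted, the "." is a metacharacter and matches the "b"
    my $as_pattern = $string =~ m/$input/      ? 1 : 0;

    # quoted, the "." must match a literal dot
    my $as_literal = $string =~ m/\Q$input\E/  ? 1 : 0;

    my $quoted = quotemeta( $input );          # a\.c

    print "pattern: $as_pattern, literal: $as_literal, $quoted\n";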
1076
1077Alternately, you can use C<qr//>, the regular expression quote operator (see
1078L<perlop> for more details). It quotes and perhaps compiles the pattern,
1079and you can apply regular expression flags to the pattern.
1080
1081    chomp( my $input = <STDIN> );
1082
1083    my $regex = qr/$input/is;
1084
    $string =~ m/$regex/;  # same as m/$input/is
1086
1087You might also want to trap any errors by wrapping an C<eval> block
1088around the whole thing.
1089
1090    chomp( my $input = <STDIN> );
1091
1092    eval {
        if( $string =~ m/$input/ ) { ... }
1094    };
1095    warn $@ if $@;
1096
1097Or...
1098
1099    my $regex = eval { qr/$input/is };
1100    if( defined $regex ) {
1101        $string =~ m/$regex/;
1102    }
1103    else {
1104        warn $@;
1105    }
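
Putting those pieces together, here is one possible way to screen
patterns before using them. C<compile_pattern> is a hypothetical
helper written for this sketch, and the sample inputs are made up:

    use strict;
    use warnings;

    # hypothetical helper: a compiled regex, or undef on failure
    sub compile_pattern {
        my( $input ) = @_;
        my $regex = eval { qr/$input/ };
        return $regex;
    }

    my @inputs = ( "Unmatched ( paren", "paren(s)?" );
    my @valid  = grep { defined compile_pattern( $_ ) } @inputs;

    print "valid: $_\n" for @valid;

The first input fails to compile, so only the second one survives.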
1106
1107=head1 AUTHOR AND COPYRIGHT
1108
1109Copyright (c) 1997-2010 Tom Christiansen, Nathan Torkington, and
1110other authors as noted. All rights reserved.
1111
1112This documentation is free; you can redistribute it and/or modify it
1113under the same terms as Perl itself.
1114
1115Irrespective of its distribution, all code examples in this file
1116are hereby placed into the public domain. You are permitted and
1117encouraged to use this code in your own programs for fun
1118or for profit as you see fit. A simple comment in the code giving
1119credit would be courteous but is not required.
1120