1=head1 NAME
2
3perlfaq6 - Regular Expressions
4
5=head1 VERSION
6
7version 5.20200523
8
9=head1 DESCRIPTION
10
11This section is surprisingly small because the rest of the FAQ is
12littered with answers involving regular expressions. For example,
13decoding a URL and checking whether something is a number can be handled
14with regular expressions, but those answers are found elsewhere in
15this document (in L<perlfaq9>: "How do I decode or create those %-encodings
16on the web" and L<perlfaq4>: "How do I determine whether a scalar is
17a number/whole/integer/float", to be precise).
18
19=head2 How can I hope to use regular expressions without creating illegible and unmaintainable code?
20X<regex, legibility> X<regexp, legibility>
21X<regular expression, legibility> X</x>
22
23Three techniques can make regular expressions maintainable and
24understandable.
25
26=over 4
27
28=item Comments Outside the Regex
29
30Describe what you're doing and how you're doing it, using normal Perl
31comments.
32
33    # turn the line into the first word, a colon, and the
34    # number of characters on the rest of the line
35    s/^(\w+)(.*)/ lc($1) . ":" . length($2) /meg;
36
37=item Comments Inside the Regex
38
39The C</x> modifier causes whitespace to be ignored in a regex pattern
40(except in a character class and a few other places), and also allows you to
41use normal comments there, too. As you can imagine, whitespace and comments
42help a lot.
43
44C</x> lets you turn this:
45
46    s{<(?:[^>'"]*|".*?"|'.*?')+>}{}gs;
47
48into this:
49
50    s{ <                    # opening angle bracket
51        (?:                 # Non-backreffing grouping paren
52            [^>'"] *        # 0 or more things that are neither > nor ' nor "
53                |           #    or else
54            ".*?"           # a section between double quotes (stingy match)
55                |           #    or else
56            '.*?'           # a section between single quotes (stingy match)
57        ) +                 #   all occurring one or more times
58        >                   # closing angle bracket
59    }{}gsx;                 # replace with nothing, i.e. delete
60
It's still not quite as clear as prose, but it is very useful for
describing the meaning of each part of the pattern.
63
64=item Different Delimiters
65
66While we normally think of patterns as being delimited with C</>
67characters, they can be delimited by almost any character. L<perlre>
68describes this. For example, the C<s///> above uses braces as
69delimiters. Selecting another delimiter can avoid quoting the
70delimiter within the pattern:
71
72    s/\/usr\/local/\/usr\/share/g;    # bad delimiter choice
73    s#/usr/local#/usr/share#g;        # better
74
75Using logically paired delimiters can be even more readable:
76
    s{/usr/local}{/usr/share}g;       # better still
78
79=back
80
81=head2 I'm having trouble matching over more than one line. What's wrong?
82X<regex, multiline> X<regexp, multiline> X<regular expression, multiline>
83
84Either you don't have more than one line in the string you're looking
85at (probably), or else you aren't using the correct modifier(s) on
86your pattern (possibly).
87
88There are many ways to get multiline data into a string. If you want
89it to happen automatically while reading input, you'll want to set $/
90(probably to '' for paragraphs or C<undef> for the whole file) to
91allow you to read more than one line at a time.
92
93Read L<perlre> to help you decide which of C</s> and C</m> (or both)
94you might want to use: C</s> allows dot to include newline, and C</m>
95allows caret and dollar to match next to a newline, not just at the
96end of the string. You do need to make sure that you've actually
97got a multiline string in there.
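
As a small sketch of the difference (the sample string and variable
names are made up for illustration), the following checks one two-line
string with and without each modifier:

    use strict;
    use warnings;

    my $text = "first line\nsecond line\n";

    # Without /s, dot stops at the newline; with /s it crosses it.
    my $dot_plain = $text =~ /first.*second/  ? 1 : 0;
    my $dot_s     = $text =~ /first.*second/s ? 1 : 0;

    # Without /m, caret matches only at the start of the string;
    # with /m it also matches just after an embedded newline.
    my $caret_plain = $text =~ /^second/  ? 1 : 0;
    my $caret_m     = $text =~ /^second/m ? 1 : 0;

    print "dot: $dot_plain $dot_s, caret: $caret_plain $caret_m\n";
    # prints "dot: 0 1, caret: 0 1"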
98
99For example, this program detects duplicate words, even when they span
100line breaks (but not paragraph ones). For this example, we don't need
101C</s> because we aren't using dot in a regular expression that we want
102to cross line boundaries. Neither do we need C</m> because we don't
103want caret or dollar to match at any point inside the record next
104to newlines. But it's imperative that $/ be set to something other
105than the default, or else we won't actually ever have a multiline
106record read in.
107
108    $/ = '';          # read in whole paragraph, not just one line
109    while ( <> ) {
        while ( /\b([\w'-]+)(\s+\g1)+\b/gi ) {     # a word, then repeats of it
111            print "Duplicate $1 at paragraph $.\n";
112        }
113    }
114
Here's some code that finds lines that begin with "From " (which would
be mangled by many mailers):
117
118    $/ = '';          # read in whole paragraph, not just one line
119    while ( <> ) {
120        while ( /^From /gm ) { # /m makes ^ match next to \n
            print "leading From in paragraph $.\n";
122        }
123    }
124
Here's code that finds everything between START and END in a file:
126
127    undef $/;          # read in whole file, not just one line or paragraph
128    while ( <> ) {
129        while ( /START(.*?)END/sgm ) { # /s makes . cross line boundaries
130            print "$1\n";
131        }
132    }
133
134=head2 How can I pull out lines between two patterns that are themselves on different lines?
135X<..>
136
137You can use Perl's somewhat exotic C<..> operator (documented in
138L<perlop>):
139
140    perl -ne 'print if /START/ .. /END/' file1 file2 ...
141
142If you wanted text and not lines, you would use
143
144    perl -0777 -ne 'print "$1\n" while /START(.*?)END/gs' file1 file2 ...
145
146But if you want nested occurrences of C<START> through C<END>, you'll
147run up against the problem described in the question in this section
148on matching balanced text.
149
150Here's another example of using C<..>:
151
152    while (<>) {
153        my $in_header =   1  .. /^$/;
154        my $in_body   = /^$/ .. eof;
        # now choose between them
156    } continue {
157        $. = 0 if eof;    # fix $.
158    }
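
To see which lines C<..> selects without setting up input files, here
is a minimal sketch that runs the same kind of test over an in-memory
list (the sample lines are invented for illustration). Note that the
lines matching the start and end patterns are themselves included:

    use strict;
    use warnings;

    my @lines = ( "before\n", "START\n", "inside\n", "END\n", "after\n" );
    my @kept;

    for (@lines) {
        # the flip-flop turns true at /START/ and stays true through /END/
        push @kept, $_ if /START/ .. /END/;
    }

    print @kept;    # prints the START, inside, and END lines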
159
160=head2 How do I match XML, HTML, or other nasty, ugly things with a regex?
161X<regex, XML> X<regex, HTML> X<XML> X<HTML> X<pain> X<frustration>
162X<sucking out, will to live>
163
164Do not use regexes. Use a module and forget about the
165regular expressions. The L<XML::LibXML>, L<HTML::TokeParser> and
166L<HTML::TreeBuilder> modules are good starts, although each namespace
167has other parsing modules specialized for certain tasks and different
ways of doing it. Start at MetaCPAN ( L<http://metacpan.org/> )
169and wonder at all the work people have done for you already! :)
170
171=head2 I put a regular expression into $/ but it didn't work. What's wrong?
172X<$/, regexes in> X<$INPUT_RECORD_SEPARATOR, regexes in>
173X<$RS, regexes in>
174
175$/ has to be a string. You can use these examples if you really need to
176do this.
177
178If you have L<File::Stream>, this is easy.
179
180    use File::Stream;
181
182    my $stream = File::Stream->new(
183        $filehandle,
184        separator => qr/\s*,\s*/,
185        );
186
187    print "$_\n" while <$stream>;
188
189If you don't have File::Stream, you have to do a little more work.
190
191You can use the four-argument form of sysread to continually add to
192a buffer. After you add to the buffer, you check if you have a
193complete line (using your regular expression).
194
195    local $_ = "";
196    while( sysread FH, $_, 8192, length ) {
197        while( s/^((?s).*?)your_pattern// ) {
198            my $record = $1;
199            # do stuff here.
200        }
201    }
202
203You can do the same thing with foreach and a match using the
204c flag and the \G anchor, if you do not mind your entire file
205being in memory at the end.
206
207    local $_ = "";
208    while( sysread FH, $_, 8192, length ) {
209        foreach my $record ( m/\G((?s).*?)your_pattern/gc ) {
210            # do stuff here.
211        }
212        substr( $_, 0, pos ) = "" if pos;
213    }
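
The buffer-and-substitute approach can be wrapped up as follows. This
is only a sketch under a few assumptions: C<read_records> is a
hypothetical helper name, it uses C<read> rather than C<sysread> so
that it can be tried on an in-memory filehandle, and it treats whatever
is left in the buffer at end of file as a final record. Like the loops
above, it can be fooled when a separator happens to straddle the
boundary between two chunks.

    use strict;
    use warnings;

    # Hypothetical helper: return the records in $fh, split on $sep_re.
    sub read_records {
        my ( $fh, $sep_re ) = @_;
        my ( $buffer, @records ) = ('');

        # append up to 8192 bytes to the end of the buffer each time
        while ( read $fh, $buffer, 8192, length $buffer ) {
            # peel off every complete record now in the buffer
            while ( $buffer =~ s/\A(.*?)$sep_re//s ) {
                push @records, $1;
            }
        }

        # text after the last separator is a record too
        push @records, $buffer if length $buffer;
        return @records;
    }

    my $data = "alpha , beta,gamma ,\ndelta";
    open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

    my @records = read_records( $fh, qr/\s*,\s*/ );
    print "$_\n" for @records;    # alpha, beta, gamma, delta on separate lines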
214
215
216=head2 How do I substitute case-insensitively on the LHS while preserving case on the RHS?
217X<replace, case preserving> X<substitute, case preserving>
218X<substitution, case preserving> X<s, case preserving>
219
220Here's a lovely Perlish solution by Larry Rosler. It exploits
221properties of bitwise xor on ASCII strings.
222
223    $_= "this is a TEsT case";
224
225    $old = 'test';
226    $new = 'success';
227
228    s{(\Q$old\E)}
229    { uc $new | (uc $1 ^ $1) .
230        (uc(substr $1, -1) ^ substr $1, -1) x
231        (length($new) - length $1)
232    }egi;
233
234    print;
235
236And here it is as a subroutine, modeled after the above:
237
238    sub preserve_case {
239        my ($old, $new) = @_;
240        my $mask = uc $old ^ $old;
241
242        uc $new | $mask .
243            substr($mask, -1) x (length($new) - length($old))
244    }
245
246    $string = "this is a TEsT case";
247    $string =~ s/(test)/preserve_case($1, "success")/egi;
248    print "$string\n";
249
250This prints:
251
252    this is a SUcCESS case
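
The mask that both versions rely on can be seen in isolation in this
short sketch (the words are chosen only for illustration). It assumes
plain ASCII letters and that the C<bitwise> feature is not enabled, so
that C<^> and C<|> operate on strings:

    use strict;
    use warnings;

    my $word = 'TEsT';

    # xor with the upper-cased copy leaves "\0" where a letter was
    # already upper case and " " (0x20) where it was lower case
    my $mask = uc($word) ^ $word;

    # or-ing 0x20 into an upper-case ASCII letter makes it lower case
    my $result = uc('best') | $mask;

    print "$result\n";    # prints "BEsT"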
253
254As an alternative, to keep the case of the replacement word if it is
255longer than the original, you can use this code, by Jeff Pinyan:
256
257    sub preserve_case {
258        my ($from, $to) = @_;
259        my ($lf, $lt) = map length, @_;
260
261        if ($lt < $lf) { $from = substr $from, 0, $lt }
262        else { $from .= substr $to, $lf }
263
264        return uc $to | ($from ^ uc $from);
265    }
266
267This changes the sentence to "this is a SUcCess case."
268
269Just to show that C programmers can write C in any programming language,
270if you prefer a more C-like solution, the following script makes the
271substitution have the same case, letter by letter, as the original.
272(It also happens to run about 240% slower than the Perlish solution runs.)
273If the substitution has more characters than the string being substituted,
274the case of the last character is used for the rest of the substitution.
275
276    # Original by Nathan Torkington, massaged by Jeffrey Friedl
277    #
278    sub preserve_case
279    {
280        my ($old, $new) = @_;
281        my $state = 0; # 0 = no change; 1 = lc; 2 = uc
282        my ($i, $oldlen, $newlen, $c) = (0, length($old), length($new));
283        my $len = $oldlen < $newlen ? $oldlen : $newlen;
284
285        for ($i = 0; $i < $len; $i++) {
286            if ($c = substr($old, $i, 1), $c =~ /[\W\d_]/) {
287                $state = 0;
288            } elsif (lc $c eq $c) {
289                substr($new, $i, 1) = lc(substr($new, $i, 1));
290                $state = 1;
291            } else {
292                substr($new, $i, 1) = uc(substr($new, $i, 1));
293                $state = 2;
294            }
295        }
296        # finish up with any remaining new (for when new is longer than old)
297        if ($newlen > $oldlen) {
298            if ($state == 1) {
299                substr($new, $oldlen) = lc(substr($new, $oldlen));
300            } elsif ($state == 2) {
301                substr($new, $oldlen) = uc(substr($new, $oldlen));
302            }
303        }
304        return $new;
305    }
306
307=head2 How can I make C<\w> match national character sets?
308X<\w>
309
Put C<use locale;> in your script. The C<\w> character class is taken
from the current locale.
312
313See L<perllocale> for details.
314
315=head2 How can I match a locale-smart version of C</[a-zA-Z]/>?
316X<alpha>
317
318You can use the POSIX character class syntax C</[[:alpha:]]/>
319documented in L<perlre>.
320
321No matter which locale you are in, the alphabetic characters are
322the characters in \w without the digits and the underscore.
323As a regex, that looks like C</[^\W\d_]/>. Its complement,
324the non-alphabetics, is then everything in \W along with
325the digits and the underscore, or C</[\W\d_]/>.
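
As a quick illustration (the sample string is made up), both forms pick
out the same characters from a plain ASCII string:

    use strict;
    use warnings;

    my $text = 'tab_3x';

    my @posix   = $text =~ /[[:alpha:]]/g;
    my @derived = $text =~ /[^\W\d_]/g;

    print "@posix\n";      # prints "t a b x"
    print "@derived\n";    # prints "t a b x"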
326
327=head2 How can I quote a variable to use in a regex?
328X<regex, escaping> X<regexp, escaping> X<regular expression, escaping>
329
330The Perl parser will expand $variable and @variable references in
331regular expressions unless the delimiter is a single quote. Remember,
332too, that the right-hand side of a C<s///> substitution is considered
a double-quoted string (see L<perlop> for more details). Remember
also that any regex special characters in the variable will be acted on
unless you precede the variable with C<\Q>. Here's an example:
336
337    $string = "Placido P. Octopus";
338    $regex  = "P.";
339
340    $string =~ s/$regex/Polyp/;
341    # $string is now "Polypacido P. Octopus"
342
Because C<.> is special in regular expressions, and can match any
single character, the regex C<P.> here has matched the C<Pl> in the
original string.
346
347To escape the special meaning of C<.>, we use C<\Q>:
348
349    $string = "Placido P. Octopus";
350    $regex  = "P.";
351
352    $string =~ s/\Q$regex/Polyp/;
353    # $string is now "Placido Polyp Octopus"
354
355The use of C<\Q> causes the C<.> in the regex to be treated as a
356regular character, so that C<P.> matches a C<P> followed by a dot.
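
The same quoting is available as the C<quotemeta> function, and C<\E>
ends the quoted part so the rest of the pattern keeps its special
meaning. A small sketch, with strings invented for illustration:

    use strict;
    use warnings;

    my $literal = 'a.b*c';
    my $quoted  = quotemeta $literal;

    print "$quoted\n";    # prints "a\.b\*c"

    # unquoted, the pattern means "a, any character, zero or more b, c"
    my $loose = 'aXbc' =~ /$literal/ ? 1 : 0;

    # quoted, only the literal text a.b*c matches
    my $strict_miss = 'aXbc' =~ /\Q$literal\E/ ? 1 : 0;
    my $strict_hit  = 'see a.b*c here' =~ /^see \Q$literal\E here$/ ? 1 : 0;

    print "$loose $strict_miss $strict_hit\n";    # prints "1 0 1"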
357
358=head2 What is C</o> really for?
359X</o, regular expressions> X<compile, regular expressions>
360
361(contributed by brian d foy)
362
363The C</o> option for regular expressions (documented in L<perlop> and
364L<perlreref>) tells Perl to compile the regular expression only once.
365This is only useful when the pattern contains a variable. Perls 5.6
366and later handle this automatically if the pattern does not change.
367
368Since the match operator C<m//>, the substitution operator C<s///>,
369and the regular expression quoting operator C<qr//> are double-quotish
370constructs, you can interpolate variables into the pattern. See the
371answer to "How can I quote a variable to use in a regex?" for more
372details.
373
374This example takes a regular expression from the argument list and
375prints the lines of input that match it:
376
377    my $pattern = shift @ARGV;
378
379    while( <> ) {
380        print if m/$pattern/;
381    }
382
383Versions of Perl prior to 5.6 would recompile the regular expression
384for each iteration, even if C<$pattern> had not changed. The C</o>
385would prevent this by telling Perl to compile the pattern the first
386time, then reuse that for subsequent iterations:
387
388    my $pattern = shift @ARGV;
389
390    while( <> ) {
391        print if m/$pattern/o; # useful for Perl < 5.6
392    }
393
394In versions 5.6 and later, Perl won't recompile the regular expression
395if the variable hasn't changed, so you probably don't need the C</o>
396option. It doesn't hurt, but it doesn't help either. If you want any
397version of Perl to compile the regular expression only once even if
398the variable changes (thus, only using its initial value), you still
399need the C</o>.
400
401You can watch Perl's regular expression engine at work to verify for
402yourself if Perl is recompiling a regular expression. The C<use re
403'debug'> pragma (comes with Perl 5.005 and later) shows the details.
With Perls before 5.6, you should see C<re> reporting that it's
compiling the regular expression on each iteration. With Perl 5.6 or
later, you should only see C<re> report that for the first iteration.
407
408    use re 'debug';
409
410    my $regex = 'Perl';
411    foreach ( qw(Perl Java Ruby Python) ) {
412        print STDERR "-" x 73, "\n";
413        print STDERR "Trying $_...\n";
414        print STDERR "\t$_ is good!\n" if m/$regex/;
415    }
416
417=head2 How do I use a regular expression to strip C-style comments from a file?
418
419While this actually can be done, it's much harder than you'd think.
420For example, this one-liner
421
422    perl -0777 -pe 's{/\*.*?\*/}{}gs' foo.c
423
424will work in many but not all cases. You see, it's too simple-minded for
425certain kinds of C programs, in particular, those with what appear to be
426comments in quoted strings. For that, you'd need something like this,
427created by Jeffrey Friedl and later modified by Fred Curtis.
428
429    $/ = undef;
430    $_ = <>;
431    s#/\*[^*]*\*+([^/*][^*]*\*+)*/|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $2 ? $2 : ""#gse;
432    print;
433
434This could, of course, be more legibly written with the C</x> modifier, adding
435whitespace and comments. Here it is expanded, courtesy of Fred Curtis.
436
437    s{
438       /\*         ##  Start of /* ... */ comment
439       [^*]*\*+    ##  Non-* followed by 1-or-more *'s
440       (
441         [^/*][^*]*\*+
442       )*          ##  0-or-more things which don't start with /
443                   ##    but do end with '*'
444       /           ##  End of /* ... */ comment
445
446     |         ##     OR  various things which aren't comments:
447
448       (
449         "           ##  Start of " ... " string
450         (
451           \\.           ##  Escaped char
452         |               ##    OR
453           [^"\\]        ##  Non "\
454         )*
455         "           ##  End of " ... " string
456
457       |         ##     OR
458
459         '           ##  Start of ' ... ' string
460         (
461           \\.           ##  Escaped char
462         |               ##    OR
463           [^'\\]        ##  Non '\
464         )*
465         '           ##  End of ' ... ' string
466
467       |         ##     OR
468
         .           ##  Any other char
         [^/"'\\]*   ##  Chars which don't start a comment, string or escape
471       )
472     }{defined $2 ? $2 : ""}gxse;
473
474A slight modification also removes C++ comments, possibly spanning multiple lines
475using a continuation character:
476
477 s#/\*[^*]*\*+([^/*][^*]*\*+)*/|//([^\\]|[^\n][\n]?)*?\n|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $3 ? $3 : ""#gse;
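
As a quick check of the C-only substitution shown earlier in this
answer, this sketch applies it to a few lines of made-up C source held
in a string. The comments disappear while the comment-like text inside
the string literal survives:

    use strict;
    use warnings;

    my $source = join "\n",
        'int x = 1; /* set x */',
        'char *s = "/* not a comment */";',
        '/* a comment',
        ' * over two lines */',
        'return x;',
        '';

    # copy the source, then strip comments from the copy
    ( my $stripped = $source ) =~
        s#/\*[^*]*\*+([^/*][^*]*\*+)*/|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $2 ? $2 : ""#gse;

    print $stripped;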
478
479=head2 Can I use Perl regular expressions to match balanced text?
X<regex, matching balanced text> X<regexp, matching balanced text>
X<regular expression, matching balanced text> X<possessive> X<PARNO>
482X<Text::Balanced> X<Regexp::Common> X<backtracking> X<recursion>
483
484(contributed by brian d foy)
485
Your first try should probably be the L<Text::Balanced> module, which
has been in the Perl standard library since Perl 5.8. It has a variety of
488functions to deal with tricky text. The L<Regexp::Common> module can
489also help by providing canned patterns you can use.
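
For instance, here is a minimal sketch using C<extract_bracketed> from
L<Text::Balanced> (the sample text is invented; by default the function
expects the bracketed text at the start of the string, after optional
whitespace):

    use strict;
    use warnings;
    use Text::Balanced qw(extract_bracketed);

    my $text = '<another group <nested once> > and more';

    # in list context: the bracketed text, then whatever follows it
    my ( $extracted, $remainder ) = extract_bracketed( $text, '<>' );

    print "$extracted\n";    # prints "<another group <nested once> >"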
490
491As of Perl 5.10, you can match balanced text with regular expressions
492using recursive patterns. Before Perl 5.10, you had to resort to
493various tricks such as using Perl code in C<(??{})> sequences.
494
495Here's an example using a recursive regular expression. The goal is to
496capture all of the text within angle brackets, including the text in
497nested angle brackets. This sample text has two "major" groups: a
498group with one level of nesting and a group with two levels of
499nesting. There are five total groups in angle brackets:
500
501    I have some <brackets in <nested brackets> > and
502    <another group <nested once <nested twice> > >
503    and that's it.
504
505The regular expression to match the balanced text uses two new (to
506Perl 5.10) regular expression features. These are covered in L<perlre>
507and this example is a modified version of one in that documentation.
508
509First, adding the new possessive C<+> to any quantifier finds the
510longest match and does not backtrack. That's important since you want
511to handle any angle brackets through the recursion, not backtracking.
512The group C<< [^<>]++ >> finds one or more non-angle brackets without
513backtracking.
514
515Second, the new C<(?PARNO)> refers to the sub-pattern in the
516particular capture group given by C<PARNO>. In the following regex,
the first capture group finds (and remembers) the balanced text, and
you need that same pattern within the first capture group to get past the
519nested text. That's the recursive part. The C<(?1)> uses the pattern
520in the outer capture group as an independent part of the regex.
521
522Putting it all together, you have:
523
524    #!/usr/local/bin/perl5.10.0
525
526    my $string =<<"HERE";
527    I have some <brackets in <nested brackets> > and
528    <another group <nested once <nested twice> > >
529    and that's it.
530    HERE
531
532    my @groups = $string =~ m/
533            (                   # start of capture group 1
534            <                   # match an opening angle bracket
535                (?:
536                    [^<>]++     # one or more non angle brackets, non backtracking
537                      |
538                    (?1)        # found < or >, so recurse to capture group 1
539                )*
540            >                   # match a closing angle bracket
541            )                   # end of capture group 1
542            /xg;
543
544    $" = "\n\t";
545    print "Found:\n\t@groups\n";
546
547The output shows that Perl found the two major groups:
548
549    Found:
550        <brackets in <nested brackets> >
551        <another group <nested once <nested twice> > >
552
553With a little extra work, you can get all of the groups in angle
554brackets even if they are in other angle brackets too. Each time you
555get a balanced match, remove its outer delimiter (that's the one you
556just matched so don't match it again) and add it to a queue of strings
557to process. Keep doing that until you get no matches:
558
559    #!/usr/local/bin/perl5.10.0
560
561    my @queue =<<"HERE";
562    I have some <brackets in <nested brackets> > and
563    <another group <nested once <nested twice> > >
564    and that's it.
565    HERE
566
567    my $regex = qr/
568            (                   # start of bracket 1
569            <                   # match an opening angle bracket
570                (?:
571                    [^<>]++     # one or more non angle brackets, non backtracking
572                      |
573                    (?1)        # recurse to bracket 1
574                )*
575            >                   # match a closing angle bracket
576            )                   # end of bracket 1
577            /x;
578
579    $" = "\n\t";
580
581    while( @queue ) {
582        my $string = shift @queue;
583
584        my @groups = $string =~ m/$regex/g;
585        print "Found:\n\t@groups\n\n" if @groups;
586
587        unshift @queue, map { s/^<//; s/>$//; $_ } @groups;
588    }
589
590The output shows all of the groups. The outermost matches show up
591first and the nested matches show up later:
592
593    Found:
594        <brackets in <nested brackets> >
595        <another group <nested once <nested twice> > >
596
597    Found:
598        <nested brackets>
599
600    Found:
601        <nested once <nested twice> >
602
603    Found:
604        <nested twice>
605
606=head2 What does it mean that regexes are greedy? How can I get around it?
607X<greedy> X<greediness>
608
609Most people mean that greedy regexes match as much as they can.
610Technically speaking, it's actually the quantifiers (C<?>, C<*>, C<+>,
611C<{}>) that are greedy rather than the whole pattern; Perl prefers local
612greed and immediate gratification to overall greed. To get non-greedy
613versions of the same quantifiers, use (C<??>, C<*?>, C<+?>, C<{}?>).
614
615An example:
616
617    my $s1 = my $s2 = "I am very very cold";
618    $s1 =~ s/ve.*y //;      # I am cold
619    $s2 =~ s/ve.*?y //;     # I am very cold
620
621Notice how the second substitution stopped matching as soon as it
622encountered "y ". The C<*?> quantifier effectively tells the regular
623expression engine to find a match as quickly as possible and pass
624control on to whatever is next in line, as you would if you were
625playing hot potato.
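
Another common case is pulling out a quoted string. In this sketch (the
sample text is made up), the greedy version runs to the last quote, the
non-greedy version stops at the first closing quote, and a negated
character class gets the same result without relying on a non-greedy
quantifier:

    use strict;
    use warnings;

    my $text = 'say "one" and "two"';

    my ($greedy)  = $text =~ /"(.*)"/;        # runs to the last quote
    my ($stingy)  = $text =~ /"(.*?)"/;       # stops at the first closing quote
    my ($negated) = $text =~ /"([^"]*)"/;     # cannot cross a quote at all

    print "$greedy\n";     # prints 'one" and "two'
    print "$stingy\n";     # prints 'one'
    print "$negated\n";    # prints 'one'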
626
627=head2 How do I process each word on each line?
628X<word>
629
630Use the split function:
631
632    while (<>) {
633        foreach my $word ( split ) {
634            # do something with $word here
635        }
636    }
637
638Note that this isn't really a word in the English sense; it's just
639chunks of consecutive non-whitespace characters.
640
641To work with only alphanumeric sequences (including underscores), you
642might consider
643
644    while (<>) {
        foreach my $word ( m/(\w+)/g ) {
646            # do something with $word here
647        }
648    }
649
650=head2 How can I print out a word-frequency or line-frequency summary?
651
652To do this, you have to parse out each word in the input stream. We'll
653pretend that by word you mean chunk of alphabetics, hyphens, or
654apostrophes, rather than the non-whitespace chunk idea of a word given
655in the previous question:
656
657    my (%seen);
658    while (<>) {
659        while ( /(\b[^\W_\d][\w'-]+\b)/g ) {   # misses "`sheep'"
660            $seen{$1}++;
661        }
662    }
663
664    while ( my ($word, $count) = each %seen ) {
665        print "$count $word\n";
666    }
667
668If you wanted to do the same thing for lines, you wouldn't need a
669regular expression:
670
671    my (%seen);
672
673    while (<>) {
674        $seen{$_}++;
675    }
676
677    while ( my ($line, $count) = each %seen ) {
678        print "$count $line";
679    }
680
681If you want these output in a sorted order, see L<perlfaq4>: "How do I
682sort a hash (optionally by value instead of key)?".
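
As a brief sketch of that idea applied to the word counts (the sample
sentence is invented, and breaking ties alphabetically is just one
reasonable choice):

    use strict;
    use warnings;

    my $text = "the cat and the hat and the bat";

    my %seen;
    $seen{$_}++ for $text =~ /\b[^\W_\d][\w'-]+\b/g;

    # highest count first; ties broken alphabetically
    my @by_count = sort { $seen{$b} <=> $seen{$a} or $a cmp $b } keys %seen;

    print "$seen{$_} $_\n" for @by_count;
    # prints "3 the", "2 and", "1 bat", "1 cat", "1 hat" on separate lines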
683
684=head2 How can I do approximate matching?
685X<match, approximate> X<matching, approximate>
686
687See the module L<String::Approx> available from CPAN.
688
689=head2 How do I efficiently match many regular expressions at once?
690X<regex, efficiency> X<regexp, efficiency>
691X<regular expression, efficiency>
692
693(contributed by brian d foy)
694
If you have Perl 5.10 or later, this is almost trivial. You just smart
match against an array of regular expression objects. Be aware that
smart match has been flagged as experimental since Perl 5.18, so it
warns when warnings are enabled:

    my @patterns = ( qr/Fr.d/, qr/B.rn.y/, qr/W.lm./ );

    if( $string ~~ @patterns ) {
        ...
    }
703
704The smart match stops when it finds a match, so it doesn't have to try
705every expression.
706
707Earlier than Perl 5.10, you have a bit of work to do. You want to
708avoid compiling a regular expression every time you want to match it.
709In this example, perl must recompile the regular expression for every
710iteration of the C<foreach> loop since it has no way to know what
711C<$pattern> will be:
712
713    my @patterns = qw( foo bar baz );
714
    LINE: while( <> ) {
        foreach my $pattern ( @patterns ) {
717            if( /\b$pattern\b/i ) {
718                print;
719                next LINE;
720            }
721        }
722    }
723
724The C<qr//> operator showed up in perl 5.005. It compiles a regular
725expression, but doesn't apply it. When you use the pre-compiled
726version of the regex, perl does less work. In this example, I inserted
727a C<map> to turn each pattern into its pre-compiled form. The rest of
728the script is the same, but faster:
729
730    my @patterns = map { qr/\b$_\b/i } qw( foo bar baz );
731
732    LINE: while( <> ) {
        foreach my $pattern ( @patterns ) {
734            if( /$pattern/ ) {
735                print;
736                next LINE;
737            }
738        }
739    }
740
741In some cases, you may be able to make several patterns into a single
742regular expression. Beware of situations that require backtracking
743though.
744
745    my $regex = join '|', qw( foo bar baz );
746
747    LINE: while( <> ) {
748        print if /\b(?:$regex)\b/i;
749    }
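
If you would rather keep the patterns separate but still stop at the
first one that matches, one option is C<first> from the core
L<List::Util> module. This is a sketch rather than a recommendation;
the test strings and the C<%matched> hash are made up for illustration:

    use strict;
    use warnings;
    use List::Util qw(first);

    my @patterns = ( qr/Fr.d/, qr/B.rn.y/, qr/W.lm./ );
    my %matched;

    for my $string ( 'Barney Rubble', 'Dino' ) {
        # first() returns the first pattern for which the block is true,
        # or undef when none of them match
        my $hit = first { $string =~ $_ } @patterns;
        $matched{$string} = defined $hit ? 1 : 0;
        print "$string: ", ( $matched{$string} ? "matched" : "no match" ), "\n";
    }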
750
751For more details on regular expression efficiency, see I<Mastering
752Regular Expressions> by Jeffrey Friedl. He explains how the regular
753expressions engine works and why some patterns are surprisingly
754inefficient. Once you understand how perl applies regular expressions,
755you can tune them for individual situations.
756
757=head2 Why don't word-boundary searches with C<\b> work for me?
758X<\b>
759
760(contributed by brian d foy)
761
762Ensure that you know what \b really does: it's the boundary between a
763word character, \w, and something that isn't a word character. That
764thing that isn't a word character might be \W, but it can also be the
765start or end of the string.
766
767It's not (not!) the boundary between whitespace and non-whitespace,
768and it's not the stuff between words we use to create sentences.
769
770In regex speak, a word boundary (\b) is a "zero width assertion",
771meaning that it doesn't represent a character in the string, but a
772condition at a certain position.
773
For the regular expression, /\bPerl\b/, there has to be a word
boundary before the "P" and after the "l". As long as something other
than a word character precedes the "P" and follows the "l", the
pattern will match. These strings match /\bPerl\b/.
778
779    "Perl"    # no word char before "P" or after "l"
780    "Perl "   # same as previous (space is not a word char)
781    "'Perl'"  # the "'" char is not a word char
782    "Perl's"  # no word char before "P", non-word char after "l"
783
784These strings do not match /\bPerl\b/.
785
786    "Perl_"   # "_" is a word char!
787    "Perler"  # no word char before "P", but one after "l"
788
789You don't have to use \b to match words though. You can look for
790non-word characters surrounded by word characters. These strings
791match the pattern /\b'\b/.
792
793    "don't"   # the "'" char is surrounded by "n" and "t"
794    "qep'a'"  # the "'" char is surrounded by "p" and "a"
795
This string does not match /\b'\b/.
797
798    "foo'"    # there is no word char after non-word "'"
799
800You can also use the complement of \b, \B, to specify that there
801should not be a word boundary.
802
In the pattern /\Bam\B/, there must be a word character before the "a"
and after the "m". These strings match /\Bam\B/:
805
806    "llama"   # "am" surrounded by word chars
807    "Samuel"  # same
808
These strings do not match /\Bam\B/:
810
811    "Sam"      # no word boundary before "a", but one after "m"
812    "I am Sam" # "am" surrounded by non-word chars
813
814
815=head2 Why does using $&, $`, or $' slow my program down?
816X<$MATCH> X<$&> X<$POSTMATCH> X<$'> X<$PREMATCH> X<$`>
817
818(contributed by Anno Siegel)
819
820Once Perl sees that you need one of these variables anywhere in the
821program, it provides them on each and every pattern match. That means
822that on every pattern match the entire string will be copied, part of it
823to $`, part to $&, and part to $'. Thus the penalty is most severe with
824long strings and patterns that match often. Avoid $&, $', and $` if you
825can, but if you can't, once you've used them at all, use them at will
826because you've already paid the price. Remember that some algorithms
827really appreciate them. As of the 5.005 release, the $& variable is no
828longer "expensive" the way the other two are.
829
Since Perl 5.6.1 the special variables C<@-> and C<@+> can functionally
replace $`, $& and $'. These arrays contain the offsets of the beginning
and end of each match (see L<perlvar> for the full story), so they give
you essentially the same information, but without the risk of excessive
string copying.
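
A minimal sketch of that replacement, using C<substr> with the offsets
of the whole match in C<$-[0]> and C<$+[0]> (the sample string and
variable names are made up):

    use strict;
    use warnings;

    my $string = 'The quick brown fox';
    my ( $before, $match, $after );

    if ( $string =~ /quick/ ) {
        $before = substr $string, 0, $-[0];                # like $`
        $match  = substr $string, $-[0], $+[0] - $-[0];    # like $&
        $after  = substr $string, $+[0];                   # like $'
    }

    print "[$before] [$match] [$after]\n";
    # prints "[The ] [quick] [ brown fox]"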
835
836Perl 5.10 added three specials, C<${^MATCH}>, C<${^PREMATCH}>, and
837C<${^POSTMATCH}> to do the same job but without the global performance
838penalty. Perl 5.10 only sets these variables if you compile or execute the
839regular expression with the C</p> modifier.
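
Here is a short sketch of the C</p> variables in use, assuming Perl
5.10 or later (the sample string and the C<$report> variable are made
up for illustration):

    use strict;
    use warnings;

    my $string = 'The quick brown fox';
    my $report = '';

    if ( $string =~ /quick/p ) {
        $report = "[${^PREMATCH}] [${^MATCH}] [${^POSTMATCH}]";
    }

    print "$report\n";    # prints "[The ] [quick] [ brown fox]"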
840
841=head2 What good is C<\G> in a regular expression?
842X<\G>
843
844You use the C<\G> anchor to start the next match on the same
845string where the last match left off. The regular
846expression engine cannot skip over any characters to find
847the next match with this anchor, so C<\G> is similar to the
848beginning of string anchor, C<^>. The C<\G> anchor is typically
849used with the C<g> modifier. It uses the value of C<pos()>
850as the position to start the next match. As the match
851operator makes successive matches, it updates C<pos()> with the
852position of the next character past the last match (or the
853first character of the next match, depending on how you like
854to look at it). Each string has its own C<pos()> value.
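
To see how C<pos()> moves, here is a small sketch (the string is made up
for illustration) that records the position after each successful match:

    my $string = "aXbXc";

    my @positions;
    while( $string =~ m/X/g ) {
        push @positions, pos( $string );  # offset just past this match
    }

    # @positions now holds ( 2, 4 )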
855
Suppose you want to match all of the consecutive pairs of digits
in a string like "1122a44" and stop matching when you
encounter a non-digit. You want to match C<11> and C<22>, but
the letter C<a> shows up between C<22> and C<44> and you want
to stop at C<a>. Simply matching pairs of digits skips over
the C<a> and still matches C<44>.
862
863    $_ = "1122a44";
864    my @pairs = m/(\d\d)/g;   # qw( 11 22 44 )
865
866If you use the C<\G> anchor, you force the match after C<22> to
867start with the C<a>. The regular expression cannot match
868there since it does not find a digit, so the next match
869fails and the match operator returns the pairs it already
870found.
871
872    $_ = "1122a44";
873    my @pairs = m/\G(\d\d)/g; # qw( 11 22 )
874
875You can also use the C<\G> anchor in scalar context. You
876still need the C<g> modifier.
877
878    $_ = "1122a44";
879    while( m/\G(\d\d)/g ) {
880        print "Found $1\n";
881    }
882
883After the match fails at the letter C<a>, perl resets C<pos()>
884and the next match on the same string starts at the beginning.
885
886    $_ = "1122a44";
887    while( m/\G(\d\d)/g ) {
888        print "Found $1\n";
889    }
890
891    print "Found $1 after while" if m/(\d\d)/g; # finds "11"
892
893You can disable C<pos()> resets on fail with the C<c> modifier, documented
894in L<perlop> and L<perlreref>. Subsequent matches start where the last
895successful match ended (the value of C<pos()>) even if a match on the
896same string has failed in the meantime. In this case, the match after
897the C<while()> loop starts at the C<a> (where the last match stopped),
898and since it does not use any anchor it can skip over the C<a> to find
899C<44>.
900
901    $_ = "1122a44";
902    while( m/\G(\d\d)/gc ) {
903        print "Found $1\n";
904    }
905
906    print "Found $1 after while" if m/(\d\d)/g; # finds "44"
907
908Typically you use the C<\G> anchor with the C<c> modifier
909when you want to try a different match if one fails,
910such as in a tokenizer. Jeffrey Friedl offers this example
911which works in 5.004 or later.
912
913    while (<>) {
914        chomp;
915        PARSER: {
916            m/ \G( \d+\b    )/gcx   && do { print "number: $1\n";  redo; };
917            m/ \G( \w+      )/gcx   && do { print "word:   $1\n";  redo; };
918            m/ \G( \s+      )/gcx   && do { print "space:  $1\n";  redo; };
919            m/ \G( [^\w\d]+ )/gcx   && do { print "other:  $1\n";  redo; };
920        }
921    }
922
For each line, the C<PARSER> loop first tries to match a series
of digits followed by a word boundary. This match has to
start at the place the last match left off (or the beginning
of the string on the first match). Since
C<m/ \G( \d+\b )/gcx> uses the C<c> modifier, if the string does
not match that regular expression, perl does not reset C<pos()>
and the next match starts at the same position to try a
different pattern.
931
932=head2 Are Perl regexes DFAs or NFAs? Are they POSIX compliant?
933X<DFA> X<NFA> X<POSIX>
934
935While it's true that Perl's regular expressions resemble the DFAs
936(deterministic finite automata) of the egrep(1) program, they are in
937fact implemented as NFAs (non-deterministic finite automata) to allow
938backtracking and backreferencing. And they aren't POSIX-style either,
939because those guarantee worst-case behavior for all cases. (It seems
940that some people prefer guarantees of consistency, even when what's
941guaranteed is slowness.) See the book "Mastering Regular Expressions"
942(from O'Reilly) by Jeffrey Friedl for all the details you could ever
943hope to know on these matters (a full citation appears in
944L<perlfaq2>).
945
946=head2 What's wrong with using grep in a void context?
947X<grep>
948
The problem is that C<grep> builds a return list, regardless of the context.
This means you're making Perl go to the trouble of building a list that
you then just throw away. If the list is large, you waste both time and space.
If your intent is to iterate over the list, use a C<for> loop instead.
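
For example, assuming a made-up word list, a loop does the same work
without asking for a result list:

    my @words = qw( apple banana cherry );
    my $total = 0;

    # rather than:  grep { $total += length $_ } @words;
    foreach my $word ( @words ) {
        $total += length $word;
    }

    # $total is now 17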
954
In perls older than 5.8.1, C<map> suffers from this problem as well.
Since 5.8.1, C<map> is context aware: in void context, it constructs
no list.
958
959=head2 How can I match strings with multibyte characters?
960X<regex, and multibyte characters> X<regexp, and multibyte characters>
961X<regular expression, and multibyte characters> X<martian> X<encoding, Martian>
962
Since version 5.6, Perl has had some level of multibyte character
support, although Perl 5.8 or later is recommended. Supported multibyte
character repertoires include Unicode, and legacy encodings
through the Encode module. See L<perluniintro>, L<perlunicode>,
and L<Encode>.
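
As a rough sketch of what that support buys you (the sample bytes are an
assumption chosen for illustration), once you decode UTF-8 bytes into a
character string with L<Encode>, C<.> matches whole characters rather
than single bytes:

    use Encode qw( decode );

    my $bytes = "caf\xC3\xA9";               # UTF-8 bytes: "caf" plus e-acute
    my $chars = decode( 'UTF-8', $bytes );   # now a character string

    my $byte_count = () = $bytes =~ m/./gs;  # 5
    my $char_count = () = $chars =~ m/./gs;  # 4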
968
969If you are stuck with older Perls, you can do Unicode with the
970L<Unicode::String> module, and character conversions using the
971L<Unicode::Map8> and L<Unicode::Map> modules. If you are using
972Japanese encodings, you might try using the jperl 5.005_03.
973
974Finally, the following set of approaches was offered by Jeffrey
975Friedl, whose article in issue #5 of The Perl Journal talks about
976this very matter.
977
Let's suppose you have some weird Martian encoding where pairs of
ASCII uppercase letters encode single Martian letters (e.g. the two
bytes "CV" make a single Martian letter, as do the two bytes "SG",
"VS", "XX", etc.). Other bytes represent single characters, just like
ASCII.
983
984So, the string of Martian "I am CVSGXX!" uses 12 bytes to encode the
985nine characters 'I', ' ', 'a', 'm', ' ', 'CV', 'SG', 'XX', '!'.
986
987Now, say you want to search for the single character C</GX/>. Perl
988doesn't know about Martian, so it'll find the two bytes "GX" in the "I
989am CVSGXX!" string, even though that character isn't there: it just
990looks like it is because "SG" is next to "XX", but there's no real
991"GX". This is a big problem.
992
993Here are a few ways, all painful, to deal with it:
994
995    # Make sure adjacent "martian" bytes are no longer adjacent.
996    $martian =~ s/([A-Z][A-Z])/ $1 /g;
997
998    print "found GX!\n" if $martian =~ /GX/;
999
1000Or like this:
1001
1002    my @chars = $martian =~ m/([A-Z][A-Z]|[^A-Z])/g;
1003    # above is conceptually similar to:     my @chars = $text =~ m/(.)/g;
1004    #
    foreach my $char (@chars) {
        if ($char eq 'GX') {   # a block, so print runs before last
            print "found GX!\n";
            last;
        }
    }
1008
1009Or like this:
1010
1011    while ($martian =~ m/\G([A-Z][A-Z]|.)/gs) {  # \G probably unneeded
1012        if ($1 eq 'GX') {
1013            print "found GX!\n";
1014            last;
1015        }
1016    }
1017
1018Here's another, slightly less painful, way to do it from Benjamin
1019Goldberg, who uses a zero-width negative look-behind assertion.
1020
1021    print "found GX!\n" if    $martian =~ m/
1022        (?<![A-Z])
1023        (?:[A-Z][A-Z])*?
1024        GX
1025        /x;
1026
This succeeds if the "martian" character GX is in the string, and fails
otherwise. If you don't like using C<< (?<!) >>, a zero-width negative
look-behind assertion, you can replace C<< (?<![A-Z]) >> with
C<(?:^|[^A-Z])>.

It does have the drawback of putting the wrong thing in C<$-[0]> and
C<$+[0]> (the match can begin before the GX itself), but this usually
can be worked around.
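
One possible workaround, sketched with a made-up string that really does
contain GX, is to capture the GX and read C<$-[1]> and C<$+[1]> instead:

    my $martian = "I am CVGXSG!";

    my( $start, $end );
    if( $martian =~ m/ (?<![A-Z]) (?:[A-Z][A-Z])*? (GX) /x ) {
        ( $start, $end ) = ( $-[1], $+[1] );  # offsets of GX itself: 7, 9
    }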
1033
1034=head2 How do I match a regular expression that's in a variable?
1035X<regex, in variable> X<eval> X<regex> X<quotemeta> X<\Q, regex>
1036X<\E, regex> X<qr//>
1037
1038(contributed by brian d foy)
1039
1040We don't have to hard-code patterns into the match operator (or
1041anything else that works with regular expressions). We can put the
1042pattern in a variable for later use.
1043
1044The match operator is a double quote context, so you can interpolate
1045your variable just like a double quoted string. In this case, you
1046read the regular expression as user input and store it in C<$regex>.
1047Once you have the pattern in C<$regex>, you use that variable in the
1048match operator.
1049
1050    chomp( my $regex = <STDIN> );
1051
1052    if( $string =~ m/$regex/ ) { ... }
1053
1054Any regular expression special characters in C<$regex> are still
1055special, and the pattern still has to be valid or Perl will complain.
1056For instance, in this pattern there is an unpaired parenthesis.
1057
1058    my $regex = "Unmatched ( paren";
1059
1060    "Two parens to bind them all" =~ m/$regex/;
1061
When Perl compiles the regular expression, it treats the parenthesis
as the start of a capture group. When it doesn't find the closing
parenthesis, it complains:
1065
1066    Unmatched ( in regex; marked by <-- HERE in m/Unmatched ( <-- HERE  paren/ at script line 3.
1067
You can get around this in several ways depending on your situation.
First, if you don't want any of the characters in the string to be
special, you can escape them with C<quotemeta> before you use the string.
1071
1072    chomp( my $regex = <STDIN> );
1073    $regex = quotemeta( $regex );
1074
1075    if( $string =~ m/$regex/ ) { ... }
1076
1077You can also do this directly in the match operator using the C<\Q>
1078and C<\E> sequences. The C<\Q> tells Perl where to start escaping
1079special characters, and the C<\E> tells it where to stop (see L<perlop>
1080for more details).
1081
1082    chomp( my $regex = <STDIN> );
1083
1084    if( $string =~ m/\Q$regex\E/ ) { ... }
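
To see what the quoting changes, here is a small sketch in which a
hard-coded string stands in for the user input:

    my $input  = "1+1=2";
    my $quoted = quotemeta( $input );   # 1\+1\=2

    my $raw_hit    = "1+1=2" =~ m/$input/     ? 1 : 0;  # 0: + is a quantifier
    my $quoted_hit = "1+1=2" =~ m/\Q$input\E/ ? 1 : 0;  # 1: literal match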
1085
Alternatively, you can use C<qr//>, the regular expression quote operator (see
L<perlop> for more details). It quotes and perhaps compiles the pattern,
and you can apply regular expression flags to the pattern.
1089
1090    chomp( my $input = <STDIN> );
1091
1092    my $regex = qr/$input/is;
1093
    $string =~ m/$regex/;  # same as m/$input/is
1095
1096You might also want to trap any errors by wrapping an C<eval> block
1097around the whole thing.
1098
1099    chomp( my $input = <STDIN> );
1100
1101    eval {
1102        if( $string =~ m/\Q$input\E/ ) { ... }
1103    };
1104    warn $@ if $@;
1105
1106Or...
1107
1108    my $regex = eval { qr/$input/is };
1109    if( defined $regex ) {
1110        $string =~ m/$regex/;
1111    }
1112    else {
1113        warn $@;
1114    }
1115
1116=head1 AUTHOR AND COPYRIGHT
1117
1118Copyright (c) 1997-2010 Tom Christiansen, Nathan Torkington, and
1119other authors as noted. All rights reserved.
1120
1121This documentation is free; you can redistribute it and/or modify it
1122under the same terms as Perl itself.
1123
1124Irrespective of its distribution, all code examples in this file
1125are hereby placed into the public domain. You are permitted and
1126encouraged to use this code in your own programs for fun
1127or for profit as you see fit. A simple comment in the code giving
1128credit would be courteous but is not required.
1129