=head1 NAME

perlfaq6 - Regular Expressions

=head1 VERSION

version 5.20200523

=head1 DESCRIPTION

This section is surprisingly small because the rest of the FAQ is
littered with answers involving regular expressions. For example,
decoding a URL and checking whether something is a number can be handled
with regular expressions, but those answers are found elsewhere in
this document (in L<perlfaq9>: "How do I decode or create those %-encodings
on the web" and L<perlfaq4>: "How do I determine whether a scalar is
a number/whole/integer/float", to be precise).

=head2 How can I hope to use regular expressions without creating illegible and unmaintainable code?
X<regex, legibility> X<regexp, legibility>
X<regular expression, legibility> X</x>

Three techniques can make regular expressions maintainable and
understandable.

=over 4

=item Comments Outside the Regex

Describe what you're doing and how you're doing it, using normal Perl
comments.

    # turn the line into the first word, a colon, and the
    # number of characters on the rest of the line
    s/^(\w+)(.*)/ lc($1) . ":" . length($2) /meg;

=item Comments Inside the Regex

The C</x> modifier causes whitespace to be ignored in a regex pattern
(except in a character class and a few other places), and also allows you to
use normal comments there, too. As you can imagine, whitespace and comments
help a lot.

C</x> lets you turn this:

    s{<(?:[^>'"]*|".*?"|'.*?')+>}{}gs;

into this:

    s{ <                    # opening angle bracket
        (?:                 # Non-backreffing grouping paren
            [^>'"] *        # 0 or more things that are neither > nor ' nor "
                |           #    or else
            ".*?"           # a section between double quotes (stingy match)
                |           #    or else
            '.*?'           # a section between single quotes (stingy match)
        ) +                 #   all occurring one or more times
        >                   # closing angle bracket
    }{}gsx;                 # replace with nothing, i.e.
                            # delete

It's still not quite so clear as prose, but it is very useful for
describing the meaning of each part of the pattern.

=item Different Delimiters

While we normally think of patterns as being delimited with C</>
characters, they can be delimited by almost any character. L<perlre>
describes this. For example, the C<s///> above uses braces as
delimiters. Selecting another delimiter can avoid quoting the
delimiter within the pattern:

    s/\/usr\/local/\/usr\/share/g;    # bad delimiter choice
    s#/usr/local#/usr/share#g;        # better

Using logically paired delimiters can be even more readable:

    s{/usr/local}{/usr/share}g;       # better still

=back

=head2 I'm having trouble matching over more than one line. What's wrong?
X<regex, multiline> X<regexp, multiline> X<regular expression, multiline>

Either you don't have more than one line in the string you're looking
at (probably), or else you aren't using the correct modifier(s) on
your pattern (possibly).

There are many ways to get multiline data into a string. If you want
it to happen automatically while reading input, you'll want to set $/
(probably to '' for paragraphs or C<undef> for the whole file) to
allow you to read more than one line at a time.

Read L<perlre> to help you decide which of C</s> and C</m> (or both)
you might want to use: C</s> allows dot to include newline, and C</m>
allows caret and dollar to match next to a newline, not just at the
end of the string. You do need to make sure that you've actually
got a multiline string in there.

For example, this program detects duplicate words, even when they span
line breaks (but not paragraph ones). For this example, we don't need
C</s> because we aren't using dot in a regular expression that we want
to cross line boundaries. Neither do we need C</m> because we don't
want caret or dollar to match at any point inside the record next
to newlines.
But it's imperative that $/ be set to something other
than the default, or else we won't actually ever have a multiline
record read in.

    $/ = '';  # read in whole paragraph, not just one line
    while ( <> ) {
        while ( /\b([\w'-]+)(\s+\g1)+\b/gi ) {  # word starts alpha
            print "Duplicate $1 at paragraph $.\n";
        }
    }

Here's some code that finds lines that begin with "From " (which would
be mangled by many mailers):

    $/ = '';  # read in whole paragraph, not just one line
    while ( <> ) {
        while ( /^From /gm ) { # /m makes ^ match next to \n
            print "leading From in paragraph $.\n";
        }
    }

Here's code that finds everything between START and END in a file:

    undef $/;  # read in whole file, not just one line or paragraph
    while ( <> ) {
        while ( /START(.*?)END/sgm ) { # /s makes . cross line boundaries
            print "$1\n";
        }
    }

=head2 How can I pull out lines between two patterns that are themselves on different lines?
X<..>

You can use Perl's somewhat exotic C<..> operator (documented in
L<perlop>):

    perl -ne 'print if /START/ .. /END/' file1 file2 ...

If you wanted text and not lines, you would use

    perl -0777 -ne 'print "$1\n" while /START(.*?)END/gs' file1 file2 ...

But if you want nested occurrences of C<START> through C<END>, you'll
run up against the problem described in the question in this section
on matching balanced text.

Here's another example of using C<..>:

    while (<>) {
        my $in_header =   1  .. /^$/;
        my $in_body   = /^$/ .. eof;
        # now choose between them
    } continue {
        $. = 0 if eof;  # fix $.
    }

=head2 How do I match XML, HTML, or other nasty, ugly things with a regex?
X<regex, XML> X<regex, HTML> X<XML> X<HTML> X<pain> X<frustration>
X<sucking out, will to live>

Do not use regexes. Use a module and forget about the
regular expressions.
The L<XML::LibXML>, L<HTML::TokeParser> and
L<HTML::TreeBuilder> modules are good starts, although each namespace
has other parsing modules specialized for certain tasks and different
ways of doing it. Start at CPAN Search ( L<http://metacpan.org/> )
and wonder at all the work people have done for you already! :)

=head2 I put a regular expression into $/ but it didn't work. What's wrong?
X<$/, regexes in> X<$INPUT_RECORD_SEPARATOR, regexes in>
X<$RS, regexes in>

$/ has to be a string. You can use these examples if you really need to
do this.

If you have L<File::Stream>, this is easy.

    use File::Stream;

    my $stream = File::Stream->new(
        $filehandle,
        separator => qr/\s*,\s*/,
        );

    print "$_\n" while <$stream>;

If you don't have File::Stream, you have to do a little more work.

You can use the four-argument form of sysread to continually add to
a buffer. After you add to the buffer, you check if you have a
complete line (using your regular expression).

    local $_ = "";
    while( sysread FH, $_, 8192, length ) {
        while( s/^((?s).*?)your_pattern// ) {
            my $record = $1;
            # do stuff here.
        }
    }

You can do the same thing with foreach and a match using the
c flag and the \G anchor, if you do not mind your entire file
being in memory at the end.

    local $_ = "";
    while( sysread FH, $_, 8192, length ) {
        foreach my $record ( m/\G((?s).*?)your_pattern/gc ) {
            # do stuff here.
        }
        substr( $_, 0, pos ) = "" if pos;
    }


=head2 How do I substitute case-insensitively on the LHS while preserving case on the RHS?
X<replace, case preserving> X<substitute, case preserving>
X<substitution, case preserving> X<s, case preserving>

Here's a lovely Perlish solution by Larry Rosler. It exploits
properties of bitwise xor on ASCII strings.

    $_= "this is a TEsT case";

    $old = 'test';
    $new = 'success';

    s{(\Q$old\E)}
    { uc $new | (uc $1 ^ $1) .
        (uc(substr $1, -1) ^ substr $1, -1) x
        (length($new) - length $1)
    }egi;

    print;

And here it is as a subroutine, modeled after the above:

    sub preserve_case {
        my ($old, $new) = @_;
        my $mask = uc $old ^ $old;

        uc $new | $mask .
            substr($mask, -1) x (length($new) - length($old))
    }

    $string = "this is a TEsT case";
    $string =~ s/(test)/preserve_case($1, "success")/egi;
    print "$string\n";

This prints:

    this is a SUcCESS case

As an alternative, to keep the case of the replacement word if it is
longer than the original, you can use this code, by Jeff Pinyan:

    sub preserve_case {
        my ($from, $to) = @_;
        my ($lf, $lt) = map length, @_;

        if ($lt < $lf) { $from = substr $from, 0, $lt }
        else { $from .= substr $to, $lf }

        return uc $to | ($from ^ uc $from);
    }

This changes the sentence to "this is a SUcCess case."

Just to show that C programmers can write C in any programming language,
if you prefer a more C-like solution, the following script makes the
substitution have the same case, letter by letter, as the original.
(It also happens to run about 240% slower than the Perlish solution runs.)
If the substitution has more characters than the string being substituted,
the case of the last character is used for the rest of the substitution.

    # Original by Nathan Torkington, massaged by Jeffrey Friedl
    #
    sub preserve_case
    {
        my ($old, $new) = @_;
        my $state = 0; # 0 = no change; 1 = lc; 2 = uc
        my ($i, $oldlen, $newlen, $c) = (0, length($old), length($new));
        my $len = $oldlen < $newlen ?
                  $oldlen : $newlen;

        for ($i = 0; $i < $len; $i++) {
            if ($c = substr($old, $i, 1), $c =~ /[\W\d_]/) {
                $state = 0;
            } elsif (lc $c eq $c) {
                substr($new, $i, 1) = lc(substr($new, $i, 1));
                $state = 1;
            } else {
                substr($new, $i, 1) = uc(substr($new, $i, 1));
                $state = 2;
            }
        }
        # finish up with any remaining new (for when new is longer than old)
        if ($newlen > $oldlen) {
            if ($state == 1) {
                substr($new, $oldlen) = lc(substr($new, $oldlen));
            } elsif ($state == 2) {
                substr($new, $oldlen) = uc(substr($new, $oldlen));
            }
        }
        return $new;
    }

=head2 How can I make C<\w> match national character sets?
X<\w>

Put C<use locale;> in your script. The \w character class is taken
from the current locale.

See L<perllocale> for details.

=head2 How can I match a locale-smart version of C</[a-zA-Z]/>?
X<alpha>

You can use the POSIX character class syntax C</[[:alpha:]]/>
documented in L<perlre>.

No matter which locale you are in, the alphabetic characters are
the characters in \w without the digits and the underscore.
As a regex, that looks like C</[^\W\d_]/>. Its complement,
the non-alphabetics, is then everything in \W along with
the digits and the underscore, or C</[\W\d_]/>.

=head2 How can I quote a variable to use in a regex?
X<regex, escaping> X<regexp, escaping> X<regular expression, escaping>

The Perl parser will expand $variable and @variable references in
regular expressions unless the delimiter is a single quote. Remember,
too, that the right-hand side of a C<s///> substitution is considered
a double-quoted string (see L<perlop> for more details). Remember
also that any regex special characters in the interpolated value will
be acted on unless you precede the variable with \Q. Here's an example:

    $string = "Placido P. Octopus";
    $regex  = "P.";

    $string =~ s/$regex/Polyp/;
    # $string is now "Polypacido P.
    # Octopus"

Because C<.> is special in regular expressions, and can match any
single character, the regex C<P.> here has matched the C<Pl> in the
original string.

To escape the special meaning of C<.>, we use C<\Q>:

    $string = "Placido P. Octopus";
    $regex  = "P.";

    $string =~ s/\Q$regex/Polyp/;
    # $string is now "Placido Polyp Octopus"

The use of C<\Q> causes the C<.> in the regex to be treated as a
regular character, so that C<P.> matches a C<P> followed by a dot.

=head2 What is C</o> really for?
X</o, regular expressions> X<compile, regular expressions>

(contributed by brian d foy)

The C</o> option for regular expressions (documented in L<perlop> and
L<perlreref>) tells Perl to compile the regular expression only once.
This is only useful when the pattern contains a variable. Perls 5.6
and later handle this automatically if the pattern does not change.

Since the match operator C<m//>, the substitution operator C<s///>,
and the regular expression quoting operator C<qr//> are double-quotish
constructs, you can interpolate variables into the pattern. See the
answer to "How can I quote a variable to use in a regex?" for more
details.

This example takes a regular expression from the argument list and
prints the lines of input that match it:

    my $pattern = shift @ARGV;

    while( <> ) {
        print if m/$pattern/;
    }

Versions of Perl prior to 5.6 would recompile the regular expression
for each iteration, even if C<$pattern> had not changed. The C</o>
would prevent this by telling Perl to compile the pattern the first
time, then reuse that for subsequent iterations:

    my $pattern = shift @ARGV;

    while( <> ) {
        print if m/$pattern/o; # useful for Perl < 5.6
    }

In versions 5.6 and later, Perl won't recompile the regular expression
if the variable hasn't changed, so you probably don't need the C</o>
option.
It doesn't hurt, but it doesn't help either. If you want any
version of Perl to compile the regular expression only once even if
the variable changes (thus, only using its initial value), you still
need the C</o>.

You can watch Perl's regular expression engine at work to verify for
yourself if Perl is recompiling a regular expression. The C<use re
'debug'> pragma (comes with Perl 5.005 and later) shows the details.
With Perls before 5.6, you should see C<re> reporting that it's
compiling the regular expression on each iteration. With Perl 5.6 or
later, you should only see C<re> report that for the first iteration.

    use re 'debug';

    my $regex = 'Perl';
    foreach ( qw(Perl Java Ruby Python) ) {
        print STDERR "-" x 73, "\n";
        print STDERR "Trying $_...\n";
        print STDERR "\t$_ is good!\n" if m/$regex/;
    }

=head2 How do I use a regular expression to strip C-style comments from a file?

While this actually can be done, it's much harder than you'd think.
For example, this one-liner

    perl -0777 -pe 's{/\*.*?\*/}{}gs' foo.c

will work in many but not all cases. You see, it's too simple-minded for
certain kinds of C programs, in particular, those with what appear to be
comments in quoted strings. For that, you'd need something like this,
created by Jeffrey Friedl and later modified by Fred Curtis.

    $/ = undef;
    $_ = <>;
    s#/\*[^*]*\*+([^/*][^*]*\*+)*/|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $2 ? $2 : ""#gse;
    print;

This could, of course, be more legibly written with the C</x> modifier, adding
whitespace and comments. Here it is expanded, courtesy of Fred Curtis.

    s{
       /\*         ##  Start of /* ... */ comment
       [^*]*\*+    ##  Non-* followed by 1-or-more *'s
       (
         [^/*][^*]*\*+
       )*          ##  0-or-more things which don't start with /
                   ##    but do end with '*'
       /           ##  End of /* ...
                   ##    */ comment

     |         ##     OR  various things which aren't comments:

       (
         "           ##  Start of " ... " string
         (
           \\.           ##  Escaped char
         |               ##    OR
           [^"\\]        ##  Non "\
         )*
         "           ##  End of " ... " string

       |         ##     OR

         '           ##  Start of ' ... ' string
         (
           \\.           ##  Escaped char
         |               ##    OR
           [^'\\]        ##  Non '\
         )*
         '           ##  End of ' ... ' string

       |         ##     OR

         .           ##  Any other char
         [^/"'\\]*   ##  Chars which don't start a comment, string or escape
       )
    }{defined $2 ? $2 : ""}gxse;

A slight modification also removes C++ comments, possibly spanning multiple lines
using a continuation character:

    s#/\*[^*]*\*+([^/*][^*]*\*+)*/|//([^\\]|[^\n][\n]?)*?\n|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $3 ? $3 : ""#gse;

=head2 Can I use Perl regular expressions to match balanced text?
X<regex, matching balanced test> X<regexp, matching balanced test>
X<regular expression, matching balanced test> X<possessive> X<PARNO>
X<Text::Balanced> X<Regexp::Common> X<backtracking> X<recursion>

(contributed by brian d foy)

Your first try should probably be the L<Text::Balanced> module, which
has been in the Perl standard library since Perl 5.8. It has a variety of
functions to deal with tricky text. The L<Regexp::Common> module can
also help by providing canned patterns you can use.

As of Perl 5.10, you can match balanced text with regular expressions
using recursive patterns. Before Perl 5.10, you had to resort to
various tricks such as using Perl code in C<(??{})> sequences.

Here's an example using a recursive regular expression. The goal is to
capture all of the text within angle brackets, including the text in
nested angle brackets. This sample text has two "major" groups: a
group with one level of nesting and a group with two levels of
nesting.
There are five total groups in angle brackets:

    I have some <brackets in <nested brackets> > and
    <another group <nested once <nested twice> > >
    and that's it.

The regular expression to match the balanced text uses two new (to
Perl 5.10) regular expression features. These are covered in L<perlre>
and this example is a modified version of one in that documentation.

First, adding the new possessive C<+> to any quantifier finds the
longest match and does not backtrack. That's important since you want
to handle any angle brackets through the recursion, not backtracking.
The group C<< [^<>]++ >> finds one or more non-angle brackets without
backtracking.

Second, the new C<(?PARNO)> refers to the sub-pattern in the
particular capture group given by C<PARNO>. In the following regex,
the first capture group finds (and remembers) the balanced text, and
you need that same pattern within the first buffer to get past the
nested text. That's the recursive part. The C<(?1)> uses the pattern
in the outer capture group as an independent part of the regex.

Putting it all together, you have:

    #!/usr/local/bin/perl5.10.0

    my $string =<<"HERE";
    I have some <brackets in <nested brackets> > and
    <another group <nested once <nested twice> > >
    and that's it.
    HERE

    my @groups = $string =~ m/
            (                   # start of capture group 1
            <                   # match an opening angle bracket
                (?:
                    [^<>]++     # one or more non angle brackets, non backtracking
                      |
                    (?1)        # found < or >, so recurse to capture group 1
                )*
            >                   # match a closing angle bracket
            )                   # end of capture group 1
            /xg;

    $" = "\n\t";
    print "Found:\n\t@groups\n";

The output shows that Perl found the two major groups:

    Found:
        <brackets in <nested brackets> >
        <another group <nested once <nested twice> > >

With a little extra work, you can get all of the groups in angle
brackets even if they are in other angle brackets too. Each time you
get a balanced match, remove its outer delimiter (that's the one you
just matched so don't match it again) and add it to a queue of strings
to process. Keep doing that until you get no matches:

    #!/usr/local/bin/perl5.10.0

    my @queue =<<"HERE";
    I have some <brackets in <nested brackets> > and
    <another group <nested once <nested twice> > >
    and that's it.
    HERE

    my $regex = qr/
            (                   # start of bracket 1
            <                   # match an opening angle bracket
                (?:
                    [^<>]++     # one or more non angle brackets, non backtracking
                      |
                    (?1)        # recurse to bracket 1
                )*
            >                   # match a closing angle bracket
            )                   # end of bracket 1
            /x;

    $" = "\n\t";

    while( @queue ) {
        my $string = shift @queue;

        my @groups = $string =~ m/$regex/g;
        print "Found:\n\t@groups\n\n" if @groups;

        unshift @queue, map { s/^<//; s/>$//; $_ } @groups;
    }

The output shows all of the groups.
The outermost matches show up
first and the nested matches show up later:

    Found:
        <brackets in <nested brackets> >
        <another group <nested once <nested twice> > >

    Found:
        <nested brackets>

    Found:
        <nested once <nested twice> >

    Found:
        <nested twice>

=head2 What does it mean that regexes are greedy? How can I get around it?
X<greedy> X<greediness>

Most people mean that greedy regexes match as much as they can.
Technically speaking, it's actually the quantifiers (C<?>, C<*>, C<+>,
C<{}>) that are greedy rather than the whole pattern; Perl prefers local
greed and immediate gratification to overall greed. To get non-greedy
versions of the same quantifiers, use (C<??>, C<*?>, C<+?>, C<{}?>).

An example:

    my $s1 = my $s2 = "I am very very cold";
    $s1 =~ s/ve.*y //;      # I am cold
    $s2 =~ s/ve.*?y //;     # I am very cold

Notice how the second substitution stopped matching as soon as it
encountered "y ". The C<*?> quantifier effectively tells the regular
expression engine to find a match as quickly as possible and pass
control on to whatever is next in line, as you would if you were
playing hot potato.

=head2 How do I process each word on each line?
X<word>

Use the split function:

    while (<>) {
        foreach my $word ( split ) {
            # do something with $word here
        }
    }

Note that this isn't really a word in the English sense; it's just
chunks of consecutive non-whitespace characters.

To work with only alphanumeric sequences (including underscores), you
might consider

    while (<>) {
        foreach my $word (m/(\w+)/g) {
            # do something with $word here
        }
    }

=head2 How can I print out a word-frequency or line-frequency summary?

To do this, you have to parse out each word in the input stream.
We'll
pretend that by word you mean a chunk of alphabetics, hyphens, or
apostrophes, rather than the non-whitespace chunk idea of a word given
in the previous question:

    my (%seen);
    while (<>) {
        while ( /(\b[^\W_\d][\w'-]+\b)/g ) {   # misses "`sheep'"
            $seen{$1}++;
        }
    }

    while ( my ($word, $count) = each %seen ) {
        print "$count $word\n";
    }

If you wanted to do the same thing for lines, you wouldn't need a
regular expression:

    my (%seen);

    while (<>) {
        $seen{$_}++;
    }

    while ( my ($line, $count) = each %seen ) {
        print "$count $line";
    }

If you want these output in a sorted order, see L<perlfaq4>: "How do I
sort a hash (optionally by value instead of key)?".

=head2 How can I do approximate matching?
X<match, approximate> X<matching, approximate>

See the module L<String::Approx> available from CPAN.

=head2 How do I efficiently match many regular expressions at once?
X<regex, efficiency> X<regexp, efficiency>
X<regular expression, efficiency>

(contributed by brian d foy)

If you have Perl 5.10 or later, this is almost trivial. You just smart
match against an array of regular expression objects:

    my @patterns = ( qr/Fr.d/, qr/B.rn.y/, qr/W.lm./ );

    if( $string ~~ @patterns ) {
        ...
    };

The smart match stops when it finds a match, so it doesn't have to try
every expression.

Earlier than Perl 5.10, you have a bit of work to do. You want to
avoid compiling a regular expression every time you want to match it.
In this example, perl must recompile the regular expression for every
iteration of the C<foreach> loop since it has no way to know what
C<$pattern> will be:

    my @patterns = qw( foo bar baz );

    LINE: while( <DATA> ) {
        foreach $pattern ( @patterns ) {
            if( /\b$pattern\b/i ) {
                print;
                next LINE;
            }
        }
    }

The C<qr//> operator showed up in perl 5.005. It compiles a regular
expression, but doesn't apply it. When you use the pre-compiled
version of the regex, perl does less work. In this example, I inserted
a C<map> to turn each pattern into its pre-compiled form. The rest of
the script is the same, but faster:

    my @patterns = map { qr/\b$_\b/i } qw( foo bar baz );

    LINE: while( <> ) {
        foreach $pattern ( @patterns ) {
            if( /$pattern/ ) {
                print;
                next LINE;
            }
        }
    }

In some cases, you may be able to make several patterns into a single
regular expression. Beware of situations that require backtracking
though.

    my $regex = join '|', qw( foo bar baz );

    LINE: while( <> ) {
        print if /\b(?:$regex)\b/i;
    }

For more details on regular expression efficiency, see I<Mastering
Regular Expressions> by Jeffrey Friedl. He explains how the regular
expressions engine works and why some patterns are surprisingly
inefficient. Once you understand how perl applies regular expressions,
you can tune them for individual situations.

=head2 Why don't word-boundary searches with C<\b> work for me?
X<\b>

(contributed by brian d foy)

Ensure that you know what \b really does: it's the boundary between a
word character, \w, and something that isn't a word character. That
thing that isn't a word character might be \W, but it can also be the
start or end of the string.

It's not (not!) the boundary between whitespace and non-whitespace,
and it's not the stuff between words we use to create sentences.
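
The difference between a word boundary and a whitespace boundary can be
sketched in a few lines. This is only an illustration: the sample string
and the names C<$text>, C<@word_runs>, and C<@chunks> are assumptions
made up for this example, not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical sample text, chosen only for illustration.
my $text = "don't stop";

# \b sits between a \w character and a non-\w character, so the
# apostrophe splits "don't" into two separate runs of word characters.
my @word_runs = $text =~ /\b(\w+)\b/g;

# Splitting on whitespace keeps "don't" together as a single chunk.
my @chunks = split ' ', $text;

print "word runs: @word_runs\n";   # word runs: don t stop
print "chunks:    @chunks\n";      # chunks:    don't stop
```

The apostrophe is not a word character, so \b matches on both sides of
it even though there is no whitespace anywhere near it.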

In regex speak, a word boundary (\b) is a "zero width assertion",
meaning that it doesn't represent a character in the string, but a
condition at a certain position.

For the regular expression, /\bPerl\b/, there has to be a word
boundary before the "P" and after the "l". As long as something other
than a word character precedes the "P" and follows the "l", the
pattern will match. These strings match /\bPerl\b/.

    "Perl"    # no word char before "P" or after "l"
    "Perl "   # same as previous (space is not a word char)
    "'Perl'"  # the "'" char is not a word char
    "Perl's"  # no word char before "P", non-word char after "l"

These strings do not match /\bPerl\b/.

    "Perl_"   # "_" is a word char!
    "Perler"  # no word char before "P", but one after "l"

You don't have to use \b to match words though. You can look for
non-word characters surrounded by word characters. These strings
match the pattern /\b'\b/.

    "don't"   # the "'" char is surrounded by "n" and "t"
    "qep'a'"  # the "'" char is surrounded by "p" and "a"

This string does not match /\b'\b/.

    "foo'"    # there is no word char after non-word "'"

You can also use the complement of \b, \B, to specify that there
should not be a word boundary.

In the pattern /\Bam\B/, there must be a word character before the "a"
and after the "m". These strings match /\Bam\B/:

    "llama"   # "am" surrounded by word chars
    "Samuel"  # same

These strings do not match /\Bam\B/:

    "Sam"      # no word boundary before "a", but one after "m"
    "I am Sam" # "am" surrounded by non-word chars


=head2 Why does using $&, $`, or $' slow my program down?
X<$MATCH> X<$&> X<$POSTMATCH> X<$'> X<$PREMATCH> X<$`>

(contributed by Anno Siegel)

Once Perl sees that you need one of these variables anywhere in the
program, it provides them on each and every pattern match.
That means
that on every pattern match the entire string will be copied, part of it
to $`, part to $&, and part to $'. Thus the penalty is most severe with
long strings and patterns that match often. Avoid $&, $', and $` if you
can, but if you can't, once you've used them at all, use them at will
because you've already paid the price. Remember that some algorithms
really appreciate them. As of the 5.005 release, the $& variable is no
longer "expensive" the way the other two are.

Since Perl 5.6.1 the special variables @- and @+ can functionally replace
$`, $& and $'. These arrays contain the offsets of the beginning and end
of each match (see L<perlvar> for the full story), so they give you
essentially the same information, but without the risk of excessive
string copying.

Perl 5.10 added three specials, C<${^MATCH}>, C<${^PREMATCH}>, and
C<${^POSTMATCH}> to do the same job but without the global performance
penalty. Perl 5.10 only sets these variables if you compile or execute the
regular expression with the C</p> modifier.

=head2 What good is C<\G> in a regular expression?
X<\G>

You use the C<\G> anchor to start the next match on the same
string where the last match left off. The regular
expression engine cannot skip over any characters to find
the next match with this anchor, so C<\G> is similar to the
beginning of string anchor, C<^>. The C<\G> anchor is typically
used with the C<g> modifier. It uses the value of C<pos()>
as the position to start the next match. As the match
operator makes successive matches, it updates C<pos()> with the
position of the next character past the last match (or the
first character of the next match, depending on how you like
to look at it). Each string has its own C<pos()> value.

Suppose you want to match all of the consecutive pairs of digits
in a string like "1122a44" and stop matching when you
encounter non-digits.
You want to match C<11> and C<22> but
the letter C<a> shows up between C<22> and C<44> and you want
to stop at C<a>. Simply matching pairs of digits skips over
the C<a> and still matches C<44>.

    $_ = "1122a44";
    my @pairs = m/(\d\d)/g;   # qw( 11 22 44 )

If you use the C<\G> anchor, you force the match after C<22> to
start with the C<a>. The regular expression cannot match
there since it does not find a digit, so the next match
fails and the match operator returns the pairs it already
found.

    $_ = "1122a44";
    my @pairs = m/\G(\d\d)/g; # qw( 11 22 )

You can also use the C<\G> anchor in scalar context. You
still need the C<g> modifier.

    $_ = "1122a44";
    while( m/\G(\d\d)/g ) {
        print "Found $1\n";
    }

After the match fails at the letter C<a>, perl resets C<pos()>
and the next match on the same string starts at the beginning.

    $_ = "1122a44";
    while( m/\G(\d\d)/g ) {
        print "Found $1\n";
    }

    print "Found $1 after while" if m/(\d\d)/g; # finds "11"

You can disable C<pos()> resets on fail with the C<c> modifier, documented
in L<perlop> and L<perlreref>. Subsequent matches start where the last
successful match ended (the value of C<pos()>) even if a match on the
same string has failed in the meantime. In this case, the match after
the C<while()> loop starts at the C<a> (where the last match stopped),
and since it does not use any anchor it can skip over the C<a> to find
C<44>.

    $_ = "1122a44";
    while( m/\G(\d\d)/gc ) {
        print "Found $1\n";
    }

    print "Found $1 after while" if m/(\d\d)/g; # finds "44"

Typically you use the C<\G> anchor with the C<c> modifier
when you want to try a different match if one fails,
such as in a tokenizer. Jeffrey Friedl offers this example
which works in 5.004 or later.

    while (<>) {
        chomp;
        PARSER: {
            m/ \G( \d+\b    )/gcx   && do { print "number: $1\n"; redo; };
            m/ \G( \w+      )/gcx   && do { print "word: $1\n"; redo; };
            m/ \G( \s+      )/gcx   && do { print "space: $1\n"; redo; };
            m/ \G( [^\w\d]+ )/gcx   && do { print "other: $1\n"; redo; };
        }
    }

For each line, the C<PARSER> loop first tries to match a series
of digits followed by a word boundary. This match has to
start at the place the last match left off (or the beginning
of the string on the first match). Since C<m/ \G( \d+\b )/gcx>
uses the C<c> modifier, if the string does not match that
regular expression, perl does not reset pos() and the next
match starts at the same position to try a different
pattern.

=head2 Are Perl regexes DFAs or NFAs? Are they POSIX compliant?
X<DFA> X<NFA> X<POSIX>

While it's true that Perl's regular expressions resemble the DFAs
(deterministic finite automata) of the egrep(1) program, they are in
fact implemented as NFAs (non-deterministic finite automata) to allow
backtracking and backreferencing. And they aren't POSIX-style either,
because those guarantee worst-case behavior for all cases. (It seems
that some people prefer guarantees of consistency, even when what's
guaranteed is slowness.) See the book "Mastering Regular Expressions"
(from O'Reilly) by Jeffrey Friedl for all the details you could ever
hope to know on these matters (a full citation appears in
L<perlfaq2>).

=head2 What's wrong with using grep in a void context?
X<grep>

The problem is that grep builds a return list, regardless of the context.
This means you're making Perl go to the trouble of building a list that
you then just throw away. If the list is large, you waste both time and space.
If your intent is to iterate over the list, then use a for loop for this
purpose.

In perls older than 5.8.1, map suffers from this problem as well.
956But since 5.8.1, this has been fixed, and map is context aware - in void 957context, no lists are constructed. 958 959=head2 How can I match strings with multibyte characters? 960X<regex, and multibyte characters> X<regexp, and multibyte characters> 961X<regular expression, and multibyte characters> X<martian> X<encoding, Martian> 962 963Starting from Perl 5.6 Perl has had some level of multibyte character 964support. Perl 5.8 or later is recommended. Supported multibyte 965character repertoires include Unicode, and legacy encodings 966through the Encode module. See L<perluniintro>, L<perlunicode>, 967and L<Encode>. 968 969If you are stuck with older Perls, you can do Unicode with the 970L<Unicode::String> module, and character conversions using the 971L<Unicode::Map8> and L<Unicode::Map> modules. If you are using 972Japanese encodings, you might try using the jperl 5.005_03. 973 974Finally, the following set of approaches was offered by Jeffrey 975Friedl, whose article in issue #5 of The Perl Journal talks about 976this very matter. 977 978Let's suppose you have some weird Martian encoding where pairs of 979ASCII uppercase letters encode single Martian letters (i.e. the two 980bytes "CV" make a single Martian letter, as do the two bytes "SG", 981"VS", "XX", etc.). Other bytes represent single characters, just like 982ASCII. 983 984So, the string of Martian "I am CVSGXX!" uses 12 bytes to encode the 985nine characters 'I', ' ', 'a', 'm', ' ', 'CV', 'SG', 'XX', '!'. 986 987Now, say you want to search for the single character C</GX/>. Perl 988doesn't know about Martian, so it'll find the two bytes "GX" in the "I 989am CVSGXX!" string, even though that character isn't there: it just 990looks like it is because "SG" is next to "XX", but there's no real 991"GX". This is a big problem. 992 993Here are a few ways, all painful, to deal with it: 994 995 # Make sure adjacent "martian" bytes are no longer adjacent. 
996 $martian =~ s/([A-Z][A-Z])/ $1 /g; 997 998 print "found GX!\n" if $martian =~ /GX/; 999 1000Or like this: 1001 1002 my @chars = $martian =~ m/([A-Z][A-Z]|[^A-Z])/g; 1003 # above is conceptually similar to: my @chars = $text =~ m/(.)/g; 1004 # 1005 foreach my $char (@chars) { 1006 print "found GX!\n", last if $char eq 'GX'; 1007 } 1008 1009Or like this: 1010 1011 while ($martian =~ m/\G([A-Z][A-Z]|.)/gs) { # \G probably unneeded 1012 if ($1 eq 'GX') { 1013 print "found GX!\n"; 1014 last; 1015 } 1016 } 1017 1018Here's another, slightly less painful, way to do it from Benjamin 1019Goldberg, who uses a zero-width negative look-behind assertion. 1020 1021 print "found GX!\n" if $martian =~ m/ 1022 (?<![A-Z]) 1023 (?:[A-Z][A-Z])*? 1024 GX 1025 /x; 1026 1027This succeeds if the "martian" character GX is in the string, and fails 1028otherwise. If you don't like using (?<!), a zero-width negative 1029look-behind assertion, you can replace (?<![A-Z]) with (?:^|[^A-Z]). 1030 1031It does have the drawback of putting the wrong thing in $-[0] and $+[0], 1032but this usually can be worked around. 1033 1034=head2 How do I match a regular expression that's in a variable? 1035X<regex, in variable> X<eval> X<regex> X<quotemeta> X<\Q, regex> 1036X<\E, regex> X<qr//> 1037 1038(contributed by brian d foy) 1039 1040We don't have to hard-code patterns into the match operator (or 1041anything else that works with regular expressions). We can put the 1042pattern in a variable for later use. 1043 1044The match operator is a double quote context, so you can interpolate 1045your variable just like a double quoted string. In this case, you 1046read the regular expression as user input and store it in C<$regex>. 1047Once you have the pattern in C<$regex>, you use that variable in the 1048match operator. 1049 1050 chomp( my $regex = <STDIN> ); 1051 1052 if( $string =~ m/$regex/ ) { ... 
} 1053 1054Any regular expression special characters in C<$regex> are still 1055special, and the pattern still has to be valid or Perl will complain. 1056For instance, in this pattern there is an unpaired parenthesis. 1057 1058 my $regex = "Unmatched ( paren"; 1059 1060 "Two parens to bind them all" =~ m/$regex/; 1061 1062When Perl compiles the regular expression, it treats the parenthesis 1063as the start of a memory match. When it doesn't find the closing 1064parenthesis, it complains: 1065 1066 Unmatched ( in regex; marked by <-- HERE in m/Unmatched ( <-- HERE paren/ at script line 3. 1067 1068You can get around this in several ways depending on our situation. 1069First, if you don't want any of the characters in the string to be 1070special, you can escape them with C<quotemeta> before you use the string. 1071 1072 chomp( my $regex = <STDIN> ); 1073 $regex = quotemeta( $regex ); 1074 1075 if( $string =~ m/$regex/ ) { ... } 1076 1077You can also do this directly in the match operator using the C<\Q> 1078and C<\E> sequences. The C<\Q> tells Perl where to start escaping 1079special characters, and the C<\E> tells it where to stop (see L<perlop> 1080for more details). 1081 1082 chomp( my $regex = <STDIN> ); 1083 1084 if( $string =~ m/\Q$regex\E/ ) { ... } 1085 1086Alternately, you can use C<qr//>, the regular expression quote operator (see 1087L<perlop> for more details). It quotes and perhaps compiles the pattern, 1088and you can apply regular expression flags to the pattern. 1089 1090 chomp( my $input = <STDIN> ); 1091 1092 my $regex = qr/$input/is; 1093 1094 $string =~ m/$regex/ # same as m/$input/is; 1095 1096You might also want to trap any errors by wrapping an C<eval> block 1097around the whole thing. 1098 1099 chomp( my $input = <STDIN> ); 1100 1101 eval { 1102 if( $string =~ m/\Q$input\E/ ) { ... } 1103 }; 1104 warn $@ if $@; 1105 1106Or... 
1107 1108 my $regex = eval { qr/$input/is }; 1109 if( defined $regex ) { 1110 $string =~ m/$regex/; 1111 } 1112 else { 1113 warn $@; 1114 } 1115 1116=head1 AUTHOR AND COPYRIGHT 1117 1118Copyright (c) 1997-2010 Tom Christiansen, Nathan Torkington, and 1119other authors as noted. All rights reserved. 1120 1121This documentation is free; you can redistribute it and/or modify it 1122under the same terms as Perl itself. 1123 1124Irrespective of its distribution, all code examples in this file 1125are hereby placed into the public domain. You are permitted and 1126encouraged to use this code in your own programs for fun 1127or for profit as you see fit. A simple comment in the code giving 1128credit would be courteous but is not required. 1129