=head1 NAME

perlrequick - Perl regular expressions quick start

=head1 DESCRIPTION

This page covers the very basics of understanding, creating and
using regular expressions ('regexes') in Perl.

=head1 The Guide

This page assumes you already know the basics, such as what a "pattern"
is and the basic syntax for using one. If you don't, see L<perlretut>.

=head2 Simple word matching

The simplest regex is simply a word, or more generally, a string of
characters. A regex consisting of a word matches any string that
contains that word:

    "Hello World" =~ /World/;  # matches

In this statement, C<World> is a regex and the C<//> enclosing
C</World/> tells Perl to search a string for a match. The operator
C<=~> associates the string with the regex match and produces a true
value if the regex matched, or false if the regex did not match. In
our case, C<World> matches the second word in C<"Hello World">, so the
expression is true. This idea has several variations.

Expressions like this are useful in conditionals:

    print "It matches\n" if "Hello World" =~ /World/;

The sense of the match can be reversed by using the C<!~> operator:

    print "It doesn't match\n" if "Hello World" !~ /World/;

The literal string in the regex can be replaced by a variable:

    $greeting = "World";
    print "It matches\n" if "Hello World" =~ /$greeting/;

If you're matching against C<$_>, the C<$_ =~> part can be omitted:

    $_ = "Hello World";
    print "It matches\n" if /World/;

Finally, the C<//> default delimiters for a match can be changed to
arbitrary delimiters by putting an C<'m'> out front:

    "Hello World" =~ m!World!;     # matches, delimited by '!'
    "Hello World" =~ m{World};     # matches, note the matching '{}'
    "/usr/bin/perl" =~ m"/perl";   # matches after '/usr/bin',
                                   # '/' becomes an ordinary char

Regexes must match a part of the string I<exactly> in order for the
statement to be true:

    "Hello World" =~ /world/;   # doesn't match, case sensitive
    "Hello World" =~ /o W/;     # matches, ' ' is an ordinary char
    "Hello World" =~ /World /;  # doesn't match, no ' ' at end

Perl will always match at the earliest possible point in the string:

    "Hello World" =~ /o/;        # matches 'o' in 'Hello'
    "That hat is red" =~ /hat/;  # matches 'hat' in 'That'

Not all characters can be used 'as is' in a match. Some characters,
called B<metacharacters>, are considered special, and reserved for use
in regex notation. The metacharacters are

    {}[]()^$.|*+?\

A metacharacter can be matched literally by putting a backslash before
it:

    "2+2=4" =~ /2+2/;                       # doesn't match, + is a metacharacter
    "2+2=4" =~ /2\+2/;                      # matches, \+ is treated like an ordinary +
    'C:\WIN32' =~ /C:\\WIN/;                # matches
    "/usr/bin/perl" =~ /\/usr\/bin\/perl/;  # matches

In the last regex, the forward slash C<'/'> is also backslashed,
because it is used to delimit the regex.

Most of the metacharacters aren't always special, and other characters
(such as the ones delimiting the pattern) become special under various
circumstances. This can be confusing and lead to unexpected results.
L<S<C<use re 'strict'>>|re/'strict' mode> can notify you of potential
pitfalls.

Non-printable ASCII characters are represented by B<escape sequences>.
Common examples are C<\t> for a tab, C<\n> for a newline, and C<\r>
for a carriage return.
Arbitrary bytes are represented by octal
escape sequences, e.g., C<\033>, or hexadecimal escape sequences,
e.g., C<\x1B>:

    "1000\t2000" =~ m(0\t2);   # matches
    "cat" =~ /\143\x61\x74/;   # matches in ASCII, but
                               # a weird way to spell cat

Regexes are treated mostly as double-quoted strings, so variable
substitution works:

    $foo = 'house';
    'cathouse' =~ /cat$foo/;    # matches
    'housecat' =~ /${foo}cat/;  # matches

With all of the regexes above, if the regex matched anywhere in the
string, it was considered a match. To specify I<where> it should
match, we would use the B<anchor> metacharacters C<^> and C<$>. The
anchor C<^> means match at the beginning of the string and the anchor
C<$> means match at the end of the string, or before a newline at the
end of the string. Some examples:

    "housekeeper" =~ /keeper/;         # matches
    "housekeeper" =~ /^keeper/;        # doesn't match
    "housekeeper" =~ /keeper$/;        # matches
    "housekeeper\n" =~ /keeper$/;      # matches
    "housekeeper" =~ /^housekeeper$/;  # matches

=head2 Using character classes

A B<character class> allows a set of possible characters, rather than
just a single character, to match at a particular point in a regex.
There are a number of different types of character classes, but usually
when people use this term, they are referring to the type described in
this section, which is technically called a "bracketed character
class", because it is denoted by brackets C<[...]>, with the set of
characters to be possibly matched inside. But we'll drop the "bracketed"
below to correspond with common usage. Here are some examples of
(bracketed) character classes:

    /cat/;             # matches 'cat'
    /[bcr]at/;         # matches 'bat', 'cat', or 'rat'
    "abc" =~ /[cab]/;  # matches 'a'

In the last statement, even though C<'c'> is the first character in
the class, the earliest point at which the regex can match is C<'a'>.
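
One way to check where a match actually starts is to look at the
built-in variables C<$&> (the matched text) and C<$-[0]> (the offset
at which the last successful match began). The following is only a
minimal sketch; the variable names C<$string>, C<$text> and C<$offset>
are illustrative and do not appear elsewhere in this page.

```perl
use strict;
use warnings;

my $string = "abc";
my ($text, $offset);

# The class lists 'c' first, but the regex engine tries string
# positions from left to right, so it succeeds on 'a' at offset 0.
if ($string =~ /[cab]/) {
    ($text, $offset) = ($&, $-[0]);
    print "matched '$text' at offset $offset\n";
}

# Anchors restrict where a match may begin or end.
my $starts = "housekeeper"   =~ /^house/  ? 1 : 0;  # at the start
my $ends   = "housekeeper\n" =~ /keeper$/ ? 1 : 0;  # before final newline
print "starts=$starts ends=$ends\n";
```

The order of characters inside the brackets has no influence on the
result; only the position in the string does.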

    /[yY][eE][sS]/;  # match 'yes' in a case-insensitive way
                     # 'yes', 'Yes', 'YES', etc.
    /yes/i;          # also match 'yes' in a case-insensitive way

The last example shows a match with an C<'i'> B<modifier>, which makes
the match case-insensitive.

Character classes also have ordinary and special characters, but the
sets of ordinary and special characters inside a character class are
different from those outside a character class. The special
characters for a character class are C<-]\^$> and are matched using an
escape:

    /[\]c]def/;  # matches ']def' or 'cdef'
    $x = 'bcr';
    /[$x]at/;    # matches 'bat', 'cat', or 'rat'
    /[\$x]at/;   # matches '$at' or 'xat'
    /[\\$x]at/;  # matches '\at', 'bat', 'cat', or 'rat'

The special character C<'-'> acts as a range operator within character
classes, so that the unwieldy C<[0123456789]> and C<[abc...xyz]>
become the svelte C<[0-9]> and C<[a-z]>:

    /item[0-9]/;    # matches 'item0' or ... or 'item9'
    /[0-9a-fA-F]/;  # matches a hexadecimal digit

If C<'-'> is the first or last character in a character class, it is
treated as an ordinary character.

The special character C<^> in the first position of a character class
denotes a B<negated character class>, which matches any character but
those in the brackets. Both C<[...]> and C<[^...]> must match a
character, or the match fails. Then

    /[^a]at/;  # doesn't match 'aat' or 'at', but matches
               # all other 'bat', 'cat', '0at', '%at', etc.
    /[^0-9]/;  # matches a non-numeric character
    /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary

Perl has several abbreviations for common character classes. (These
definitions are those that Perl uses in ASCII-safe mode with the C</a> modifier.
Otherwise they could match many more non-ASCII Unicode characters as
well. See L<perlrecharclass/Backslash sequences> for details.)

=over 4

=item *

\d is a digit and represents

    [0-9]

=item *

\s is a whitespace character and represents

    [\ \t\r\n\f]

=item *

\w is a word character (alphanumeric or _) and represents

    [0-9a-zA-Z_]

=item *

\D is a negated \d; it represents any character but a digit

    [^0-9]

=item *

\S is a negated \s; it represents any non-whitespace character

    [^\s]

=item *

\W is a negated \w; it represents any non-word character

    [^\w]

=item *

The period '.' matches any character but "\n"

=back

The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside
of character classes. Here are some in use:

    /\d\d:\d\d:\d\d/;  # matches a hh:mm:ss time format
    /[\d\s]/;          # matches any digit or whitespace character
    /\w\W\w/;          # matches a word char, followed by a
                       # non-word char, followed by a word char
    /..rt/;            # matches any two chars, followed by 'rt'
    /end\./;           # matches 'end.'
    /end[.]/;          # same thing, matches 'end.'

The B<word anchor> C<\b> matches a boundary between a word
character and a non-word character C<\w\W> or C<\W\w>:

    $x = "Housecat catenates house and cat";
    $x =~ /\bcat/;    # matches cat in 'catenates'
    $x =~ /cat\b/;    # matches cat in 'Housecat'
    $x =~ /\bcat\b/;  # matches 'cat' at end of string

In the last example, the end of the string is considered a word
boundary.

For natural language processing (so that, for example, apostrophes are
included in words), use C<\b{wb}> instead:

    "don't" =~ / .+? \b{wb} /x;  # matches the whole string

=head2 Matching this or that

We can match different character strings with the B<alternation>
metacharacter C<'|'>. To match C<dog> or C<cat>, we form the regex
C<dog|cat>. As before, Perl will try to match the regex at the
earliest possible point in the string.
At each character position,
Perl will first try to match the first alternative, C<dog>. If
C<dog> doesn't match, Perl will then try the next alternative, C<cat>.
If C<cat> doesn't match either, then the match fails and Perl moves to
the next position in the string. Some examples:

    "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
    "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"

Even though C<dog> is the first alternative in the second regex,
C<cat> is able to match earlier in the string.

    "cats" =~ /c|ca|cat|cats/;  # matches "c"
    "cats" =~ /cats|cat|ca|c/;  # matches "cats"

At a given character position, the first alternative that allows the
regex match to succeed will be the one that matches. Here, all the
alternatives match at the first string position, so the first matches.

=head2 Grouping things and hierarchical matching

The B<grouping> metacharacters C<()> allow a part of a regex to be
treated as a single unit. Parts of a regex are grouped by enclosing
them in parentheses. The regex C<house(cat|keeper)> means match
C<house> followed by either C<cat> or C<keeper>. Some more examples
are

    /(a|b)b/;   # matches 'ab' or 'bb'
    /(^a|b)c/;  # matches 'ac' at start of string or 'bc' anywhere

    /house(cat|)/;      # matches either 'housecat' or 'house'
    /house(cat(s|)|)/;  # matches either 'housecats' or 'housecat' or
                        # 'house'. Note groups can be nested.

    "20" =~ /(19|20|)\d\d/;  # matches the null alternative '()\d\d',
                             # because '20\d\d' can't match

=head2 Extracting matches

The grouping metacharacters C<()> also allow the extraction of the
parts of a string that matched. For each grouping, the part that
matched inside goes into the special variables C<$1>, C<$2>, etc.
They can be used just as ordinary variables:

    # extract hours, minutes, seconds
    $time =~ /(\d\d):(\d\d):(\d\d)/;  # match hh:mm:ss format
    $hours = $1;
    $minutes = $2;
    $seconds = $3;

In list context, a match C</regex/> with groupings will return the
list of matched values C<($1,$2,...)>. So we could rewrite it as

    ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);

If the groupings in a regex are nested, C<$1> gets the group with the
leftmost opening parenthesis, C<$2> the next opening parenthesis,
etc. For example, here is a complex regex and the matching variables
indicated below it:

    /(ab(cd|ef)((gi)|j))/;
     1  2      34

Associated with the matching variables C<$1>, C<$2>, ... are
the B<backreferences> C<\g1>, C<\g2>, ... Backreferences are
matching variables that can be used I<inside> a regex:

    /(\w\w\w)\s\g1/;  # find sequences like 'the the' in string

C<$1>, C<$2>, ... should only be used outside of a regex, and C<\g1>,
C<\g2>, ... only inside a regex.

=head2 Matching repetitions

The B<quantifier> metacharacters C<?>, C<*>, C<+>, and C<{}> allow us
to determine the number of repeats of a portion of a regex we
consider to be a match. Quantifiers are put immediately after the
character, character class, or grouping that we want to specify. They
have the following meanings:

=over 4

=item *

C<a?> = match 'a' 1 or 0 times

=item *

C<a*> = match 'a' 0 or more times, i.e., any number of times

=item *

C<a+> = match 'a' 1 or more times, i.e., at least once

=item *

C<a{n,m}> = match at least C<n> times, but not more than C<m>
times.

=item *

C<a{n,}> = match C<n> or more times

=item *

C<a{,n}> = match C<n> times or fewer (Added in v5.34)

=item *

C<a{n}> = match exactly C<n> times

=back

Here are some examples:

    /[a-z]+\s+\d*/;  # match a lowercase word, at least some space, and
                     # any number of digits
    /(\w+)\s+\g1/;   # match doubled words of arbitrary length
    $year =~ /^\d{2,4}$/;  # make sure year is at least 2 but not more
                           # than 4 digits
    $year =~ /^\d{ 4 }$|^\d{2}$/;  # better match; throw out 3 digit dates

These quantifiers will try to match as much of the string as possible,
while still allowing the regex to match. So we have

    $x = 'the cat in the hat';
    $x =~ /^(.*)(at)(.*)$/;  # matches,
                             # $1 = 'the cat in the h'
                             # $2 = 'at'
                             # $3 = ''   (0 matches)

The first quantifier C<.*> grabs as much of the string as possible
while still having the regex match. The second quantifier C<.*> has
no string left to it, so it matches 0 times.

=head2 More matching

There are a few more things you might want to know about matching
operators.
The global modifier C</g> allows the matching operator to match
within a string as many times as possible. In scalar context,
successive matches against a string will have C</g> jump from match
to match, keeping track of position in the string as it goes along.
You can get or set the position with the C<pos()> function.
For example,

    $x = "cat dog house";  # 3 words
    while ($x =~ /(\w+)/g) {
        print "Word is $1, ends at position ", pos $x, "\n";
    }

prints

    Word is cat, ends at position 3
    Word is dog, ends at position 7
    Word is house, ends at position 13

A failed match or changing the target string resets the position. If
you don't want the position reset after failure to match, add the
C</c> modifier, as in C</regex/gc>.
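
The effect of a failed match on the position, with and without C</c>,
can be observed directly with C<pos()>. This is a minimal sketch rather
than a recommended idiom; the variable names (C<$after_first>,
C<$after_failure>, C<$after_failure_c>, C<$word>) are made up for
illustration.

```perl
use strict;
use warnings;

my $x = "cat dog house";

# Scalar context: a successful /g match advances pos($x).
my $ok = $x =~ /\w+/g;           # matches 'cat'
my $after_first = pos $x;        # 3

# Without /c, a failed /g match resets the position to undef.
$ok = $x =~ /\d+/g;              # no digits, so this fails
my $after_failure = pos $x;

# With /c, a failed match leaves the position untouched.
$ok = $x =~ /\w+/g;              # matches 'cat' again, from the start
$ok = $x =~ /\d+/gc;             # fails, position is kept
my $after_failure_c = pos $x;    # still 3

# pos() is assignable; the next /g match starts from that offset.
pos($x) = 8;
$ok = $x =~ /(\w+)/g;
my $word = $1;                   # 'house'

print "after first match: $after_first\n";
print "after failure:     ", $after_failure // 'undef', "\n";
print "after failure /gc: $after_failure_c\n";
print "word from pos 8:   $word\n";
```

Each match is assigned to a scalar so that it is evaluated in scalar
context, which is where C</g> steps through the string one match at a
time.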

In list context, C</g> returns a list of matched groupings, or if
there are no groupings, a list of matches to the whole regex. So

    @words = ($x =~ /(\w+)/g);  # matches,
                                # $words[0] = 'cat'
                                # $words[1] = 'dog'
                                # $words[2] = 'house'

=head2 Search and replace

Search and replace is performed using C<s/regex/replacement/modifiers>.
The C<replacement> is a Perl double-quoted string that replaces in the
string whatever is matched with the C<regex>. The operator C<=~> is
also used here to associate a string with C<s///>. If matching
against C<$_>, the S<C<$_ =~>> can be dropped. If there is a match,
C<s///> returns the number of substitutions made; otherwise it returns
false. Here are a few examples:

    $x = "Time to feed the cat!";
    $x =~ s/cat/hacker/;   # $x contains "Time to feed the hacker!"
    $y = "'quoted words'";
    $y =~ s/^'(.*)'$/$1/;  # strip single quotes,
                           # $y contains "quoted words"

With the C<s///> operator, the matched variables C<$1>, C<$2>, etc.
are immediately available for use in the replacement expression. With
the global modifier, C<s///g> will search and replace all occurrences
of the regex in the string:

    $x = "I batted 4 for 4";
    $x =~ s/4/four/;   # $x contains "I batted four for 4"
    $x = "I batted 4 for 4";
    $x =~ s/4/four/g;  # $x contains "I batted four for four"

The non-destructive modifier C<s///r> causes the result of the substitution
to be returned instead of modifying C<$_> (or whatever variable the
substitute was bound to with C<=~>):

    $x = "I like dogs.";
    $y = $x =~ s/dogs/cats/r;
    print "$x $y\n";  # prints "I like dogs. I like cats."

    $x = "Cats are great.";
    print $x =~ s/Cats/Dogs/r =~ s/Dogs/Frogs/r =~
        s/Frogs/Hedgehogs/r, "\n";
    # prints "Hedgehogs are great."

    @foo = map { s/[a-z]/X/r } qw(a b c 1 2 3);
    # @foo is now qw(X X X 1 2 3)

The evaluation modifier C<s///e> wraps an C<eval{...}> around the
replacement string and the evaluated result is substituted for the
matched substring. Some examples:

    # reverse all the words in a string
    $x = "the cat in the hat";
    $x =~ s/(\w+)/reverse $1/ge;  # $x contains "eht tac ni eht tah"

    # convert percentage to decimal
    $x = "A 39% hit rate";
    $x =~ s!(\d+)%!$1/100!e;      # $x contains "A 0.39 hit rate"

The last example shows that C<s///> can use other delimiters, such as
C<s!!!> and C<s{}{}>, and even C<s{}//>. If single quotes are used
C<s'''>, then the regex and replacement are treated as single-quoted
strings.

=head2 The split operator

C<split /regex/, string> splits C<string> into a list of substrings
and returns that list. The regex determines the character sequence
that C<string> is split with respect to. For example, to split a
string into words, use

    $x = "Calvin and Hobbes";
    @word = split /\s+/, $x;  # $word[0] = 'Calvin'
                              # $word[1] = 'and'
                              # $word[2] = 'Hobbes'

To extract a comma-delimited list of numbers, use

    $x = "1.618,2.718,   3.142";
    @const = split /,\s*/, $x;  # $const[0] = '1.618'
                                # $const[1] = '2.718'
                                # $const[2] = '3.142'

If the empty regex C<//> is used, the string is split into individual
characters. If the regex has groupings, then the list produced contains
the matched substrings from the groupings as well:

    $x = "/usr/bin";
    @parts = split m!(/)!, $x;  # $parts[0] = ''
                                # $parts[1] = '/'
                                # $parts[2] = 'usr'
                                # $parts[3] = '/'
                                # $parts[4] = 'bin'

Since the first character of $x matched the regex, C<split> prepended
an empty initial element to the list.
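
The three behaviours described in this section (an empty regex, a
plain separator, and a separator with a grouping) can be compared side
by side. This is a small sketch only; the array names C<@chars>,
C<@plain> and C<@kept> are illustrative.

```perl
use strict;
use warnings;

# An empty regex splits the string into individual characters.
my @chars = split //, "abc";

# Without a grouping, the separators are discarded; the leading
# separator still produces an empty first element.
my @plain = split m!/!, "/usr/bin";

# With a grouping, each matched separator is kept in the list.
my @kept = split m!(/)!, "/usr/bin";

print join('|', @chars), "\n";   # a|b|c
print join('|', @plain), "\n";   # |usr|bin
print join('|', @kept),  "\n";   # |/|usr|/|bin
```

Joining with C<'|'> makes the empty leading element visible in the
output.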

=head2 C<use re 'strict'>

New in v5.22, this applies stricter rules than otherwise when compiling
regular expression patterns. It can find things that, while legal, may
not be what you intended.

See L<'strict' in re|re/'strict' mode>.

=head1 BUGS

None.

=head1 SEE ALSO

This is just a quick start guide. For a more in-depth tutorial on
regexes, see L<perlretut> and for the reference page, see L<perlre>.

=head1 AUTHOR AND COPYRIGHT

Copyright (c) 2000 Mark Kvale
All rights reserved.

This document may be distributed under the same terms as Perl itself.

=head2 Acknowledgments

The author would like to thank Mark-Jason Dominus, Tom Christiansen,
Ilya Zakharevich, Brad Hughes, and Mike Giroux for all their helpful
comments.

=cut