=head1 NAME

perlrequick - Perl regular expressions quick start

=head1 DESCRIPTION

This page covers the very basics of understanding, creating and
using regular expressions ('regexes') in Perl.


=head1 The Guide

This page assumes you already know what a "pattern" is and the basic
syntax for using one.  If you don't, see L<perlretut>.

=head2 Simple word matching

The simplest regex is simply a word, or more generally, a string of
characters.  A regex consisting of a word matches any string that
contains that word:

    "Hello World" =~ /World/;  # matches

In this statement, C<World> is a regex and the C<//> enclosing
C</World/> tells Perl to search a string for a match.  The operator
C<=~> associates the string with the regex match and produces a true
value if the regex matched, or false if the regex did not match.  In
our case, C<World> matches the second word in C<"Hello World">, so the
expression is true.  This idea has several variations.

Expressions like this are useful in conditionals:

    print "It matches\n" if "Hello World" =~ /World/;

The sense of the match can be reversed by using the C<!~> operator:

    print "It doesn't match\n" if "Hello World" !~ /World/;

The literal string in the regex can be replaced by a variable:

    $greeting = "World";
    print "It matches\n" if "Hello World" =~ /$greeting/;

If you're matching against C<$_>, the C<$_ =~> part can be omitted:

    $_ = "Hello World";
    print "It matches\n" if /World/;

Finally, the C<//> default delimiters for a match can be changed to
arbitrary delimiters by putting an C<'m'> out front:

    "Hello World" =~ m!World!;   # matches, delimited by '!'
    "Hello World" =~ m{World};   # matches, note the matching '{}'
    "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
                                 # '/' becomes an ordinary char

Regexes must match a part of the string I<exactly> in order for the
statement to be true:

    "Hello World" =~ /world/;  # doesn't match, case sensitive
    "Hello World" =~ /o W/;    # matches, ' ' is an ordinary char
    "Hello World" =~ /World /; # doesn't match, no ' ' at end

Perl will always match at the earliest possible point in the string:

    "Hello World" =~ /o/;       # matches 'o' in 'Hello'
    "That hat is red" =~ /hat/; # matches 'hat' in 'That'

Not all characters can be used 'as is' in a match.  Some characters,
called B<metacharacters>, are considered special, and reserved for use
in regex notation.  The metacharacters are

    {}[]()^$.|*+?\

A metacharacter can be matched literally by putting a backslash before
it:

    "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
    "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
    'C:\WIN32' =~ /C:\\WIN/;                # matches
    "/usr/bin/perl" =~ /\/usr\/bin\/perl/;  # matches

In the last regex, the forward slash C<'/'> is also backslashed,
because it is used to delimit the regex.

Most of the metacharacters aren't always special, and other characters
(such as the ones delimiting the pattern) become special under various
circumstances.  This can be confusing and lead to unexpected results.
L<S<C<use re 'strict'>>|re/'strict' mode> can notify you of potential
pitfalls.

Non-printable ASCII characters are represented by B<escape sequences>.
Common examples are C<\t> for a tab, C<\n> for a newline, and C<\r>
for a carriage return.
Arbitrary bytes are represented by octal
escape sequences, e.g., C<\033>, or hexadecimal escape sequences,
e.g., C<\x1B>:

    "1000\t2000" =~ m(0\t2);   # matches
    "cat" =~ /\143\x61\x74/;   # matches in ASCII, but
                               # a weird way to spell cat

Regexes are treated mostly as double-quoted strings, so variable
substitution works:

    $foo = 'house';
    'cathouse' =~ /cat$foo/;   # matches
    'housecat' =~ /${foo}cat/; # matches

With all of the regexes above, if the regex matched anywhere in the
string, it was considered a match.  To specify I<where> it should
match, we use the B<anchor> metacharacters C<^> and C<$>.  The
anchor C<^> means match at the beginning of the string and the anchor
C<$> means match at the end of the string, or before a newline at the
end of the string.  Some examples:

    "housekeeper" =~ /keeper/;         # matches
    "housekeeper" =~ /^keeper/;        # doesn't match
    "housekeeper" =~ /keeper$/;        # matches
    "housekeeper\n" =~ /keeper$/;      # matches
    "housekeeper" =~ /^housekeeper$/;  # matches

=head2 Using character classes

A B<character class> allows a set of possible characters, rather than
just a single character, to match at a particular point in a regex.
There are a number of different types of character classes, but usually
when people use this term, they are referring to the type described in
this section, technically called "bracketed character classes"
because they are denoted by brackets C<[...]>, with the set of
characters to be possibly matched inside.  We'll drop the "bracketed"
below to correspond with common usage.  Here are some examples of
(bracketed) character classes:

    /cat/;            # matches 'cat'
    /[bcr]at/;        # matches 'bat', 'cat', or 'rat'
    "abc" =~ /[cab]/; # matches 'a'

In the last statement, even though C<'c'> is the first character in
the class, the earliest point at which the regex can match is C<'a'>.
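
As a minimal sketch of this leftmost-match rule, the snippet below
records what the regex matched and where.  It relies on two built-in
variables that this page does not otherwise cover, C<$&> (the text
matched by the whole regex) and C<@-> (offsets of match starts); see
L<perlvar>.  The names C<$matched> and C<$offset> are illustrative only.

    if ("abc" =~ /[cab]/) {
        $matched = $&;     # text matched by the whole regex
        $offset  = $-[0];  # offset at which the match began
        print "matched '$matched' at offset $offset\n";
    }
    # prints "matched 'a' at offset 0"
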

    /[yY][eE][sS]/; # match 'yes' in a case-insensitive way
                    # 'yes', 'Yes', 'YES', etc.
    /yes/i;         # also match 'yes' in a case-insensitive way

The last example shows a match with an C<'i'> B<modifier>, which makes
the match case-insensitive.

Character classes also have ordinary and special characters, but the
sets of ordinary and special characters inside a character class are
different from those outside a character class.  The special
characters for a character class are C<-]\^$> and are matched using an
escape:

    /[\]c]def/; # matches ']def' or 'cdef'
    $x = 'bcr';
    /[$x]at/;   # matches 'bat', 'cat', or 'rat'
    /[\$x]at/;  # matches '$at' or 'xat'
    /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'

The special character C<'-'> acts as a range operator within character
classes, so that the unwieldy C<[0123456789]> and C<[abc...xyz]>
become the svelte C<[0-9]> and C<[a-z]>:

    /item[0-9]/;    # matches 'item0' or ... or 'item9'
    /[0-9a-fA-F]/;  # matches a hexadecimal digit

If C<'-'> is the first or last character in a character class, it is
treated as an ordinary character.

The special character C<^> in the first position of a character class
denotes a B<negated character class>, which matches any character but
those in the brackets.  Both C<[...]> and C<[^...]> must match a
character, or the match fails.  Then

    /[^a]at/;  # doesn't match 'aat' or 'at', but matches
               # all other 'bat', 'cat', '0at', '%at', etc.
    /[^0-9]/;  # matches a non-numeric character
    /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary

Perl has several abbreviations for common character classes.  (These
definitions are those that Perl uses in ASCII-safe mode with the C</a> modifier.
Otherwise they could match many more non-ASCII Unicode characters as
well.  See L<perlrecharclass/Backslash sequences> for details.)

=over 4

=item *

\d is a digit and represents

    [0-9]

=item *

\s is a whitespace character and represents

    [\ \t\r\n\f]

=item *

\w is a word character (alphanumeric or _) and represents

    [0-9a-zA-Z_]

=item *

\D is a negated \d; it represents any character but a digit

    [^0-9]

=item *

\S is a negated \s; it represents any non-whitespace character

    [^\s]

=item *

\W is a negated \w; it represents any non-word character

    [^\w]

=item *

The period '.' matches any character but "\n"

=back

The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside
of character classes.  Here are some in use:

    /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
    /[\d\s]/;         # matches any digit or whitespace character
    /\w\W\w/;         # matches a word char, followed by a
                      # non-word char, followed by a word char
    /..rt/;           # matches any two chars, followed by 'rt'
    /end\./;          # matches 'end.'
    /end[.]/;         # same thing, matches 'end.'

The S<B<word anchor> > C<\b> matches a boundary between a word
character and a non-word character C<\w\W> or C<\W\w>:

    $x = "Housecat catenates house and cat";
    $x =~ /\bcat/;    # matches cat in 'catenates'
    $x =~ /cat\b/;    # matches cat in 'Housecat'
    $x =~ /\bcat\b/;  # matches 'cat' at end of string

In the last example, the end of the string is considered a word
boundary.

For natural language processing (so that, for example, apostrophes are
included in words), use C<\b{wb}> instead:

    "don't" =~ / .+? \b{wb} /x;  # matches the whole string

=head2 Matching this or that

We can match different character strings with the B<alternation>
metacharacter C<'|'>.  To match C<dog> or C<cat>, we form the regex
C<dog|cat>.  As before, Perl will try to match the regex at the
earliest possible point in the string.
At each character position,
Perl will first try to match the first alternative, C<dog>.  If
C<dog> doesn't match, Perl will then try the next alternative, C<cat>.
If C<cat> doesn't match either, then the match fails and Perl moves to
the next position in the string.  Some examples:

    "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
    "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"

Even though C<dog> is the first alternative in the second regex,
C<cat> is able to match earlier in the string.

    "cats"          =~ /c|ca|cat|cats/; # matches "c"
    "cats"          =~ /cats|cat|ca|c/; # matches "cats"

At a given character position, the first alternative that allows the
regex match to succeed will be the one that matches.  Here, all the
alternatives match at the first string position, so the first matches.

=head2 Grouping things and hierarchical matching

The B<grouping> metacharacters C<()> allow a part of a regex to be
treated as a single unit.  Parts of a regex are grouped by enclosing
them in parentheses.  The regex C<house(cat|keeper)> means match
C<house> followed by either C<cat> or C<keeper>.  Some more examples
are

    /(a|b)b/;    # matches 'ab' or 'bb'
    /(^a|b)c/;   # matches 'ac' at start of string or 'bc' anywhere

    /house(cat|)/;  # matches either 'housecat' or 'house'
    /house(cat(s|)|)/;  # matches either 'housecats' or 'housecat' or
                        # 'house'.  Note groups can be nested.

    "20" =~ /(19|20|)\d\d/;  # matches the null alternative '()\d\d',
                             # because '20\d\d' can't match

=head2 Extracting matches

The grouping metacharacters C<()> also allow the extraction of the
parts of a string that matched.  For each grouping, the part that
matched inside goes into the special variables C<$1>, C<$2>, etc.
They can be used just as ordinary variables:

    # extract hours, minutes, seconds
    $time =~ /(\d\d):(\d\d):(\d\d)/;  # match hh:mm:ss format
    $hours = $1;
    $minutes = $2;
    $seconds = $3;

In list context, a match C</regex/> with groupings will return the
list of matched values C<($1,$2,...)>.  So we could rewrite it as

    ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);

If the groupings in a regex are nested, C<$1> gets the group with the
leftmost opening parenthesis, C<$2> the next opening parenthesis,
etc.  For example, here is a complex regex and the matching variables
indicated below it:

    /(ab(cd|ef)((gi)|j))/;
     1  2      34

Associated with the matching variables C<$1>, C<$2>, ... are
the B<backreferences> C<\g1>, C<\g2>, ...  Backreferences are
matching variables that can be used I<inside> a regex:

    /(\w\w\w)\s\g1/; # find sequences like 'the the' in string

C<$1>, C<$2>, ... should only be used outside of a regex, and C<\g1>,
C<\g2>, ... only inside a regex.

=head2 Matching repetitions

The B<quantifier> metacharacters C<?>, C<*>, C<+>, and C<{}> allow us
to determine the number of repeats of a portion of a regex we
consider to be a match.  Quantifiers are put immediately after the
character, character class, or grouping that we want to specify.  They
have the following meanings:

=over 4

=item *

C<a?> = match 'a' 1 or 0 times

=item *

C<a*> = match 'a' 0 or more times, i.e., any number of times

=item *

C<a+> = match 'a' 1 or more times, i.e., at least once

=item *

C<a{n,m}> = match at least C<n> times, but not more than C<m>
times.

=item *

C<a{n,}> = match C<n> or more times

=item *

C<a{n}> = match exactly C<n> times

=back

Here are some examples:

    /[a-z]+\s+\d*/;  # match a lowercase word, at least some space, and
                     # any number of digits
    /(\w+)\s+\g1/;   # match doubled words of arbitrary length
    $year =~ /^\d{2,4}$/;  # make sure year is at least 2 but not more
                           # than 4 digits
    $year =~ /^\d{4}$|^\d{2}$/; # better match; throw out 3 digit dates

These quantifiers will try to match as much of the string as possible,
while still allowing the regex to match.  So we have

    $x = 'the cat in the hat';
    $x =~ /^(.*)(at)(.*)$/; # matches,
                            # $1 = 'the cat in the h'
                            # $2 = 'at'
                            # $3 = ''   (0 matches)

The first quantifier C<.*> grabs as much of the string as possible
while still having the regex match.  The second quantifier C<.*> has
no string left to it, so it matches 0 times.

=head2 More matching

There are a few more things you might want to know about matching
operators.
The global modifier C</g> allows the matching operator to match
within a string as many times as possible.  In scalar context,
successive matches against a string will have C</g> jump from match
to match, keeping track of position in the string as it goes along.
You can get or set the position with the C<pos()> function.
For example,

    $x = "cat dog house"; # 3 words
    while ($x =~ /(\w+)/g) {
        print "Word is $1, ends at position ", pos $x, "\n";
    }

prints

    Word is cat, ends at position 3
    Word is dog, ends at position 7
    Word is house, ends at position 13

A failed match or changing the target string resets the position.  If
you don't want the position reset after failure to match, add the
C</c> modifier, as in C</regex/gc>.
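
A minimal sketch of that reset behavior, using only C<pos()> and the
C</g> and C</c> modifiers described above (the variable names C<$ok>,
C<$p1>, C<$p2> and C<$p3> are illustrative only):

    $x  = "cat dog";
    $ok = $x =~ /\w+/g;   # matches 'cat'
    $p1 = pos($x);        # 3
    $ok = $x =~ /\d+/gc;  # fails, but /c keeps the position
    $p2 = pos($x);        # still 3
    $ok = $x =~ /\d+/g;   # fails, and the position is reset
    $p3 = pos($x);        # undef
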

In list context, C</g> returns a list of matched groupings, or if
there are no groupings, a list of matches to the whole regex.  So

    @words = ($x =~ /(\w+)/g);  # matches,
                                # $words[0] = 'cat'
                                # $words[1] = 'dog'
                                # $words[2] = 'house'

=head2 Search and replace

Search and replace is performed using C<s/regex/replacement/modifiers>.
The C<replacement> is a Perl double-quoted string that replaces in the
string whatever is matched with the C<regex>.  The operator C<=~> is
also used here to associate a string with C<s///>.  If matching
against C<$_>, the S<C<$_ =~>> can be dropped.  If there is a match,
C<s///> returns the number of substitutions made; otherwise it returns
false.  Here are a few examples:

    $x = "Time to feed the cat!";
    $x =~ s/cat/hacker/;   # $x contains "Time to feed the hacker!"
    $y = "'quoted words'";
    $y =~ s/^'(.*)'$/$1/;  # strip single quotes,
                           # $y contains "quoted words"

With the C<s///> operator, the matched variables C<$1>, C<$2>, etc.
are immediately available for use in the replacement expression.  With
the global modifier, C<s///g> will search and replace all occurrences
of the regex in the string:

    $x = "I batted 4 for 4";
    $x =~ s/4/four/;   # $x contains "I batted four for 4"
    $x = "I batted 4 for 4";
    $x =~ s/4/four/g;  # $x contains "I batted four for four"

The non-destructive modifier C<s///r> causes the result of the substitution
to be returned instead of modifying C<$_> (or whatever variable the
substitute was bound to with C<=~>):

    $x = "I like dogs.";
    $y = $x =~ s/dogs/cats/r;
    print "$x $y\n"; # prints "I like dogs. I like cats."

    $x = "Cats are great.";
    print $x =~ s/Cats/Dogs/r =~ s/Dogs/Frogs/r =~
        s/Frogs/Hedgehogs/r, "\n";
    # prints "Hedgehogs are great."

    @foo = map { s/[a-z]/X/r } qw(a b c 1 2 3);
    # @foo is now qw(X X X 1 2 3)

The evaluation modifier C<s///e> wraps an C<eval{...}> around the
replacement string and the evaluated result is substituted for the
matched substring.  Some examples:

    # reverse all the words in a string
    $x = "the cat in the hat";
    $x =~ s/(\w+)/reverse $1/ge;   # $x contains "eht tac ni eht tah"

    # convert percentage to decimal
    $x = "A 39% hit rate";
    $x =~ s!(\d+)%!$1/100!e;       # $x contains "A 0.39 hit rate"

The last example shows that C<s///> can use other delimiters, such as
C<s!!!> and C<s{}{}>, and even C<s{}//>.  If single quotes are used,
as in C<s'''>, then the regex and replacement are treated as
single-quoted strings.

=head2 The split operator

C<split /regex/, string> splits C<string> into a list of substrings
and returns that list.  The regex determines the character sequence
that C<string> is split with respect to.  For example, to split a
string into words, use

    $x = "Calvin and Hobbes";
    @word = split /\s+/, $x;  # $word[0] = 'Calvin'
                              # $word[1] = 'and'
                              # $word[2] = 'Hobbes'

To extract a comma-delimited list of numbers, use

    $x = "1.618,2.718,   3.142";
    @const = split /,\s*/, $x;  # $const[0] = '1.618'
                                # $const[1] = '2.718'
                                # $const[2] = '3.142'

If the empty regex C<//> is used, the string is split into individual
characters.  If the regex has groupings, then the list produced contains
the matched substrings from the groupings as well:

    $x = "/usr/bin";
    @parts = split m!(/)!, $x;  # $parts[0] = ''
                                # $parts[1] = '/'
                                # $parts[2] = 'usr'
                                # $parts[3] = '/'
                                # $parts[4] = 'bin'

Since the first character of C<$x> matched the regex, C<split> prepended
an empty initial element to the list.
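
Returning to the empty regex C<//> mentioned above, here is a minimal
sketch of splitting a string into individual characters (the array
name C<@chars> is illustrative only):

    $x = "cat";
    @chars = split //, $x;  # $chars[0] = 'c'
                            # $chars[1] = 'a'
                            # $chars[2] = 't'
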

=head2 C<use re 'strict'>

New in v5.22, this applies stricter rules than otherwise when compiling
regular expression patterns.  It can find things that, while legal, may
not be what you intended.

See L<'strict' in re|re/'strict' mode>.

=head1 BUGS

None.

=head1 SEE ALSO

This is just a quick start guide.  For a more in-depth tutorial on
regexes, see L<perlretut> and for the reference page, see L<perlre>.

=head1 AUTHOR AND COPYRIGHT

Copyright (c) 2000 Mark Kvale
All rights reserved.

This document may be distributed under the same terms as Perl itself.

=head2 Acknowledgments

The author would like to thank Mark-Jason Dominus, Tom Christiansen,
Ilya Zakharevich, Brad Hughes, and Mike Giroux for all their helpful
comments.

=cut