=head1 NAME

perlrequick - Perl regular expressions quick start

=head1 DESCRIPTION

This page covers the very basics of understanding, creating and
using regular expressions ('regexes') in Perl.

=head1 The Guide

=head2 Simple word matching

The simplest regex is simply a word, or more generally, a string of
characters.  A regex consisting of a word matches any string that
contains that word:

    "Hello World" =~ /World/;  # matches

In this statement, C<World> is a regex and the C<//> enclosing
C</World/> tells Perl to search a string for a match.  The operator
C<=~> associates the string with the regex match and produces a true
value if the regex matched, or false if the regex did not match.  In
our case, C<World> matches the second word in C<"Hello World">, so the
expression is true.  This idea has several variations.

Expressions like this are useful in conditionals:

    print "It matches\n" if "Hello World" =~ /World/;

The sense of the match can be reversed by using the C<!~> operator:

    print "It doesn't match\n" if "Hello World" !~ /World/;

The literal string in the regex can be replaced by a variable:

    $greeting = "World";
    print "It matches\n" if "Hello World" =~ /$greeting/;

If you're matching against C<$_>, the C<$_ =~> part can be omitted:

    $_ = "Hello World";
    print "It matches\n" if /World/;

Finally, the C<//> default delimiters for a match can be changed to
arbitrary delimiters by putting an C<'m'> out front:

    "Hello World" =~ m!World!;   # matches, delimited by '!'
    "Hello World" =~ m{World};   # matches, note the matching '{}'
    "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
                                 # '/' becomes an ordinary char

Regexes must match a part of the string I<exactly> in order for the
statement to be true:

    "Hello World" =~ /world/;  # doesn't match, case sensitive
    "Hello World" =~ /o W/;    # matches, ' ' is an ordinary char
    "Hello World" =~ /World /; # doesn't match, no ' ' at end

Perl will always match at the earliest possible point in the string:

    "Hello World" =~ /o/;       # matches 'o' in 'Hello'
    "That hat is red" =~ /hat/; # matches 'hat' in 'That'

Not all characters can be used 'as is' in a match.  Some characters,
called B<metacharacters>, are reserved for use in regex notation.
The metacharacters are

    {}[]()^$.|*+?\

A metacharacter can be matched by putting a backslash before it:

    "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
    "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
    'C:\WIN32' =~ /C:\\WIN/;                # matches
    "/usr/bin/perl" =~ /\/usr\/bin\/perl/;  # matches

In the last regex, the forward slash C<'/'> is also backslashed,
because it is used to delimit the regex.

Non-printable ASCII characters are represented by B<escape sequences>.
Common examples are C<\t> for a tab, C<\n> for a newline, and C<\r>
for a carriage return.  Arbitrary bytes are represented by octal
escape sequences, e.g., C<\033>, or hexadecimal escape sequences,
e.g., C<\x1B>:

    "1000\t2000" =~ m(0\t2);   # matches
    "cat" =~ /\143\x61\x74/;   # matches in ASCII, but
                               # a weird way to spell cat

Regexes are treated mostly as double-quoted strings, so variable
substitution works:

    $foo = 'house';
    'cathouse' =~ /cat$foo/;   # matches
    'housecat' =~ /${foo}cat/; # matches

With all of the regexes above, if the regex matched anywhere in the
string, it was considered a match.
To specify I<where> it should
match, we would use the B<anchor> metacharacters C<^> and C<$>.  The
anchor C<^> means match at the beginning of the string and the anchor
C<$> means match at the end of the string, or before a newline at the
end of the string.  Some examples:

    "housekeeper" =~ /keeper/;         # matches
    "housekeeper" =~ /^keeper/;        # doesn't match
    "housekeeper" =~ /keeper$/;        # matches
    "housekeeper\n" =~ /keeper$/;      # matches
    "housekeeper" =~ /^housekeeper$/;  # matches

=head2 Using character classes

A B<character class> allows a set of possible characters, rather than
just a single character, to match at a particular point in a regex.
Character classes are denoted by brackets C<[...]>, with the set of
characters to be possibly matched inside.  Here are some examples:

    /cat/;            # matches 'cat'
    /[bcr]at/;        # matches 'bat', 'cat', or 'rat'
    "abc" =~ /[cab]/; # matches 'a'

In the last statement, even though C<'c'> is the first character in
the class, the earliest point at which the regex can match is C<'a'>.

    /[yY][eE][sS]/; # match 'yes' in a case-insensitive way
                    # 'yes', 'Yes', 'YES', etc.
    /yes/i;         # also match 'yes' in a case-insensitive way

The last example shows a match with an C<'i'> B<modifier>, which makes
the match case-insensitive.

Character classes also have ordinary and special characters, but the
sets of ordinary and special characters inside a character class are
different than those outside a character class.
The special
characters for a character class are C<-]\^$> and are matched using an
escape:

    /[\]c]def/; # matches ']def' or 'cdef'
    $x = 'bcr';
    /[$x]at/;   # matches 'bat', 'cat', or 'rat'
    /[\$x]at/;  # matches '$at' or 'xat'
    /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'

The special character C<'-'> acts as a range operator within character
classes, so that the unwieldy C<[0123456789]> and C<[abc...xyz]>
become the svelte C<[0-9]> and C<[a-z]>:

    /item[0-9]/;    # matches 'item0' or ... or 'item9'
    /[0-9a-fA-F]/;  # matches a hexadecimal digit

If C<'-'> is the first or last character in a character class, it is
treated as an ordinary character.

The special character C<^> in the first position of a character class
denotes a B<negated character class>, which matches any character but
those in the brackets.  Both C<[...]> and C<[^...]> must match a
character, or the match fails.  For example:

    /[^a]at/;  # doesn't match 'aat' or 'at', but matches
               # all other 'bat', 'cat', '0at', '%at', etc.
    /[^0-9]/;  # matches a non-numeric character
    /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary

Perl has several abbreviations for common character classes. (These
definitions are those that Perl uses in ASCII-safe mode with the C</a> modifier.
Otherwise they could match many more non-ASCII Unicode characters as
well.  See L<perlrecharclass/Backslash sequences> for details.)

=over 4

=item *

\d is a digit and represents

    [0-9]

=item *

\s is a whitespace character and represents

    [\ \t\r\n\f]

=item *

\w is a word character (alphanumeric or _) and represents

    [0-9a-zA-Z_]

=item *

\D is a negated \d; it represents any character but a digit

    [^0-9]

=item *

\S is a negated \s; it represents any non-whitespace character

    [^\s]

=item *

\W is a negated \w; it represents any non-word character

    [^\w]

=item *

The period '.' matches any character but "\n"

=back

The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside
of character classes.  Here are some in use:

    /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
    /[\d\s]/;         # matches any digit or whitespace character
    /\w\W\w/;         # matches a word char, followed by a
                      # non-word char, followed by a word char
    /..rt/;           # matches any two chars, followed by 'rt'
    /end\./;          # matches 'end.'
    /end[.]/;         # same thing, matches 'end.'

The S<B<word anchor> > C<\b> matches a boundary between a word
character and a non-word character C<\w\W> or C<\W\w>:

    $x = "Housecat catenates house and cat";
    $x =~ /\bcat/;    # matches cat in 'catenates'
    $x =~ /cat\b/;    # matches cat in 'Housecat'
    $x =~ /\bcat\b/;  # matches 'cat' at end of string

In the last example, the end of the string is considered a word
boundary.

=head2 Matching this or that

We can match different character strings with the B<alternation>
metacharacter C<'|'>.  To match C<dog> or C<cat>, we form the regex
C<dog|cat>.  As before, Perl will try to match the regex at the
earliest possible point in the string.  At each character position,
Perl will first try to match the first alternative, C<dog>.  If
C<dog> doesn't match, Perl will then try the next alternative, C<cat>.
If C<cat> doesn't match either, then the match fails and Perl moves to
the next position in the string.  Some examples:

    "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
    "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"

Even though C<dog> is the first alternative in the second regex,
C<cat> is able to match earlier in the string.

    "cats"          =~ /c|ca|cat|cats/; # matches "c"
    "cats"          =~ /cats|cat|ca|c/; # matches "cats"

At a given character position, the first alternative that allows the
regex match to succeed will be the one that matches.  Here, all the
alternatives match at the first string position, so the first matches.

=head2 Grouping things and hierarchical matching

The B<grouping> metacharacters C<()> allow a part of a regex to be
treated as a single unit.  Parts of a regex are grouped by enclosing
them in parentheses.  The regex C<house(cat|keeper)> means match
C<house> followed by either C<cat> or C<keeper>.  Some more examples
are

    /(a|b)b/;    # matches 'ab' or 'bb'
    /(^a|b)c/;   # matches 'ac' at start of string or 'bc' anywhere

    /house(cat|)/;  # matches either 'housecat' or 'house'
    /house(cat(s|)|)/;  # matches either 'housecats' or 'housecat' or
                        # 'house'.  Note groups can be nested.

    "20" =~ /(19|20|)\d\d/;  # matches the null alternative '()\d\d',
                             # because '20\d\d' can't match

=head2 Extracting matches

The grouping metacharacters C<()> also allow the extraction of the
parts of a string that matched.  For each grouping, the part that
matched inside goes into the special variables C<$1>, C<$2>, etc.
They can be used just as ordinary variables:

    # extract hours, minutes, seconds
    $time =~ /(\d\d):(\d\d):(\d\d)/;  # match hh:mm:ss format
    $hours = $1;
    $minutes = $2;
    $seconds = $3;

In list context, a match C</regex/> with groupings will return the
list of matched values C<($1,$2,...)>.
So we could rewrite it as

    ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);

If the groupings in a regex are nested, C<$1> gets the group with the
leftmost opening parenthesis, C<$2> the next opening parenthesis,
etc.  For example, here is a complex regex and the matching variables
indicated below it:

    /(ab(cd|ef)((gi)|j))/;
     1  2      34

Associated with the matching variables C<$1>, C<$2>, ... are
the B<backreferences> C<\g1>, C<\g2>, ...  Backreferences are
matching variables that can be used I<inside> a regex:

    /(\w\w\w)\s\g1/; # find sequences like 'the the' in string

C<$1>, C<$2>, ... should only be used outside of a regex, and C<\g1>,
C<\g2>, ... only inside a regex.

=head2 Matching repetitions

The B<quantifier> metacharacters C<?>, C<*>, C<+>, and C<{}> allow us
to determine the number of repeats of a portion of a regex we
consider to be a match.  Quantifiers are put immediately after the
character, character class, or grouping that we want to specify.  They
have the following meanings:

=over 4

=item *

C<a?> = match 'a' 1 or 0 times

=item *

C<a*> = match 'a' 0 or more times, i.e., any number of times

=item *

C<a+> = match 'a' 1 or more times, i.e., at least once

=item *

C<a{n,m}> = match at least C<n> times, but not more than C<m>
times.

=item *

C<a{n,}> = match at least C<n> times

=item *

C<a{n}> = match exactly C<n> times

=back

Here are some examples:

    /[a-z]+\s+\d*/;  # match a lowercase word, at least some space, and
                     # any number of digits
    /(\w+)\s+\g1/;   # match doubled words of arbitrary length
    $year =~ /^\d{2,4}$/;  # make sure year is at least 2 but not more
                           # than 4 digits
    $year =~ /^\d{4}$|^\d{2}$/; # better match; throw out 3 digit dates

These quantifiers will try to match as much of the string as possible,
while still allowing the regex to match.  So we have

    $x = 'the cat in the hat';
    $x =~ /^(.*)(at)(.*)$/; # matches,
                            # $1 = 'the cat in the h'
                            # $2 = 'at'
                            # $3 = ''   (0 matches)

The first quantifier C<.*> grabs as much of the string as possible
while still having the regex match. The second quantifier C<.*> has
no string left to it, so it matches 0 times.

=head2 More matching

There are a few more things you might want to know about matching
operators.
The global modifier C<//g> allows the matching operator to match
within a string as many times as possible.  In scalar context,
successive matches against a string will have C<//g> jump from match
to match, keeping track of position in the string as it goes along.
You can get or set the position with the C<pos()> function.
For example,

    $x = "cat dog house"; # 3 words
    while ($x =~ /(\w+)/g) {
        print "Word is $1, ends at position ", pos $x, "\n";
    }

prints

    Word is cat, ends at position 3
    Word is dog, ends at position 7
    Word is house, ends at position 13

A failed match or changing the target string resets the position.  If
you don't want the position reset after failure to match, add the
C<//c> modifier, as in C</regex/gc>.
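
The position rules just described can be seen in a small sketch.  The
input string and the C<@log> array are made up for illustration and are
not part of the guide; the only features used are C<//g>, C<//gc> and
C<pos()>:

```perl
use strict;
use warnings;

# Hypothetical input for illustration: a number, a word, a number.
my $x = "12 apples 7";
my @log;

# Scalar-context //g: each successful match resumes where the last ended.
while ($x =~ /(\d+)/g) {
    push @log, "number $1 ends at " . pos($x);
}

# The loop ended on a failed match, so the position has been reset.
push @log, "after loop: " . (defined pos($x) ? pos($x) : 'undef');

# With //gc, a failed match leaves the position where it was.
my $matched = $x =~ /\d+/gc;      # matches '12', position is now 2
$matched    = $x =~ /[A-Z]+/gc;   # fails (no capitals), position is kept
push @log, "after failed //gc: " . pos($x);

# pos() is also assignable: start the next //g match at offset 3.
pos($x) = 3;
if ($x =~ /(\w+)/g) {
    push @log, "from position 3: $1";
}

print "$_\n" for @log;
```

Run as written, this should print

    number 12 ends at 2
    number 7 ends at 11
    after loop: undef
    after failed //gc: 2
    from position 3: apples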

In list context, C<//g> returns a list of matched groupings, or if
there are no groupings, a list of matches to the whole regex.  So

    @words = ($x =~ /(\w+)/g);  # matches,
                                # $words[0] = 'cat'
                                # $words[1] = 'dog'
                                # $words[2] = 'house'

=head2 Search and replace

Search and replace is performed using C<s/regex/replacement/modifiers>.
The C<replacement> is a Perl double-quoted string that replaces in the
string whatever is matched with the C<regex>.  The operator C<=~> is
also used here to associate a string with C<s///>.  If matching
against C<$_>, the S<C<$_ =~>> can be dropped.  If there is a match,
C<s///> returns the number of substitutions made; otherwise it returns
false.  Here are a few examples:

    $x = "Time to feed the cat!";
    $x =~ s/cat/hacker/;   # $x contains "Time to feed the hacker!"
    $y = "'quoted words'";
    $y =~ s/^'(.*)'$/$1/;  # strip single quotes,
                           # $y contains "quoted words"

With the C<s///> operator, the matched variables C<$1>, C<$2>, etc.
are immediately available for use in the replacement expression. With
the global modifier, C<s///g> will search and replace all occurrences
of the regex in the string:

    $x = "I batted 4 for 4";
    $x =~ s/4/four/;   # $x contains "I batted four for 4"
    $x = "I batted 4 for 4";
    $x =~ s/4/four/g;  # $x contains "I batted four for four"

The non-destructive modifier C<s///r> causes the result of the substitution
to be returned instead of modifying C<$_> (or whatever variable the
substitute was bound to with C<=~>):

    $x = "I like dogs.";
    $y = $x =~ s/dogs/cats/r;
    print "$x $y\n"; # prints "I like dogs. I like cats."

    $x = "Cats are great.";
    print $x =~ s/Cats/Dogs/r =~ s/Dogs/Frogs/r =~
        s/Frogs/Hedgehogs/r, "\n";
    # prints "Hedgehogs are great."

    @foo = map { s/[a-z]/X/r } qw(a b c 1 2 3);
    # @foo is now qw(X X X 1 2 3)

The evaluation modifier C<s///e> treats the replacement as Perl code
rather than as a double-quoted string; the code is evaluated and its
result is substituted for the matched substring.  Some examples:

    # reverse all the words in a string
    $x = "the cat in the hat";
    $x =~ s/(\w+)/reverse $1/ge;   # $x contains "eht tac ni eht tah"

    # convert percentage to decimal
    $x = "A 39% hit rate";
    $x =~ s!(\d+)%!$1/100!e;       # $x contains "A 0.39 hit rate"

The last example shows that C<s///> can use other delimiters, such as
C<s!!!> and C<s{}{}>, and even C<s{}//>.  If single quotes are used,
as in C<s'''>, then the regex and replacement are treated as
single-quoted strings.

=head2 The split operator

C<split /regex/, string> splits C<string> into a list of substrings
and returns that list.  The regex determines the character sequence
that C<string> is split with respect to.  For example, to split a
string into words, use

    $x = "Calvin and Hobbes";
    @word = split /\s+/, $x;  # $word[0] = 'Calvin'
                              # $word[1] = 'and'
                              # $word[2] = 'Hobbes'

To extract a comma-delimited list of numbers, use

    $x = "1.618,2.718,   3.142";
    @const = split /,\s*/, $x;  # $const[0] = '1.618'
                                # $const[1] = '2.718'
                                # $const[2] = '3.142'

If the empty regex C<//> is used, the string is split into individual
characters.  If the regex has groupings, then the list produced contains
the matched substrings from the groupings as well:

    $x = "/usr/bin";
    @parts = split m!(/)!, $x;  # $parts[0] = ''
                                # $parts[1] = '/'
                                # $parts[2] = 'usr'
                                # $parts[3] = '/'
                                # $parts[4] = 'bin'

Since the first character of C<$x> matched the regex, C<split> prepended
an empty initial element to the list.

=head1 BUGS

None.

=head1 SEE ALSO

This is just a quick start guide.
For a more in-depth tutorial on
regexes, see L<perlretut> and for the reference page, see L<perlre>.

=head1 AUTHOR AND COPYRIGHT

Copyright (c) 2000 Mark Kvale
All rights reserved.

This document may be distributed under the same terms as Perl itself.

=head2 Acknowledgments

The author would like to thank Mark-Jason Dominus, Tom Christiansen,
Ilya Zakharevich, Brad Hughes, and Mike Giroux for all their helpful
comments.

=cut