=head1 NAME

perlrequick - Perl regular expressions quick start

=head1 DESCRIPTION

This page covers the very basics of understanding, creating and
using regular expressions ('regexes') in Perl.


=head1 The Guide

=head2 Simple word matching

The simplest regex is simply a word, or more generally, a string of
characters.  A regex consisting of a word matches any string that
contains that word:

    "Hello World" =~ /World/;  # matches

In this statement, C<World> is a regex and the C<//> enclosing
C</World/> tells perl to search a string for a match.  The operator
C<=~> associates the string with the regex match and produces a true
value if the regex matched, or false if the regex did not match.  In
our case, C<World> matches the second word in C<"Hello World">, so the
expression is true.  This idea has several variations.

Expressions like this are useful in conditionals:

    print "It matches\n" if "Hello World" =~ /World/;

The sense of the match can be reversed by using the C<!~> operator:

    print "It doesn't match\n" if "Hello World" !~ /World/;

The literal string in the regex can be replaced by a variable:

    $greeting = "World";
    print "It matches\n" if "Hello World" =~ /$greeting/;

If you're matching against C<$_>, the C<$_ =~> part can be omitted:

    $_ = "Hello World";
    print "It matches\n" if /World/;

Finally, the C<//> default delimiters for a match can be changed to
arbitrary delimiters by putting an C<'m'> out front:

    "Hello World" =~ m!World!;   # matches, delimited by '!'
    "Hello World" =~ m{World};   # matches, note the matching '{}'
    "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
                                 # '/' becomes an ordinary char

Regexes must match a part of the string I<exactly> in order for the
statement to be true:

    "Hello World" =~ /world/;  # doesn't match, case sensitive
    "Hello World" =~ /o W/;    # matches, ' ' is an ordinary char
    "Hello World" =~ /World /; # doesn't match, no ' ' at end

perl will always match at the earliest possible point in the string:

    "Hello World" =~ /o/;       # matches 'o' in 'Hello'
    "That hat is red" =~ /hat/; # matches 'hat' in 'That'

Not all characters can be used 'as is' in a match.  Some characters,
called B<metacharacters>, are reserved for use in regex notation.
The metacharacters are

    {}[]()^$.|*+?\

A metacharacter can be matched by putting a backslash before it:

    "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
    "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
    'C:\WIN32' =~ /C:\\WIN/;                  # matches
    "/usr/bin/perl" =~ /\/usr\/bin\/perl/;    # matches

In the last regex, the forward slash C<'/'> is also backslashed,
because it is used to delimit the regex.

Non-printable ASCII characters are represented by B<escape sequences>.
Common examples are C<\t> for a tab, C<\n> for a newline, and C<\r>
for a carriage return.  Arbitrary bytes are represented by octal
escape sequences, e.g., C<\033>, or hexadecimal escape sequences,
e.g., C<\x1B>:

    "1000\t2000" =~ m(0\t2);      # matches
    "cat" =~ /\143\x61\x74/;      # matches, but a weird way to spell cat

Regexes are treated mostly as double quoted strings, so variable
substitution works:

    $foo = 'house';
    'cathouse' =~ /cat$foo/;   # matches
    'housecat' =~ /${foo}cat/; # matches

With all of the regexes above, if the regex matched anywhere in the
string, it was considered a match.
To specify I<where> it should
match, we would use the B<anchor> metacharacters C<^> and C<$>.  The
anchor C<^> means match at the beginning of the string and the anchor
C<$> means match at the end of the string, or before a newline at the
end of the string.  Some examples:

    "housekeeper" =~ /keeper/;         # matches
    "housekeeper" =~ /^keeper/;        # doesn't match
    "housekeeper" =~ /keeper$/;        # matches
    "housekeeper\n" =~ /keeper$/;      # matches
    "housekeeper" =~ /^housekeeper$/;  # matches

=head2 Using character classes

A B<character class> allows a set of possible characters, rather than
just a single character, to match at a particular point in a regex.
Character classes are denoted by brackets C<[...]>, with the set of
characters to be possibly matched inside.  Here are some examples:

    /cat/;            # matches 'cat'
    /[bcr]at/;        # matches 'bat', 'cat', or 'rat'
    "abc" =~ /[cab]/; # matches 'a'

In the last statement, even though C<'c'> is the first character in
the class, the earliest point at which the regex can match is C<'a'>.

    /[yY][eE][sS]/; # match 'yes' in a case-insensitive way
                    # 'yes', 'Yes', 'YES', etc.
    /yes/i;         # also match 'yes' in a case-insensitive way

The last example shows a match with an C<'i'> B<modifier>, which makes
the match case-insensitive.

Character classes also have ordinary and special characters, but the
sets of ordinary and special characters inside a character class are
different than those outside a character class.
The special
characters for a character class are C<-]\^$> and are matched using an
escape:

    /[\]c]def/; # matches ']def' or 'cdef'
    $x = 'bcr';
    /[$x]at/;   # matches 'bat', 'cat', or 'rat'
    /[\$x]at/;  # matches '$at' or 'xat'
    /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat'

The special character C<'-'> acts as a range operator within character
classes, so that the unwieldy C<[0123456789]> and C<[abc...xyz]>
become the svelte C<[0-9]> and C<[a-z]>:

    /item[0-9]/;    # matches 'item0' or ... or 'item9'
    /[0-9a-fA-F]/;  # matches a hexadecimal digit

If C<'-'> is the first or last character in a character class, it is
treated as an ordinary character.

The special character C<^> in the first position of a character class
denotes a B<negated character class>, which matches any character but
those in the brackets.  Both C<[...]> and C<[^...]> must match a
character, or the match fails.  Then

    /[^a]at/;  # doesn't match 'aat' or 'at', but matches
               # all other 'bat', 'cat', '0at', '%at', etc.
    /[^0-9]/;  # matches a non-numeric character
    /[a^]at/;  # matches 'aat' or '^at'; here '^' is ordinary

Perl has several abbreviations for common character classes:

=over 4

=item *

\d is a digit and represents [0-9]

=item *

\s is a whitespace character and represents [\ \t\r\n\f]

=item *

\w is a word character (alphanumeric or _) and represents [0-9a-zA-Z_]

=item *

\D is a negated \d; it represents any character but a digit [^0-9]

=item *

\S is a negated \s; it represents any non-whitespace character [^\s]

=item *

\W is a negated \w; it represents any non-word character [^\w]

=item *

The period '.' matches any character but "\n"

=back

The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside
of character classes.
Here are some in use:

    /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format
    /[\d\s]/;         # matches any digit or whitespace character
    /\w\W\w/;         # matches a word char, followed by a
                      # non-word char, followed by a word char
    /..rt/;           # matches any two chars, followed by 'rt'
    /end\./;          # matches 'end.'
    /end[.]/;         # same thing, matches 'end.'

The S<B<word anchor> > C<\b> matches a boundary between a word
character and a non-word character C<\w\W> or C<\W\w>:

    $x = "Housecat catenates house and cat";
    $x =~ /\bcat/;    # matches cat in 'catenates'
    $x =~ /cat\b/;    # matches cat in 'Housecat'
    $x =~ /\bcat\b/;  # matches 'cat' at end of string

In the last example, the end of the string is considered a word
boundary.

=head2 Matching this or that

We can match different character strings with the B<alternation>
metacharacter C<'|'>.  To match C<dog> or C<cat>, we form the regex
C<dog|cat>.  As before, perl will try to match the regex at the
earliest possible point in the string.  At each character position,
perl will first try to match the first alternative, C<dog>.  If
C<dog> doesn't match, perl will then try the next alternative, C<cat>.
If C<cat> doesn't match either, then the match fails and perl moves to
the next position in the string.  Some examples:

    "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
    "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"

Even though C<dog> is the first alternative in the second regex,
C<cat> is able to match earlier in the string.

    "cats"          =~ /c|ca|cat|cats/; # matches "c"
    "cats"          =~ /cats|cat|ca|c/; # matches "cats"

At a given character position, the first alternative that allows the
regex match to succeed will be the one that matches.  Here, all the
alternatives match at the first string position, so the first matches.
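
As a small sketch of the ordering rule for alternatives, the built-in
variable C<$&> holds the text matched by the whole regex, so printing it
shows which alternative won.  The name C<$text> is only illustrative
and is not used elsewhere in this guide:

    $text = "cats";
    if ($text =~ /c|ca|cat|cats/) {
        print "Matched '$&'\n";   # prints "Matched 'c'"
    }
    if ($text =~ /cats|cat|ca|c/) {
        print "Matched '$&'\n";   # prints "Matched 'cats'"
    }

Listing the longest alternative first is therefore one way to prefer
the longest match when several alternatives share a prefix.
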

=head2 Grouping things and hierarchical matching

The B<grouping> metacharacters C<()> allow a part of a regex to be
treated as a single unit.  Parts of a regex are grouped by enclosing
them in parentheses.  The regex C<house(cat|keeper)> means match
C<house> followed by either C<cat> or C<keeper>.  Some more examples
are

    /(a|b)b/;    # matches 'ab' or 'bb'
    /(^a|b)c/;   # matches 'ac' at start of string or 'bc' anywhere

    /house(cat|)/;  # matches either 'housecat' or 'house'
    /house(cat(s|)|)/;  # matches either 'housecats' or 'housecat' or
                        # 'house'.  Note groups can be nested.

    "20" =~ /(19|20|)\d\d/;  # matches the null alternative '()\d\d',
                             # because '20\d\d' can't match

=head2 Extracting matches

The grouping metacharacters C<()> also allow the extraction of the
parts of a string that matched.  For each grouping, the part that
matched inside goes into the special variables C<$1>, C<$2>, etc.
They can be used just as ordinary variables:

    # extract hours, minutes, seconds
    $time =~ /(\d\d):(\d\d):(\d\d)/;  # match hh:mm:ss format
    $hours = $1;
    $minutes = $2;
    $seconds = $3;

In list context, a match C</regex/> with groupings will return the
list of matched values C<($1,$2,...)>.  So we could rewrite it as

    ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);

If the groupings in a regex are nested, C<$1> gets the group with the
leftmost opening parenthesis, C<$2> the next opening parenthesis,
etc.  For example, here is a complex regex and the matching variables
indicated below it:

    /(ab(cd|ef)((gi)|j))/;
     1  2      34

Associated with the matching variables C<$1>, C<$2>, ... are
the B<backreferences> C<\1>, C<\2>, ...  Backreferences are
matching variables that can be used I<inside> a regex:

    /(\w\w\w)\s\1/; # find sequences like 'the the' in string

C<$1>, C<$2>, ...
should only be used outside of a regex, and C<\1>,
C<\2>, ... only inside a regex.

=head2 Matching repetitions

The B<quantifier> metacharacters C<?>, C<*>, C<+>, and C<{}> allow us
to determine the number of repeats of a portion of a regex we
consider to be a match.  Quantifiers are put immediately after the
character, character class, or grouping that we want to specify.  They
have the following meanings:

=over 4

=item *

C<a?> = match 'a' 1 or 0 times

=item *

C<a*> = match 'a' 0 or more times, i.e., any number of times

=item *

C<a+> = match 'a' 1 or more times, i.e., at least once

=item *

C<a{n,m}> = match at least C<n> times, but not more than C<m>
times.

=item *

C<a{n,}> = match at least C<n> times

=item *

C<a{n}> = match exactly C<n> times

=back

Here are some examples:

    /[a-z]+\s+\d*/;  # match a lowercase word, at least some space, and
                     # any number of digits
    /(\w+)\s+\1/;    # match doubled words of arbitrary length
    $year =~ /^\d{2,4}$/;  # make sure year is at least 2 but not more
                           # than 4 digits
    $year =~ /^\d{4}$|^\d{2}$/; # better match; throw out 3 digit dates

These quantifiers will try to match as much of the string as possible,
while still allowing the regex to match.  So we have

    $x = 'the cat in the hat';
    $x =~ /^(.*)(at)(.*)$/; # matches,
                            # $1 = 'the cat in the h'
                            # $2 = 'at'
                            # $3 = ''   (0 matches)

The first quantifier C<.*> grabs as much of the string as possible
while still having the regex match.  The second quantifier C<.*> has
no string left to it, so it matches 0 times.

=head2 More matching

There are a few more things you might want to know about matching
operators.  In the code

    $pattern = 'Seuss';
    while (<>) {
        print if /$pattern/;
    }

perl has to re-evaluate C<$pattern> each time through the loop.
If
C<$pattern> won't be changing, use the C<//o> modifier, to only
perform variable substitutions once.  If you don't want any
substitutions at all, use the special delimiter C<m''>:

    $pattern = 'Seuss';
    m'$pattern'; # matches '$pattern', not 'Seuss'

The global modifier C<//g> allows the matching operator to match
within a string as many times as possible.  In scalar context,
successive matches against a string will have C<//g> jump from match
to match, keeping track of position in the string as it goes along.
You can get or set the position with the C<pos()> function.
For example,

    $x = "cat dog house"; # 3 words
    while ($x =~ /(\w+)/g) {
        print "Word is $1, ends at position ", pos $x, "\n";
    }

prints

    Word is cat, ends at position 3
    Word is dog, ends at position 7
    Word is house, ends at position 13

A failed match or changing the target string resets the position.  If
you don't want the position reset after failure to match, add the
C<//c> modifier, as in C</regex/gc>.

In list context, C<//g> returns a list of matched groupings, or if
there are no groupings, a list of matches to the whole regex.  So

    @words = ($x =~ /(\w+)/g);  # matches,
                                # $words[0] = 'cat'
                                # $words[1] = 'dog'
                                # $words[2] = 'house'

=head2 Search and replace

Search and replace is performed using C<s/regex/replacement/modifiers>.
The C<replacement> is a Perl double quoted string that replaces in the
string whatever is matched with the C<regex>.  The operator C<=~> is
also used here to associate a string with C<s///>.  If matching
against C<$_>, the S<C<$_ =~> > can be dropped.  If there is a match,
C<s///> returns the number of substitutions made, otherwise it returns
false.  Here are a few examples:

    $x = "Time to feed the cat!";
    $x =~ s/cat/hacker/;   # $x contains "Time to feed the hacker!"
    $y = "'quoted words'";
    $y =~ s/^'(.*)'$/$1/;  # strip single quotes,
                           # $y contains "quoted words"

With the C<s///> operator, the matched variables C<$1>, C<$2>, etc.
are immediately available for use in the replacement expression.  With
the global modifier, C<s///g> will search and replace all occurrences
of the regex in the string:

    $x = "I batted 4 for 4";
    $x =~ s/4/four/;   # $x contains "I batted four for 4"
    $x = "I batted 4 for 4";
    $x =~ s/4/four/g;  # $x contains "I batted four for four"

The evaluation modifier C<s///e> wraps an C<eval{...}> around the
replacement string and the evaluated result is substituted for the
matched substring.  Some examples:

    # reverse all the words in a string
    $x = "the cat in the hat";
    $x =~ s/(\w+)/reverse $1/ge;   # $x contains "eht tac ni eht tah"

    # convert percentage to decimal
    $x = "A 39% hit rate";
    $x =~ s!(\d+)%!$1/100!e;       # $x contains "A 0.39 hit rate"

The last example shows that C<s///> can use other delimiters, such as
C<s!!!> and C<s{}{}>, and even C<s{}//>.  If single quotes are used
C<s'''>, then the regex and replacement are treated as single quoted
strings.

=head2 The split operator

C<split /regex/, string> splits C<string> into a list of substrings
and returns that list.  The regex determines the character sequence
that C<string> is split with respect to.  For example, to split a
string into words, use

    $x = "Calvin and Hobbes";
    @word = split /\s+/, $x;  # $word[0] = 'Calvin'
                              # $word[1] = 'and'
                              # $word[2] = 'Hobbes'

To extract a comma-delimited list of numbers, use

    $x = "1.618,2.718,   3.142";
    @const = split /,\s*/, $x;  # $const[0] = '1.618'
                                # $const[1] = '2.718'
                                # $const[2] = '3.142'

If the empty regex C<//> is used, the string is split into individual
characters.
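
For instance, a minimal sketch of the empty-regex case (the names
C<$str> and C<@chars> are illustrative only):

    $str = "cat";
    @chars = split //, $str;  # $chars[0] = 'c'
                              # $chars[1] = 'a'
                              # $chars[2] = 't'

Each element of C<@chars> is then a single character of the original
string.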
If the regex has groupings, then the list produced contains
the matched substrings from the groupings as well:

    $x = "/usr/bin";
    @parts = split m!(/)!, $x;  # $parts[0] = ''
                                # $parts[1] = '/'
                                # $parts[2] = 'usr'
                                # $parts[3] = '/'
                                # $parts[4] = 'bin'

Since the first character of $x matched the regex, C<split> prepended
an empty initial element to the list.

=head1 BUGS

None.

=head1 SEE ALSO

This is just a quick start guide.  For a more in-depth tutorial on
regexes, see L<perlretut> and for the reference page, see L<perlre>.

=head1 AUTHOR AND COPYRIGHT

Copyright (c) 2000 Mark Kvale
All rights reserved.

This document may be distributed under the same terms as Perl itself.

=head2 Acknowledgments

The author would like to thank Mark-Jason Dominus, Tom Christiansen,
Ilya Zakharevich, Brad Hughes, and Mike Giroux for all their helpful
comments.

=cut