=head1 NAME
X<character class>

perlrecharclass - Perl Regular Expression Character Classes

=head1 DESCRIPTION

The top level documentation about Perl regular expressions
is found in L<perlre>.

This manual page discusses the syntax and use of character
classes in Perl Regular Expressions.

A character class is a way of denoting a set of characters,
in such a way that one character of the set is matched.
It's important to remember that matching a character class
consumes exactly one character in the source string. (The source
string is the string the regular expression is matched against.)

There are three types of character classes in Perl regular
expressions: the dot, backslashed sequences, and the form enclosed in square
brackets.  Keep in mind, though, that often the term "character class" is used
to mean just the bracketed form.  This is true in other Perl documentation.

=head2 The dot

The dot (or period), C<.>, is probably the most used, and certainly
the best-known, character class. By default, a dot matches any
character, except for the newline. The default can be changed to
add matching the newline with the I<single line> modifier: either
for the entire regular expression using the C</s> modifier, or
locally using C<(?s)>.

Here are some examples:

 "a"  =~  /./       # Match
 "."  =~  /./       # Match
 ""   =~  /./       # No match (dot has to match a character)
 "\n" =~  /./       # No match (dot does not match a newline)
 "\n" =~  /./s      # Match (global 'single line' modifier)
 "\n" =~  /(?s:.)/  # Match (local 'single line' modifier)
 "ab" =~  /^.$/     # No match (dot matches one character)

=head2 Backslashed sequences
X<\w> X<\W> X<\s> X<\S> X<\d> X<\D> X<\p> X<\P>
X<\N> X<\v> X<\V> X<\h> X<\H>
X<word> X<whitespace>

Perl regular expressions contain many backslashed sequences that
constitute a character class.
That is, they will match a single
character, if that character belongs to a specific set of characters
(defined by the sequence). A backslashed sequence is a sequence of
characters starting with a backslash. Not all backslashed sequences
are character classes; for a full list, see L<perlrebackslash>.

Here's a list of the backslashed sequences that are character classes.  They
are discussed in more detail below.

 \d             Match a digit character.
 \D             Match a non-digit character.
 \w             Match a "word" character.
 \W             Match a non-"word" character.
 \s             Match a whitespace character.
 \S             Match a non-whitespace character.
 \h             Match a horizontal whitespace character.
 \H             Match a character that isn't horizontal whitespace.
 \N             Match a character that isn't a newline.  Experimental.
 \v             Match a vertical whitespace character.
 \V             Match a character that isn't vertical whitespace.
 \pP, \p{Prop}  Match a character matching a Unicode property.
 \PP, \P{Prop}  Match a character that doesn't match a Unicode property.

=head3 Digits

C<\d> matches a single character that is considered to be a I<digit>.  What is
considered a digit depends on the internal encoding of the source string and
the locale that is in effect. If the source string is in UTF-8 format, C<\d>
not only matches the digits '0' - '9', but also Arabic, Devanagari and digits
from other languages. Otherwise, if there is a locale in effect, it will match
whatever characters the locale considers digits. Without a locale, C<\d>
matches the digits '0' to '9'.  See L</Locale, EBCDIC, Unicode and UTF-8>.

Any character that isn't matched by C<\d> will be matched by C<\D>.

=head3 Word characters

A C<\w> matches a single alphanumeric character (an alphabetic character, or a
decimal digit) or an underscore (C<_>), not a whole word.  Use C<\w+> to match
a string of Perl-identifier characters (which isn't the same as matching an
English word).
What is considered a word character depends on the internal
encoding of the string and the locale or EBCDIC code page that is in effect. If
it's in UTF-8 format, C<\w> matches those characters that are considered word
characters in the Unicode database. That is, it not only matches ASCII letters,
but also Thai letters, Greek letters, etc.  If the source string isn't in UTF-8
format, C<\w> matches those characters that are considered word characters by
the current locale or EBCDIC code page.  Without a locale or EBCDIC code page,
C<\w> matches the ASCII letters, digits and the underscore.
See L</Locale, EBCDIC, Unicode and UTF-8>.

Any character that isn't matched by C<\w> will be matched by C<\W>.

=head3 Whitespace

C<\s> matches any single character that is considered whitespace.  In the ASCII
range, C<\s> matches the horizontal tab (C<\t>), the new line (C<\n>), the form
feed (C<\f>), the carriage return (C<\r>), and the space.  (The vertical tab,
C<\cK> is not matched by C<\s>.)  The exact set of characters matched by C<\s>
depends on whether the source string is in UTF-8 format and the locale or
EBCDIC code page that is in effect. If it's in UTF-8 format, C<\s> matches what
is considered whitespace in the Unicode database; the complete list is in the
table below. Otherwise, if there is a locale or EBCDIC code page in effect,
C<\s> matches whatever is considered whitespace by the current locale or EBCDIC
code page. Without a locale or EBCDIC code page, C<\s> matches the five
characters mentioned in the beginning of this paragraph.  Perhaps the most
notable possible surprise is that C<\s> matches a non-breaking space only if
the non-breaking space is in a UTF-8 encoded string or the locale or EBCDIC
code page that is in effect has that character.
See L</Locale, EBCDIC, Unicode and UTF-8>.

Any character that isn't matched by C<\s> will be matched by C<\S>.
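
As a minimal sketch of how C<\d>, C<\w>, C<\s> and their complements are
typically combined with quantifiers, consider the following.  The sample
strings and variable names are hypothetical, and the data is deliberately
ASCII-only so that the results do not depend on the locale or on the internal
encoding of the string.

```perl
use strict;
use warnings;

# Hypothetical sample data, ASCII only so that locale and internal
# encoding play no role in what the classes match.
my $line = "alpha  beta_2\tgamma\n";

# \s+ as a separator: runs of spaces, tabs and newlines are skipped.
my @fields = split /\s+/, $line;

# \S+ matches a run of non-whitespace; \d+ matches a run of digits.
my @tokens = $line =~ /(\S+)/g;
my @digits = $line =~ /(\d+)/g;

# \D is the complement of \d: remove everything that is not a digit.
(my $only_digits = "tel: 555-0100") =~ s/\D//g;

print join("|", @fields), "\n";
print "$only_digits\n";
```

Each class still consumes exactly one character per match; it is the C<+>
quantifier that makes the patterns cover whole fields.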

C<\h> will match any character that is considered horizontal whitespace;
this includes the space and the tab characters and 17 other characters that are
listed in the table below. C<\H> will match any character
that is not considered horizontal whitespace.

C<\N> is new in 5.12, and is experimental.  It, like the dot, will match any
character that is not a newline. The difference is that C<\N> will not be
influenced by the single line C</s> regular expression modifier.  Note that
C<\N> has a second meaning when it is written in the form C<\N{...}>.  This
form is for named characters.  See L<charnames> for those.  If C<\N> is
followed by an opening brace and something that is not a quantifier, perl will
assume that a character name is coming, and not this meaning of C<\N>.  For
example, C<\N{3}> means to match 3 non-newlines; C<\N{5,}> means to match 5 or
more non-newlines, but C<\N{4F}> and C<\N{F4}> are not legal quantifiers, and
will cause perl to look for characters named C<4F> or C<F4>, respectively (and
won't find them, thus raising an error, unless they have been defined using
custom names).

C<\v> will match any character that is considered vertical whitespace;
this includes the carriage return and line feed characters (newline) plus 5
other characters listed in the table below.
C<\V> will match any character that is not considered vertical whitespace.

C<\R> matches anything that can be considered a newline under Unicode
rules. It's not a character class, as it can match a multi-character
sequence. Therefore, it cannot be used inside a bracketed character
class; use C<\v> instead (vertical whitespace).
Details are discussed in L<perlrebackslash>.

Note that unlike C<\s>, C<\d> and C<\w>, C<\h> and C<\v> always match
the same characters, regardless of whether the source string is in UTF-8
format or not.
The set of characters they match is also not influenced
by locale or EBCDIC code page.

One might think that C<\s> is equivalent to C<[\h\v]>. This is not true.  The
vertical tab (C<"\x0b">) is not matched by C<\s>, although it is considered
vertical whitespace. Furthermore, if the source string is not in UTF-8 format,
and any locale or EBCDIC code page that is in effect doesn't include them, the
next line (C<"\x85">) and the no-break space (C<"\xA0">) characters are not
matched by C<\s>, but are by C<\v> and C<\h> respectively.  If the source
string is in UTF-8 format, both the next line and the no-break space are
matched by C<\s>.

The following table is a complete listing of characters matched by
C<\s>, C<\h> and C<\v> as of Unicode 5.2.

The first column gives the code point of the character (in hex format),
the second column gives the (Unicode) name. The third column indicates
by which class(es) the character is matched (assuming no locale or EBCDIC code
page is in effect that changes the C<\s> matching).

 0x00009  CHARACTER TABULATION       h s
 0x0000a  LINE FEED (LF)              vs
 0x0000b  LINE TABULATION             v
 0x0000c  FORM FEED (FF)              vs
 0x0000d  CARRIAGE RETURN (CR)        vs
 0x00020  SPACE                      h s
 0x00085  NEXT LINE (NEL)             vs  [1]
 0x000a0  NO-BREAK SPACE             h s  [1]
 0x01680  OGHAM SPACE MARK           h s
 0x0180e  MONGOLIAN VOWEL SEPARATOR  h s
 0x02000  EN QUAD                    h s
 0x02001  EM QUAD                    h s
 0x02002  EN SPACE                   h s
 0x02003  EM SPACE                   h s
 0x02004  THREE-PER-EM SPACE         h s
 0x02005  FOUR-PER-EM SPACE          h s
 0x02006  SIX-PER-EM SPACE           h s
 0x02007  FIGURE SPACE               h s
 0x02008  PUNCTUATION SPACE          h s
 0x02009  THIN SPACE                 h s
 0x0200a  HAIR SPACE                 h s
 0x02028  LINE SEPARATOR              vs
 0x02029  PARAGRAPH SEPARATOR         vs
 0x0202f  NARROW NO-BREAK SPACE      h s
 0x0205f  MEDIUM MATHEMATICAL SPACE  h s
 0x03000  IDEOGRAPHIC SPACE          h s

=over 4

=item [1]

NEXT LINE and NO-BREAK SPACE only match C<\s> if the source string is in
UTF-8 format, or the locale or EBCDIC code page that is in effect includes them.

=back

It is worth noting that C<\d>, C<\w>, etc, match single characters, not
complete numbers or words. To match a number (that consists of digits),
use C<\d+>; to match a word, use C<\w+>.


=head3 Unicode Properties

C<\pP> and C<\p{Prop}> are character classes to match characters that fit given
Unicode properties.  One letter property names can be used in the C<\pP> form,
with the property name following the C<\p>, otherwise, braces are required.
When using braces, there is a single form, which is just the property name
enclosed in the braces, and a compound form which looks like C<\p{name=value}>,
which means to match if the property "name" for the character has the particular
"value".
For instance, a match for a number can be written as C</\pN/> or as
C</\p{Number}/>, or as C</\p{Number=True}/>.
Lowercase letters are matched by the property I<Lowercase_Letter> which
has as short form I<Ll>. They need the braces, so are written as C</\p{Ll}/> or
C</\p{Lowercase_Letter}/>, or C</\p{General_Category=Lowercase_Letter}/>
(the underscores are optional).
C</\pLl/> is valid, but means something different.
It matches a two character string: a letter (Unicode property C<\pL>),
followed by a lowercase C<l>.

For more details, see L<perlunicode/Unicode Character Properties>; for a
complete list of possible properties, see
L<perluniprops/Properties accessible through \p{} and \P{}>.
It is also possible to define your own properties. This is discussed in
L<perlunicode/User-Defined Character Properties>.


=head4 Examples

 "a"  =~  /\w/      # Match, "a" is a 'word' character.
 "7"  =~  /\w/      # Match, "7" is a 'word' character as well.
 "a"  =~  /\d/      # No match, "a" isn't a digit.
 "7"  =~  /\d/      # Match, "7" is a digit.
 " "  =~  /\s/      # Match, a space is whitespace.
 "a"  =~  /\D/      # Match, "a" is a non-digit.
 "7"  =~  /\D/      # No match, "7" is not a non-digit.
 " "  =~  /\S/      # No match, a space is not non-whitespace.

 " "  =~  /\h/      # Match, space is horizontal whitespace.
 " "  =~  /\v/      # No match, space is not vertical whitespace.
 "\r" =~  /\v/      # Match, a return is vertical whitespace.

 "a"  =~  /\pL/     # Match, "a" is a letter.
 "a"  =~  /\p{Lu}/  # No match, /\p{Lu}/ matches upper case letters.

 "\x{0e0b}" =~ /\p{Thai}/  # Match, \x{0e0b} is the character
                           # 'THAI CHARACTER SO SO', and that's in
                           # Thai Unicode class.
 "a"  =~  /\P{Lao}/ # Match, as "a" is not a Laotian character.


=head2 Bracketed Character Classes

The third form of character class you can use in Perl regular expressions
is the bracketed form. In its simplest form, it lists the characters
that may be matched, surrounded by square brackets, like this: C<[aeiou]>.
This matches one of C<a>, C<e>, C<i>, C<o> or C<u>.  Like the other
character classes, exactly one character will be matched. To match
a longer string consisting of characters mentioned in the character
class, follow the character class with a quantifier.  For instance,
C<[aeiou]+> matches a string of one or more lowercase ASCII vowels.

Repeating a character in a character class has no
effect; it's considered to be in the set only once.

Examples:

 "e"  =~  /[aeiou]/        # Match, as "e" is listed in the class.
 "p"  =~  /[aeiou]/        # No match, "p" is not listed in the class.
 "ae" =~  /^[aeiou]$/      # No match, a character class only matches
                           # a single character.
 "ae" =~  /^[aeiou]+$/     # Match, due to the quantifier.

=head3 Special Characters Inside a Bracketed Character Class

Most characters that are meta characters in regular expressions (that
is, characters that carry a special meaning like C<.>, C<*>, or C<(>) lose
their special meaning and can be used inside a character class without
the need to escape them. For instance, C<[()]> matches either an opening
parenthesis, or a closing parenthesis, and the parens inside the character
class don't group or capture.

Characters that may carry a special meaning inside a character class are:
C<\>, C<^>, C<->, C<[> and C<]>, and are discussed below. They can be
escaped with a backslash, although this is sometimes not needed, in which
case the backslash may be omitted.

The sequence C<\b> is special inside a bracketed character class. While
outside the character class C<\b> is an assertion indicating a point
that does not have either two word characters or two non-word characters
on either side, inside a bracketed character class, C<\b> matches a
backspace character.
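
The behaviour described above can be illustrated with a short sketch.  The
strings and variable names are made up for the example, and C<"\cH"> is
assumed to be the backspace character, as it is on ASCII platforms.

```perl
use strict;
use warnings;

# Inside a bracketed class the dot is an ordinary character.
my $literal_dot = "a.b" =~ /a[.]b/ ? 1 : 0;   # dot matches a literal "."
my $no_wildcard = "axb" =~ /a[.]b/ ? 1 : 0;   # "x" is not a literal "."

# [()] matches a parenthesis; nothing is grouped or captured by the
# parentheses themselves, so this simply counts them.
my $paren_count = () = "f(g(x))" =~ /[()]/g;

# Outside a class \b is a word boundary; inside a class it is a backspace.
my $backspace = "\cH"   =~ /[\b]/ ? 1 : 0;
my $boundary  = "ab cd" =~ /b\b/  ? 1 : 0;

print "$literal_dot $no_wildcard $paren_count $backspace $boundary\n";
```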

The sequences
C<\a>,
C<\c>,
C<\e>,
C<\f>,
C<\n>,
C<\N{I<NAME>}>,
C<\N{U+I<wide hex char>}>,
C<\r>,
C<\t>,
and
C<\x>
are also special and have the same meanings as they do outside a bracketed character
class.

Also, a backslash followed by two or three octal digits is considered an octal
number.

A C<[> is not special inside a character class, unless it's the start
of a POSIX character class (see below). It normally does not need escaping.

A C<]> is normally either the end of a POSIX character class (see below), or it
signals the end of the bracketed character class.  If you want to include a
C<]> in the set of characters, you must generally escape it.
However, if the C<]> is the I<first> (or the second if the first
character is a caret) character of a bracketed character class, it
does not denote the end of the class (as you cannot have an empty class)
and is considered part of the set of characters that can be matched without
escaping.

Examples:

 "+"   =~ /[+?*]/     #  Match, "+" in a character class is not special.
 "\cH" =~ /[\b]/      #  Match, \b inside a character class
                      #  is equivalent to a backspace.
 "]"   =~ /[][]/      #  Match, as the character class contains
                      #  both [ and ].
 "[]"  =~ /[[]]/      #  Match, the pattern contains a character class
                      #  containing just [, and the character class is
                      #  followed by a ].

=head3 Character Ranges

It is not uncommon to want to match a range of characters. Luckily, instead
of listing all the characters in the range, one may use the hyphen (C<->).
If inside a bracketed character class you have two characters separated
by a hyphen, it's treated as if all the characters between the two are in
the class. For instance, C<[0-9]> matches any ASCII digit, and C<[a-m]>
matches any lowercase letter from the first half of the ASCII alphabet.
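
A minimal sketch of ranges in use follows.  The helper C<is_hex> and the word
list are hypothetical names introduced only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the whole string consists of hex digits.
# Three ranges are listed in one class; \A and \z anchor the match to the
# entire string.
sub is_hex {
    my ($text) = @_;
    return $text =~ /\A[0-9a-fA-F]+\z/ ? 1 : 0;
}

# [a-m] matches lowercase ASCII letters from the first half of the
# alphabet only; the uppercase "K" in "Kiwi" is outside the range.
my @first_half = grep { /^[a-m]/ } qw(apple zebra mango nut Kiwi);

print is_hex("c0ffee"), is_hex("coffee"), "\n";
print "@first_half\n";
```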

Note that the two characters on either side of the hyphen are not
necessarily both letters or both digits. Any character is possible,
although not advisable.  C<['-?]> contains a range of characters, but
most people will not know which characters that will be. Furthermore,
such ranges may lead to portability problems if the code has to run on
a platform that uses a different character set, such as EBCDIC.

If a hyphen in a character class cannot syntactically be part of a range, for
instance because it is the first or the last character of the character class,
or if it immediately follows a range, the hyphen isn't special, and will be
considered a character that may be matched literally. You have to escape the
hyphen with a backslash if you want to have a hyphen in your set of characters
to be matched, and its position in the class is such that it could be
considered part of a range.

Examples:

 [a-z]       #  Matches a character that is a lower case ASCII letter.
 [a-fz]      #  Matches any letter between 'a' and 'f' (inclusive) or
             #  the letter 'z'.
 [-z]        #  Matches either a hyphen ('-') or the letter 'z'.
 [a-f-m]     #  Matches any letter between 'a' and 'f' (inclusive), the
             #  hyphen ('-'), or the letter 'm'.
 ['-?]       #  Matches any of the characters  '()*+,-./0123456789:;<=>?
             #  (But not on an EBCDIC platform).


=head3 Negation

It is also possible to instead list the characters you do not want to
match. You can do so by using a caret (C<^>) as the first character in the
character class. For instance, C<[^a-z]> matches a character that is not a
lowercase ASCII letter.

This syntax makes the caret a special character inside a bracketed character
class, but only if it is the first character of the class. So if you want
to have the caret as one of the characters you want to match, you either
have to escape the caret, or not list it first.
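
The following sketch shows a negated class in a substitution, and the three
ways of treating a caret.  The input string and variable names are
hypothetical.

```perl
use strict;
use warnings;

# [^0-9] matches any character that is not an ASCII digit, so deleting
# every such character leaves only the digits.
(my $digits = "555-0100 ext. 9") =~ s/[^0-9]//g;

# A caret is only special as the first character of the class.
my $negated = "^" =~ /[^^]/  ? 1 : 0;   # class means "anything but a caret"
my $later   = "^" =~ /[x^]/  ? 1 : 0;   # caret listed second: literal
my $escaped = "^" =~ /[\^x]/ ? 1 : 0;   # caret escaped: literal

# A negated class matches a newline too, unless the newline is listed.
my $newline = "\n" =~ /[^a-z]/ ? 1 : 0;

print "$digits $negated $later $escaped $newline\n";
```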

Examples:

 "e"  =~  /[^aeiou]/   #  No match, the 'e' is listed.
 "x"  =~  /[^aeiou]/   #  Match, as 'x' isn't a lowercase vowel.
 "^"  =~  /[^^]/       #  No match, matches anything that isn't a caret.
 "^"  =~  /[x^]/       #  Match, caret is not special here.

=head3 Backslash Sequences

You can put any backslash sequence character class (with the exception of
C<\N>) inside a bracketed character class, and it will act just
as if you put all the characters matched by the backslash sequence inside the
character class. For instance, C<[a-f\d]> will match any digit, or any of the
lowercase letters between 'a' and 'f' inclusive.

C<\N> within a bracketed character class must be of the forms C<\N{I<name>}> or
C<\N{U+I<wide hex char>}> for the same reason that a dot C<.> inside a
bracketed character class loses its special meaning: it matches nearly
anything, which generally isn't what you want to happen.

Examples:

 /[\p{Thai}\d]/     # Matches a character that is either a Thai
                    # character, or a digit.
 /[^\p{Arabic}()]/  # Matches a character that is neither an Arabic
                    # character, nor a parenthesis.

Backslash sequence character classes cannot form one of the endpoints
of a range.

=head3 Posix Character Classes
X<character class> X<\p> X<\p{}>
X<alpha> X<alnum> X<ascii> X<blank> X<cntrl> X<digit> X<graph>
X<lower> X<print> X<punct> X<space> X<upper> X<word> X<xdigit>

POSIX character classes have the form C<[:class:]>, where I<class> is the
name, and C<[:> and C<:]> are the delimiters. POSIX character classes only appear
I<inside> bracketed character classes, and are a convenient and descriptive
way of listing a group of characters, though they currently suffer from
portability issues (see below and L</Locale, EBCDIC, Unicode and UTF-8>).
Be careful about the syntax,

 # Correct:
 $string =~ /[[:alpha:]]/

 # Incorrect (will warn):
 $string =~ /[:alpha:]/

The latter pattern would be a character class consisting of a colon,
and the letters C<a>, C<l>, C<p> and C<h>.
These character classes can be part of a larger bracketed character class.  For
example,

 [01[:alpha:]%]

is valid and matches '0', '1', any alphabetic character, and the percent sign.

Perl recognizes the following POSIX character classes:

 alpha   Any alphabetical character ("[A-Za-z]").
 alnum   Any alphanumerical character. ("[A-Za-z0-9]")
 ascii   Any character in the ASCII character set.
 blank   A GNU extension, equal to a space or a horizontal tab ("\t").
 cntrl   Any control character.  See Note [2] below.
 digit   Any decimal digit ("[0-9]"), equivalent to "\d".
 graph   Any printable character, excluding a space.  See Note [3] below.
 lower   Any lowercase character ("[a-z]").
 print   Any printable character, including a space.  See Note [4] below.
 punct   Any graphical character excluding "word" characters.  Note [5].
 space   Any whitespace character. "\s" plus the vertical tab ("\cK").
 upper   Any uppercase character ("[A-Z]").
 word    A Perl extension ("[A-Za-z0-9_]"), equivalent to "\w".
 xdigit  Any hexadecimal digit ("[0-9a-fA-F]").

Most POSIX character classes have two Unicode-style C<\p> property
counterparts.  (They are not official Unicode properties, but Perl extensions
derived from official Unicode properties.)  The table below shows the relation
between POSIX character classes and these counterparts.

One counterpart, in the column labelled "ASCII-range Unicode" in
the table will only match characters in the ASCII range.  (On EBCDIC platforms,
they match those characters which have ASCII equivalents.)

The other counterpart, in the column labelled "Full-range Unicode", matches any
appropriate characters in the full Unicode character set.  For example,
C<\p{Alpha}> will match not just the ASCII alphabetic characters, but any
character in the entire Unicode character set that is considered to be
alphabetic.

(Each of the counterparts has various synonyms as well.
L<perluniprops/Properties accessible through \p{} and \P{}> lists all the
synonyms, plus all the characters matched by each of the ASCII-range
properties.  For example C<\p{AHex}> is a synonym for C<\p{ASCII_Hex_Digit}>,
and any C<\p> property name can be prefixed with "Is" such as C<\p{IsAlpha}>.)

Both the C<\p> forms are unaffected by any locale that is in effect, or whether
the string is in UTF-8 format or not, or whether the platform is EBCDIC or not.
In contrast, the POSIX character classes are affected.  If the source string is
in UTF-8 format, the POSIX classes (with the exception of C<[[:punct:]]>, see
Note [5]) behave like their "Full-range" Unicode counterparts.  If the source
string is not in UTF-8 format, and no locale is in effect, and the platform is
not EBCDIC, all the POSIX classes behave like their ASCII-range counterparts.
Otherwise, they behave based on the rules of the locale or EBCDIC code page.
It is proposed to change this behavior in a future release of Perl so that
the UTF8ness of the source string will be irrelevant to the behavior of the
POSIX character classes.  This means they will always behave in strict
accordance with the official POSIX standard.  That is, if either locale or
EBCDIC code page is present, they will behave in accordance with those; if
absent, the classes will match only their ASCII-range counterparts.  If you
disagree with this proposal, send email to C<perl5-porters@perl.org>.
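
The difference between the two C<\p> counterparts described above can be
sketched as follows.  The sample characters and the hash C<%matched> are
chosen purely for illustration, and the sketch assumes a perl that recognizes
the C<Posix*> property names used in this document.

```perl
use strict;
use warnings;

# Sample characters from outside the ASCII range, chosen for illustration.
my $greek_alpha  = "\x{3b1}";   # GREEK SMALL LETTER ALPHA
my $arabic_three = "\x{663}";   # ARABIC-INDIC DIGIT THREE

# Each value is 1 if the pattern matched and 0 otherwise.
my %matched = (
    'a / PosixAlpha'     => ("a"           =~ /\p{PosixAlpha}/ ? 1 : 0),
    'a / Alpha'          => ("a"           =~ /\p{Alpha}/      ? 1 : 0),
    'alpha / PosixAlpha' => ($greek_alpha  =~ /\p{PosixAlpha}/ ? 1 : 0),
    'alpha / Alpha'      => ($greek_alpha  =~ /\p{Alpha}/      ? 1 : 0),
    'three / PosixDigit' => ($arabic_three =~ /\p{PosixDigit}/ ? 1 : 0),
    'three / Digit'      => ($arabic_three =~ /\p{Digit}/      ? 1 : 0),
);

printf "%-20s %d\n", $_, $matched{$_} for sort keys %matched;
```

The ASCII-range properties reject the Greek letter and the Arabic-Indic digit,
while the full-range properties accept them; an ASCII C<a> is accepted by both
alphabetic properties.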

 [[:...:]]  ASCII-range          Full-range       backslash  Note
            Unicode              Unicode          sequence
 ----------------------------------------------------------------
   alpha    \p{PosixAlpha}       \p{Alpha}
   alnum    \p{PosixAlnum}       \p{Alnum}
   ascii    \p{ASCII}
   blank    \p{PosixBlank}       \p{Blank} =                 [1]
                                 \p{HorizSpace}   \h         [1]
   cntrl    \p{PosixCntrl}       \p{Cntrl}                   [2]
   digit    \p{PosixDigit}       \p{Digit}        \d
   graph    \p{PosixGraph}       \p{Graph}                   [3]
   lower    \p{PosixLower}       \p{Lower}
   print    \p{PosixPrint}       \p{Print}                   [4]
   punct    \p{PosixPunct}       \p{Punct}                   [5]
            \p{PerlSpace}        \p{SpacePerl}    \s         [6]
   space    \p{PosixSpace}       \p{Space}                   [6]
   upper    \p{PosixUpper}       \p{Upper}
   word     \p{PerlWord}         \p{Word}         \w
   xdigit   \p{ASCII_Hex_Digit}  \p{XDigit}

=over 4

=item [1]

C<\p{Blank}> and C<\p{HorizSpace}> are synonyms.

=item [2]

Control characters don't produce output as such, but instead usually control
the terminal somehow: for example newline and backspace are control characters.
In the ASCII range, characters whose ordinals are between 0 and 31 inclusive,
plus 127 (C<DEL>) are control characters.

On EBCDIC platforms, it is likely that the code page will define C<[[:cntrl:]]>
to be the EBCDIC equivalents of the ASCII controls, plus the controls
that in Unicode have ordinals from 128 through 159.

=item [3]

Any character that is I<graphical>, that is, visible. This class consists
of all the alphanumerical characters and all punctuation characters.

=item [4]

All printable characters, which is the set of all the graphical characters
plus whitespace characters that are not also controls.

=item [5]

C<\p{PosixPunct}> and C<[[:punct:]]> in the ASCII range match all the
non-controls, non-alphanumeric, non-space characters:
C<[-!"#$%&'()*+,./:;<=E<gt>?@[\\\]^_`{|}~]> (although if a locale is in effect,
it could alter the behavior of C<[[:punct:]]>).

When the matching string is in UTF-8 format, C<[[:punct:]]> matches the above
set, plus what C<\p{Punct}> matches.  This is different from strictly matching
according to C<\p{Punct}>, because the above set includes characters that aren't
considered punctuation by Unicode, but rather "symbols".  Another way to say it
is that for a UTF-8 string, C<[[:punct:]]> matches all the characters that
Unicode considers to be punctuation, plus all the ASCII-range characters that
Unicode considers to be symbols.

=item [6]

C<\p{SpacePerl}> and C<\p{Space}> differ only in that C<\p{Space}> additionally
matches the vertical tab, C<\cK>.   Same for the two ASCII-only range forms.

=back

=head4 Negation
X<character class, negation>

A Perl extension to the POSIX character class is the ability to
negate it. This is done by prefixing the class name with a caret (C<^>).
Some examples:

     POSIX         ASCII-range     Full-range     backslash
                   Unicode         Unicode        sequence
 -----------------------------------------------------------
 [[:^digit:]]   \P{PosixDigit}  \P{Digit}        \D
 [[:^space:]]   \P{PosixSpace}  \P{Space}
                \P{PerlSpace}   \P{SpacePerl}    \S
 [[:^word:]]    \P{PerlWord}    \P{Word}         \W

=head4 [= =] and [. .]

Perl will recognize the POSIX character classes C<[=class=]>, and
C<[.class.]>, but does not (yet?) support them.  Use of
such a construct will lead to an error.


=head4 Examples

 /[[:digit:]]/            # Matches a character that is a digit.
 /[01[:lower:]]/          # Matches a character that is either a
                          # lowercase letter, or '0' or '1'.
 /[[:digit:][:^xdigit:]]/ # Matches a character that can be anything
                          # except the letters 'a' to 'f'.  This is
                          # because the main character class is composed
                          # of two POSIX character classes that are ORed
                          # together, one that matches any digit, and
                          # the other that matches anything that isn't a
                          # hex digit.
                          # The result matches all
                          # characters except the letters 'a' to 'f' and
                          # 'A' to 'F'.


=head2 Locale, EBCDIC, Unicode and UTF-8

Some of the character classes have a somewhat different behaviour depending
on the internal encoding of the source string, the locale that is
in effect, and whether the program is running on an EBCDIC platform.

C<\w>, C<\d>, C<\s> and the POSIX character classes (and their negations,
including C<\W>, C<\D>, C<\S>) suffer from this behaviour.  (Since the backslash
sequences C<\b> and C<\B> are defined in terms of C<\w> and C<\W>, they also are
affected.)

The rule is that if the source string is in UTF-8 format, the character
classes match according to the Unicode properties. If the source string
isn't, then the character classes match according to whatever locale or EBCDIC
code page is in effect. If there is neither a locale nor an EBCDIC code page,
they match the ASCII defaults (52 letters, 10 digits and underscore for C<\w>;
0 to 9 for C<\d>; etc.).

This usually means that if you are matching against characters whose C<ord()>
values are between 128 and 255 inclusive, your character class may match
or not depending on the current locale or EBCDIC code page, and whether the
source string is in UTF-8 format. The string will be in UTF-8 format if it
contains characters whose C<ord()> value exceeds 255. But a string may be in
UTF-8 format without it having such characters.  See L<perlunicode/The
"Unicode Bug">.

For portability reasons, it may be better to not use C<\w>, C<\d>, C<\s>
or the POSIX character classes, and use the Unicode properties instead.

=head4 Examples

 $str =  "\xDF";      # $str is not in UTF-8 format.
 $str =~ /^\w/;       # No match, as $str isn't in UTF-8 format.
 $str .= "\x{0e0b}";  # Now $str is in UTF-8 format.
 $str =~ /^\w/;       # Match! $str is now in UTF-8 format.
 chop $str;
 $str =~ /^\w/;       # Still a match!
                      # $str remains in UTF-8 format.

=cut