=head1 NAME

perlunicode - Unicode support in Perl

=head1 DESCRIPTION

=head2 Important Caveats

Unicode support is an extensive requirement. While Perl does not
implement the Unicode standard or the accompanying technical reports
from cover to cover, Perl does support many Unicode features.

People who want to learn to use Unicode in Perl should probably read
the L<Perl Unicode tutorial, perlunitut|perlunitut> and
L<perluniintro> before reading
this reference document.

Also, the use of Unicode may present security issues that aren't obvious.
Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.

=over 4

=item Safest if you C<use feature 'unicode_strings'>

In order to preserve backward compatibility, Perl does not turn
on full internal Unicode support unless the pragma
C<use feature 'unicode_strings'> is specified. (This is automatically
selected if you use C<use 5.012> or higher.) Failure to do this can
cause surprises. See L</The "Unicode Bug"> below.

This pragma doesn't affect I/O. Nor does it change the internal
representation of strings, only their interpretation. There are still
several places where Unicode isn't fully supported, such as in
filenames.

=item Input and Output Layers

Perl knows when a filehandle uses Perl's internal Unicode encodings
(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
the C<:encoding(utf8)> layer. Other encodings can be converted to Perl's
encoding on input or from Perl's encoding on output by use of the
C<:encoding(...)> layer. See L<open>.

To indicate that Perl source itself is in UTF-8, use C<use utf8;>.

=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts

As a compatibility measure, the C<use utf8> pragma must be explicitly
included to enable recognition of UTF-8 in the Perl scripts themselves
(in string or regular expression literals, or in identifier names) on
ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
machines. B<These are the only times when an explicit C<use utf8>
is needed.> See L<utf8>.

=item C<BOM>-marked scripts and UTF-16 scripts autodetected

If a Perl script begins marked with the Unicode C<BOM> (UTF-16LE, UTF-16BE,
or UTF-8), or if the script looks like non-C<BOM>-marked UTF-16 of either
endianness, Perl will correctly read in the script as Unicode.
(C<BOM>less UTF-8 cannot be effectively recognized or differentiated from
ISO 8859-1 or other eight-bit encodings.)

=item C<use encoding> needed to upgrade non-Latin-1 byte strings

By default, there is a fundamental asymmetry in Perl's Unicode model:
implicit upgrading from byte strings to Unicode strings assumes that
they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
downgraded with UTF-8 encoding. This happens because the first 256
code points in Unicode happen to agree with Latin-1.

See L</"Byte and Character Semantics"> for more details.

=back

=head2 Byte and Character Semantics

Perl uses logically-wide characters to represent strings internally.

Starting in Perl 5.14, Perl-level operations work with
characters rather than bytes within the scope of a
C<L<use feature 'unicode_strings'|feature>> (or equivalently
C<use 5.012> or higher). (This is not true if bytes have been
explicitly requested by C<L<use bytes|bytes>>, nor necessarily true
for interactions with the platform's operating system.)

For earlier Perls, and when C<unicode_strings> is not in effect, Perl
provides a fairly safe environment that can handle both types of
semantics in programs.
For operations where Perl can unambiguously
decide that the input data are characters, Perl switches to character
semantics. For operations where this determination cannot be made
without additional information from the user, Perl decides in favor of
compatibility and chooses to use byte semantics.

When C<use locale> (but not C<use locale ':not_characters'>) is in
effect, Perl uses the rules associated with the current locale.
(C<use locale> overrides C<use feature 'unicode_strings'> in the same scope,
while C<use locale ':not_characters'> effectively also selects
C<use feature 'unicode_strings'> in its scope; see L<perllocale>.)
Otherwise, Perl uses the platform's native
byte semantics for characters whose code points are less than 256, and
Unicode rules for those greater than 255. That means that the non-ASCII
characters below 256 are undefined except for their
ordinal numbers. This means that none have case (upper and lower), nor is any
a member of character classes, like C<[:alpha:]> or C<\w>. (But all do belong
to the C<\W> class or the Perl regular expression extension C<[:^alpha:]>.)

This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations only if
none of the program's inputs were marked as being a source of Unicode
character data. Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as C<%ENV>),
or from literals and constants in the source text.

The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
Note that this pragma is only required while Perl defaults to byte
semantics; when character semantics become the default, this pragma
may become a no-op. See L<utf8>.
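
The native-semantics rule described above (characters below 256 that are
not ASCII belong to no character class unless Unicode rules are in effect)
can be seen in a minimal sketch. It assumes an ASCII platform and Perl 5.14
or later; the variable names are illustrative only.

```perl
use strict;
use warnings;

# U+00E9 LATIN SMALL LETTER E WITH ACUTE, written as a byte-sized escape so
# that the string is not marked as Unicode character data.
my $e_acute = "\xE9";

my ($native_is_word, $unicode_is_word);
{
    no feature 'unicode_strings';     # native byte semantics below 256
    $native_is_word = ($e_acute =~ /\w/) ? 1 : 0;
}
{
    use feature 'unicode_strings';    # Unicode rules for every code point
    $unicode_is_word = ($e_acute =~ /\w/) ? 1 : 0;
}

print "native: $native_is_word, unicode_strings: $unicode_is_word\n";
```

The same string is classified differently depending only on the lexical
scope in which the pattern is compiled, which is why enabling
C<unicode_strings> is the safer default.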

If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string will have
character semantics. This can cause surprises: see L</BUGS>, below.
You can choose to be warned when this happens. See C<L<encoding::warnings>>.

Under character semantics, many operations that formerly operated on
bytes now operate on characters. A character in Perl is
logically just a number ranging from 0 to 2**31 or so. Larger
characters may encode into longer sequences of bytes internally, but
this internal detail is mostly hidden from Perl code.
See L<perluniintro> for more.

=head2 Effects of Character Semantics

Character semantics have the following effects:

=over 4

=item *

Strings--including hash keys--and regular expression patterns may
contain characters that have an ordinal value larger than 255.

If you use a Unicode editor to edit your program, Unicode characters may
occur directly within the literal strings in UTF-8 encoding, or UTF-16.
(The former requires a C<BOM> or C<use utf8>, the latter requires a C<BOM>.)

Unicode characters can also be added to a string by using the C<\N{U+...}>
notation. The Unicode code for the desired character, in hexadecimal,
should be placed in the braces, after the C<U>. For instance, a smiley face is
C<\N{U+263A}>.

Alternatively, you can use the C<\x{...}> notation for characters C<0x100> and
above. For characters below C<0x100> you may get byte semantics instead of
character semantics; see L</The "Unicode Bug">. On EBCDIC machines there is
the additional problem that the value for such characters gives the EBCDIC
character rather than the Unicode one, thus it is more portable to use
C<\N{U+...}> instead.

Additionally, you can use the C<\N{...}> notation and put the official
Unicode character name within the braces, such as
C<\N{WHITE SMILING FACE}>.
This automatically loads the L<charnames>
module with the C<:full> and C<:short> options. If you prefer different
options for this module, you can instead, before the C<\N{...}>,
explicitly load it with your desired options; for example,

    use charnames ':loose';

=item *

If an appropriate L<encoding> is specified, identifiers within the
Perl script may contain Unicode alphanumeric characters, including
ideographs. Perl does not currently attempt to canonicalize variable
names.

=item *

Regular expressions match characters instead of bytes. C<"."> matches
a character instead of a byte.

=item *

Bracketed character classes in regular expressions match characters instead of
bytes and match against the character properties specified in the
Unicode properties database. C<\w> can be used to match a Japanese
ideograph, for instance.

=item *

Named Unicode properties, scripts, and block ranges may be used (like bracketed
character classes) by using the C<\p{}> "matches property" construct and
the C<\P{}> negation, "doesn't match property".
See L</"Unicode Character Properties"> for more details.

You can define your own character properties and use them
in the regular expression with the C<\p{}> or C<\P{}> construct.
See L</"User-Defined Character Properties"> for more details.

=item *

The special pattern C<\X> matches a logical character, an "extended grapheme
cluster" in Standardese. In Unicode what appears to the user to be a single
character, for example an accented C<G>, may in fact be composed of a sequence
of characters, in this case a C<G> followed by an accent character. C<\X>
will match the entire sequence.

=item *

The C<tr///> operator translates characters instead of bytes. Note
that the C<tr///CU> functionality has been removed. For similar
functionality see C<pack('U0', ...)> and C<pack('C0', ...)>.

=item *

Case translation operators use the Unicode case translation tables
when character input is provided. Note that C<uc()>, or C<\U> in
interpolated strings, translates to uppercase, while C<ucfirst>,
or C<\u> in interpolated strings, translates to titlecase in languages
that make the distinction (which is equivalent to uppercase in languages
without the distinction).

=item *

Most operators that deal with positions or lengths in a string will
automatically switch to using character positions, including
C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
C<sprintf()>, C<write()>, and C<length()>. An operator that
specifically does not switch is C<vec()>. Operators that really don't
care include operators that treat strings as a bucket of bits such as
C<sort()>, and operators dealing with filenames.

=item *

The C<pack()>/C<unpack()> letter C<C> does I<not> change, since it is often
used for byte-oriented formats. Again, think C<char> in the C language.

There is a new C<U> specifier that converts between Unicode characters
and code points. There is also a C<W> specifier that is the equivalent of
C<chr>/C<ord> and properly handles character values even if they are above 255.

=item *

The C<chr()> and C<ord()> functions work on characters, similar to
C<pack("W")> and C<unpack("W")>, I<not> C<pack("C")> and
C<unpack("C")>. C<pack("C")> and C<unpack("C")> are methods for
emulating byte-oriented C<chr()> and C<ord()> on Unicode strings.
While these methods reveal the internal encoding of Unicode strings,
that is not something one normally needs to care about at all.

=item *

The bit string operators, C<& | ^ ~>, can operate on character data.
However, for backward compatibility, such as when using bit string
operations when characters are all less than 256 in ordinal value, one
should not use C<~> (the bit complement) with characters of both
values less than 256 and values greater than 255. Most importantly,
DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>)
will not hold. The reason for this mathematical I<faux pas> is that
the complement cannot return B<both> the 8-bit (byte-wide) bit
complement B<and> the full character-wide bit complement.

=item *

There is a CPAN module, C<L<Unicode::Casing>>, which allows you to define
your own mappings to be used in C<lc()>, C<lcfirst()>, C<uc()>,
C<ucfirst()>, and C<fc> (or their double-quoted string inlined
versions such as C<\U>).
(Prior to Perl 5.16, this functionality was partially provided
in the Perl core, but suffered from a number of insurmountable
drawbacks, so the CPAN module was written instead.)

=back

=over 4

=item *

And finally, C<scalar reverse()> reverses by character rather than by byte.

=back

=head2 Unicode Character Properties

(The only time that Perl considers a sequence of individual code
points as a single logical character is in the C<\X> construct, already
mentioned above. Therefore "character" in this discussion means a single
Unicode code point.)

Very nearly all Unicode character properties are accessible through
regular expressions by using the C<\p{}> "matches property" construct
and the C<\P{}> "doesn't match property" for its negation.

For instance, C<\p{Uppercase}> matches any single character with the Unicode
C<"Uppercase"> property, while C<\p{L}> matches any character with a
C<General_Category> of C<"L"> (letter) property (see
L</General_Category> below). Braces are not
required for single letter property names, so C<\p{L}> is equivalent to C<\pL>.
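
As a minimal sketch of the constructs just described, the following counts
matches of C<\p{Uppercase}>, C<\p{L}>, its brace-less form C<\pL>, and the
negation C<\P{L}> in one short string. The sample text and variable names
are illustrative only.

```perl
use strict;
use warnings;

# \N{U+C9} is LATIN CAPITAL LETTER E WITH ACUTE; it is followed by the
# ASCII text "cole 42".
my $text = "\N{U+C9}cole 42";

my @uppercase     = $text =~ /\p{Uppercase}/g;   # the E-acute only
my @letters       = $text =~ /\p{L}/g;           # E-acute, c, o, l, e
my @letters_short = $text =~ /\pL/g;             # same set, braces omitted
my @non_letters   = $text =~ /\P{L}/g;           # space, 4, 2

printf "uppercase=%d letters=%d short=%d non-letters=%d\n",
    scalar @uppercase, scalar @letters, scalar @letters_short,
    scalar @non_letters;
```

Only counts are printed, so the sketch does not depend on the output
handle having an encoding layer.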

More formally, C<\p{Uppercase}> matches any single character whose Unicode
C<Uppercase> property value is C<True>, and C<\P{Uppercase}> matches any character
whose C<Uppercase> property value is C<False>, and they could have been written as
C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively.

This formality is needed when properties are not binary; that is, if they can
take on more values than just C<True> and C<False>. For example, the
C<Bidi_Class> property (see L</"Bidirectional Character Types"> below),
can take on several different
values, such as C<Left>, C<Right>, C<Whitespace>, and others. To match these, one needs
to specify both the property name (C<Bidi_Class>), AND the value being
matched against
(C<Left>, C<Right>, etc.). This is done, as in the examples above, by having the
two components separated by an equal sign (or interchangeably, a colon), like
C<\p{Bidi_Class: Left}>.

All Unicode-defined character properties may be written in these compound forms
of C<\p{I<property>=I<value>}> or C<\p{I<property>:I<value>}>, but Perl provides some
additional properties that are written only in the single form, as well as
single-form short-cuts for all binary properties and certain others described
below, in which you may omit the property name and the equals or colon
separator.

Most Unicode character properties have at least two synonyms (or aliases if you
prefer): a short one that is easier to type and a longer one that is more
descriptive and hence easier to understand. Thus the C<"L"> and
C<"Letter"> properties above are equivalent and can be used
interchangeably. Likewise, C<"Upper"> is a synonym for C<"Uppercase">,
and we could have written C<\p{Uppercase}> equivalently as C<\p{Upper}>.
Also, there are typically various synonyms for the values the property
can be.
For binary properties, C<"True"> has 3 synonyms: C<"T">,
C<"Yes">, and C<"Y">; and C<"False"> has correspondingly C<"F">,
C<"No">, and C<"N">. But be careful. A short form of a value for one
property may not mean the same thing as the same short form for another.
Thus, for the C<L</General_Category>> property, C<"L"> means
C<"Letter">, but for the L<C<Bidi_Class>|/Bidirectional Character Types>
property, C<"L"> means C<"Left">. A complete list of properties and
synonyms is in L<perluniprops>.

Upper/lower case differences in property names and values are irrelevant;
thus C<\p{Upper}> means the same thing as C<\p{upper}> or even C<\p{UpPeR}>.
Similarly, you can add or subtract underscores anywhere in the middle of a
word, so that these are also equivalent to C<\p{U_p_p_e_r}>. And white space
is irrelevant adjacent to non-word characters, such as the braces and the equals
or colon separators, so C<\p{ Upper }> and C<\p{ Upper_case : Y }> are
equivalent to these as well. In fact, white space and even
hyphens can usually be added or deleted anywhere. So even C<\p{ Up-per case = Yes}> is
equivalent. All this is called "loose-matching" by Unicode. The few places
where stricter matching is used are in the middle of numbers, and in the Perl
extension properties that begin or end with an underscore. Stricter matching
cares about white space (except adjacent to non-word characters),
hyphens, and non-interior underscores.

You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
(C<^>) between the first brace and the property name: C<\p{^Tamil}> is
equal to C<\P{Tamil}>.

Almost all properties are immune to case-insensitive matching. That is,
adding a C</i> regular expression modifier does not change what they
match. There are two sets that are affected.
The first set is
C<Uppercase_Letter>,
C<Lowercase_Letter>,
and C<Titlecase_Letter>,
all of which match C<Cased_Letter> under C</i> matching.
And the second set is
C<Uppercase>,
C<Lowercase>,
and C<Titlecase>,
all of which match C<Cased> under C</i> matching.
This set also includes its subsets C<PosixUpper> and C<PosixLower>, both
of which under C</i> match C<PosixAlpha>.
(The difference between these sets is that some things, such as Roman
numerals, come in both upper and lower case so they are C<Cased>, but aren't considered
letters, so they aren't C<Cased_Letter>s.)

See L</Beyond Unicode code points> for special considerations when
matching Unicode properties against non-Unicode code points.

=head3 B<General_Category>

Every Unicode character is assigned a general category, which is the "most
usual categorization of a character" (from
L<http://www.unicode.org/reports/tr44>).

The compound way of writing these is like C<\p{General_Category=Number}>
(short, C<\p{gc:n}>). But Perl furnishes shortcuts in which everything up
through the equal or colon separator is omitted. So you can instead just write
C<\pN>.

Here are the short and long forms of the values the C<General_Category> property
can have:

    Short       Long

    L           Letter
    LC, L&      Cased_Letter (that is: [\p{Ll}\p{Lu}\p{Lt}])
    Lu          Uppercase_Letter
    Ll          Lowercase_Letter
    Lt          Titlecase_Letter
    Lm          Modifier_Letter
    Lo          Other_Letter

    M           Mark
    Mn          Nonspacing_Mark
    Mc          Spacing_Mark
    Me          Enclosing_Mark

    N           Number
    Nd          Decimal_Number (also Digit)
    Nl          Letter_Number
    No          Other_Number

    P           Punctuation (also Punct)
    Pc          Connector_Punctuation
    Pd          Dash_Punctuation
    Ps          Open_Punctuation
    Pe          Close_Punctuation
    Pi          Initial_Punctuation
                (may behave like Ps or Pe depending on usage)
    Pf          Final_Punctuation
                (may behave like Ps or Pe depending on usage)
    Po          Other_Punctuation

    S           Symbol
    Sm          Math_Symbol
    Sc          Currency_Symbol
    Sk          Modifier_Symbol
    So          Other_Symbol

    Z           Separator
    Zs          Space_Separator
    Zl          Line_Separator
    Zp          Paragraph_Separator

    C           Other
    Cc          Control (also Cntrl)
    Cf          Format
    Cs          Surrogate
    Co          Private_Use
    Cn          Unassigned

Single-letter properties match all characters in any of the
two-letter sub-properties starting with the same letter.
C<LC> and C<L&> are special: both are aliases for the set consisting of
everything matched by C<Ll>, C<Lu>, and C<Lt>.

=head3 B<Bidirectional Character Types>

Because scripts differ in their directionality (Hebrew and Arabic are
written right to left, for example) Unicode supplies a C<Bidi_Class> property.
Some of the values this property can have are:

    Value       Meaning

    L           Left-to-Right
    LRE         Left-to-Right Embedding
    LRO         Left-to-Right Override
    R           Right-to-Left
    AL          Arabic Letter
    RLE         Right-to-Left Embedding
    RLO         Right-to-Left Override
    PDF         Pop Directional Format
    EN          European Number
    ES          European Separator
    ET          European Terminator
    AN          Arabic Number
    CS          Common Separator
    NSM         Non-Spacing Mark
    BN          Boundary Neutral
    B           Paragraph Separator
    S           Segment Separator
    WS          Whitespace
    ON          Other Neutrals

This property is always written in the compound form.
For example, C<\p{Bidi_Class:R}> matches characters that are normally
written right to left. Unlike the
C<L</General_Category>> property, this
property can have more values added in a future Unicode release. Those
listed above comprised the complete set for many Unicode releases, but
others were added in Unicode 6.3; you can always find what the
current ones are in L<perluniprops>. And
L<http://www.unicode.org/reports/tr9/> describes how to use them.

=head3 B<Scripts>

The world's languages are written in many different scripts. This sentence
(unless you're reading it in translation) is written in Latin, while Russian is
written in Cyrillic, and Greek is written in, well, Greek; Japanese mainly in
Hiragana or Katakana. There are many more.

The Unicode Script and Script_Extensions properties give what script a
given character is in. Either property can be specified with the
compound form like
C<\p{Script=Hebrew}> (short: C<\p{sc=hebr}>), or
C<\p{Script_Extensions=Javanese}> (short: C<\p{scx=java}>).
In addition, Perl furnishes shortcuts for all
C<Script> property names. You can omit everything up through the equals
(or colon), and simply write C<\p{Latin}> or C<\P{Cyrillic}>.
(This is not true for C<Script_Extensions>, which is required to be
written in the compound form.)
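
The compound, short, and single forms for C<Script> can be compared in a
minimal sketch. The sample characters and variable names are illustrative
only.

```perl
use strict;
use warnings;

my $alef = "\N{U+05D0}";    # HEBREW LETTER ALEF
my $zhe  = "\N{U+0416}";    # CYRILLIC CAPITAL LETTER ZHE

# Three spellings of the same test against the Hebrew script.
my $compound = ($alef =~ /\p{Script=Hebrew}/) ? 1 : 0;
my $short    = ($alef =~ /\p{sc=hebr}/)       ? 1 : 0;
my $single   = ($alef =~ /\p{Hebrew}/)        ? 1 : 0;

# \P{} negates: a Cyrillic letter does not match "not Cyrillic".
my $zhe_not_cyrillic = ($zhe =~ /\P{Cyrillic}/) ? 1 : 0;

# A letter from one script is not in another.
my $alef_is_latin = ($alef =~ /\p{Latin}/) ? 1 : 0;

print "compound=$compound short=$short single=$single ",
      "zhe_not_cyrillic=$zhe_not_cyrillic alef_is_latin=$alef_is_latin\n";
```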

The difference between these two properties involves characters that are
used in multiple scripts. For example the digits '0' through '9' are
used in many parts of the world. These are placed in a script named
C<Common>. Other characters are used in just a few scripts. For
example, the C<"KATAKANA-HIRAGANA DOUBLE HYPHEN"> is used in both Japanese
scripts, Katakana and Hiragana, but nowhere else. The C<Script>
property places all characters that are used in multiple scripts in the
C<Common> script, while the C<Script_Extensions> property places those
that are used in only a few scripts into each of those scripts; while
still using C<Common> for those used in many scripts. Thus both these
match:

    "0" =~ /\p{sc=Common}/     # Matches
    "0" =~ /\p{scx=Common}/    # Matches

and only the first of these matches:

    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Common}/  # Matches
    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Common}/ # No match

And only the last two of these match:

    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Hiragana}/  # No match
    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Katakana}/  # No match
    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Hiragana}/ # Matches
    "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Katakana}/ # Matches

C<Script_Extensions> is thus an improved C<Script>, in which there are
fewer characters in the C<Common> script, and correspondingly more in
other scripts. It is new in Unicode version 6.0, and its data are likely
to change significantly in later releases, as things get sorted out.

(Actually, besides C<Common>, the C<Inherited> script contains
characters that are used in multiple scripts. These are modifier
characters which modify other characters, and inherit the script value
of the controlling character.
Some of these are used in many scripts,
and so go into C<Inherited> in both C<Script> and C<Script_Extensions>.
Others are used in just a few scripts, so are in C<Inherited> in
C<Script>, but not in C<Script_Extensions>.)

It is worth stressing that there are several different sets of digits in
Unicode that are equivalent to 0-9 and are matchable by C<\d> in a
regular expression. If they are used in a single language only, they
are in that language's C<Script> and C<Script_Extensions>. If they are
used in more than one script, they will be in C<sc=Common>, but only
if they are used in many scripts should they be in C<scx=Common>.

A complete list of scripts and their shortcuts is in L<perluniprops>.

=head3 B<Use of the C<"Is"> Prefix>

For backward compatibility (with Perl 5.6), all properties mentioned
so far may have C<Is> or C<Is_> prepended to their name, so C<\P{Is_Lu}>, for
example, is equal to C<\P{Lu}>, and C<\p{IsScript:Arabic}> is equal to
C<\p{Arabic}>.

=head3 B<Blocks>

In addition to B<scripts>, Unicode also defines B<blocks> of
characters. The difference between scripts and blocks is that the
concept of scripts is closer to natural languages, while the concept
of blocks is more of an artificial grouping based on groups of Unicode
characters with consecutive ordinal values. For example, the C<"Basic Latin">
block is all characters whose ordinals are between 0 and 127, inclusive; in
other words, the ASCII characters. The C<"Latin"> script contains some letters
from this as well as several other blocks, like C<"Latin-1 Supplement">,
C<"Latin Extended-A">, etc., but it does not contain all the characters from
those blocks. It does not, for example, contain the digits 0-9, because
those digits are shared across many scripts, and hence are in the
C<Common> script.
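
The script/block distinction above can be checked with a minimal sketch
that tests an ASCII digit and a Latin-1 letter against both kinds of
property. The block name C<Latin_1_Supplement> is the spelling assumed here
for the C<"Latin-1 Supplement"> block; the variable names are illustrative.

```perl
use strict;
use warnings;

my $digit   = "5";
my $e_acute = "\N{U+E9}";    # LATIN SMALL LETTER E WITH ACUTE

# The digit lies in the Basic Latin block but belongs to the Common script.
my $digit_in_block  = ($digit =~ /\p{Block=Basic_Latin}/) ? 1 : 0;
my $digit_is_latin  = ($digit =~ /\p{Script=Latin}/)      ? 1 : 0;
my $digit_is_common = ($digit =~ /\p{Script=Common}/)     ? 1 : 0;

# The accented letter is in the Latin script but outside Basic Latin.
my $e_is_latin     = ($e_acute =~ /\p{Script=Latin}/)            ? 1 : 0;
my $e_in_basic     = ($e_acute =~ /\p{Block=Basic_Latin}/)       ? 1 : 0;
my $e_in_latin1sup = ($e_acute =~ /\p{Block=Latin_1_Supplement}/) ? 1 : 0;

print "digit: block=$digit_in_block latin=$digit_is_latin ",
      "common=$digit_is_common\n";
print "e-acute: latin=$e_is_latin basic=$e_in_basic ",
      "latin1sup=$e_in_latin1sup\n";
```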

For more about scripts versus blocks, see UAX#24 "Unicode Script Property":
L<http://www.unicode.org/reports/tr24>

The C<Script> or C<Script_Extensions> properties are likely to be the
ones you want to use when processing
natural language; the C<Block> property may occasionally be useful in working
with the nuts and bolts of Unicode.

Block names are matched in the compound form, like C<\p{Block: Arrows}> or
C<\p{Blk=Hebrew}>. Unlike most other properties, only a few block names have a
Unicode-defined short name. But Perl does provide a (slight) shortcut: You
can say, for example C<\p{In_Arrows}> or C<\p{In_Hebrew}>. For backwards
compatibility, the C<In> prefix may be omitted if there is no naming conflict
with a script or any other property, and you can even use an C<Is> prefix
instead in those cases. But it is not a good idea to do this, for a couple
of reasons:

=over 4

=item 1

It is confusing. There are many naming conflicts, and you may forget some.
For example, C<\p{Hebrew}> means the I<script> Hebrew, and NOT the I<block>
Hebrew. But would you remember that 6 months from now?

=item 2

It is unstable. A new version of Unicode may preempt the current meaning by
creating a property with the same name. There was a time in very early Unicode
releases when C<\p{Hebrew}> would have matched the I<block> Hebrew; now it
doesn't.

=back

Some people prefer to always use C<\p{Block: foo}> and C<\p{Script: bar}>
instead of the shortcuts, whether for clarity, because they can't remember the
difference between 'In' and 'Is' anyway, or they aren't confident that those who
eventually will read their code will know that difference.

A complete list of blocks and their shortcuts is in L<perluniprops>.

=head3 B<Other Properties>

There are many more properties than the very basic ones described here.
A complete list is in L<perluniprops>.

Unicode defines all its properties in the compound form, so all single-form
properties are Perl extensions. Most of these are just synonyms for the
Unicode ones, but some are genuine extensions, including several that are in
the compound form. And quite a few of these are actually recommended by Unicode
(in L<http://www.unicode.org/reports/tr18>).

This section gives some details on all extensions that aren't just
synonyms for compound-form Unicode properties
(for those properties, you'll have to refer to the
L<Unicode Standard|http://www.unicode.org/reports/tr44>).

=over

=item B<C<\p{All}>>

This matches every possible code point. It is equivalent to C<qr/./s>.
Unlike all the other non-user-defined C<\p{}> property matches, no
warning is ever generated if this property is matched against a
non-Unicode code point (see L</Beyond Unicode code points> below).

=item B<C<\p{Alnum}>>

This matches any C<\p{Alphabetic}> or C<\p{Decimal_Number}> character.

=item B<C<\p{Any}>>

This matches any of the 1_114_112 Unicode code points. It is a synonym
for C<\p{Unicode}>.

=item B<C<\p{ASCII}>>

This matches any of the 128 characters in the US-ASCII character set,
which is a subset of Unicode.

=item B<C<\p{Assigned}>>

This matches any assigned code point; that is, any code point whose L<general
category|/General_Category> is not C<Unassigned> (or equivalently, not C<Cn>).

=item B<C<\p{Blank}>>

This is the same as C<\h> and C<\p{HorizSpace}>: A character that changes the
spacing horizontally.

=item B<C<\p{Decomposition_Type: Non_Canonical}>> (Short: C<\p{Dt=NonCanon}>)

Matches a character that has a non-canonical decomposition.

To understand the use of this rarely used I<property=value> combination, it is
necessary to know some basics about decomposition.
Consider a character, say H.
It could appear with various marks around it,
such as an acute accent, or a circumflex, or various hooks, circles, arrows,
I<etc.>, above, below, to one side or the other, etc. There are many
possibilities among the world's languages. The number of combinations is
astronomical, and if there were a character for each combination, it would
soon exhaust Unicode's more than a million possible characters. So Unicode
took a different approach: there is a character for the base H, and a
character for each of the possible marks, and these can be variously combined
to get a final logical character. So a logical character--what appears to be a
single character--can be a sequence of more than one individual character.
This is called an "extended grapheme cluster"; Perl furnishes the C<\X>
regular expression construct to match such sequences.

But Unicode's intent is to unify the existing character set standards and
practices, and several pre-existing standards have single characters that
mean the same thing as some of these combinations. An example is ISO-8859-1,
which has quite a few of these in the Latin-1 range, an example being C<"LATIN
CAPITAL LETTER E WITH ACUTE">. Because this character was in this pre-existing
standard, Unicode added it to its repertoire. But this character is considered
by Unicode to be equivalent to the sequence consisting of the character
C<"LATIN CAPITAL LETTER E"> followed by the character C<"COMBINING ACUTE ACCENT">.

C<"LATIN CAPITAL LETTER E WITH ACUTE"> is called a "pre-composed" character, and
its equivalence with the sequence is called canonical equivalence. All
pre-composed characters are said to have a decomposition (into the equivalent
sequence), and the decomposition type is also called canonical.

However, many more characters have a different type of decomposition, a
"compatible" or "non-canonical" decomposition.
The sequences that form these
decompositions are not considered canonically equivalent to the pre-composed
character. An example, again in the Latin-1 range, is the C<"SUPERSCRIPT ONE">.
It is somewhat like a regular digit 1, but not exactly; its decomposition
into the digit 1 is called a "compatible" decomposition, specifically a
"super" decomposition. There are several such compatibility
decompositions (see L<http://www.unicode.org/reports/tr44>), including one
called "compat", which means some miscellaneous type of decomposition
that doesn't fit into the decomposition categories that Unicode has chosen.

Note that most Unicode characters don't have a decomposition, so their
decomposition type is C<"None">.

For your convenience, Perl has added the C<Non_Canonical> decomposition
type to mean any of the several compatibility decompositions.

=item B<C<\p{Graph}>>

Matches any character that is graphic. Theoretically, this means a character
that on a printer would cause ink to be used.

=item B<C<\p{HorizSpace}>>

This is the same as C<\h> and C<\p{Blank}>: a character that changes the
spacing horizontally.

=item B<C<\p{In=*}>>

This is a synonym for C<\p{Present_In=*}>.

=item B<C<\p{PerlSpace}>>

This is the same as C<\s>, restricted to ASCII, namely C<S<[ \f\n\r\t]>>
and starting in Perl v5.18, experimentally, a vertical tab.

Mnemonic: Perl's (original) space.

=item B<C<\p{PerlWord}>>

This is the same as C<\w>, restricted to ASCII, namely C<[A-Za-z0-9_]>.

Mnemonic: Perl's (original) word.

=item B<C<\p{Posix...}>>

There are several of these, which are equivalents using the C<\p{}>
notation for Posix classes and are described in
L<perlrecharclass/POSIX Character Classes>.

=item B<C<\p{Present_In: *}>>    (Short: C<\p{In=*}>)

This property is used when you need to know in which Unicode version(s) a
character is present.

The "*" above stands for some two-digit Unicode version number, such as
C<1.1> or C<4.0>; or the "*" can also be C<Unassigned>.  This property will
match the code points whose final disposition has been settled as of the
Unicode release given by the version number; C<\p{Present_In: Unassigned}>
will match those code points whose meaning has yet to be assigned.

For example, C<U+0041> C<"LATIN CAPITAL LETTER A"> was present in the very first
Unicode release available, which is C<1.1>, so this property is true for all
valid "*" versions.  On the other hand, C<U+1EFF> was not assigned until version
5.1 when it became C<"LATIN SMALL LETTER Y WITH LOOP">, so the only "*" values
that match it are 5.1, 5.2, and later.

Unicode furnishes the C<Age> property from which this is derived.  The problem
with Age is that a strict interpretation of it (which Perl takes) has it
matching the precise release a code point's meaning is introduced in.  Thus
C<U+0041> would match only 1.1; and C<U+1EFF> only 5.1.  This is not usually what
you want.

Some non-Perl implementations of the Age property may change its meaning to be
the same as the Perl C<Present_In> property; just be aware of that.

Another confusion with both these properties is that the definition is not
that the code point has been I<assigned>, but that the meaning of the code point
has been I<determined>.  This is because 66 code points will always be
unassigned, and so the C<Age> for them is the Unicode version in which the decision
to make them so was made.  For example, C<U+FDD0> is to be permanently
unassigned to a character, and the decision to do that was made in version 3.1,
so C<\p{Age=3.1}> matches this character, as also does C<\p{Present_In: 3.1}> and up.
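
The difference between C<Present_In> and C<Age> described above can be
sketched as follows.  This is only an illustration: the variable names
and the C<%result> hash are made up for the example, and it assumes a
Perl whose Unicode tables are at version 5.2 or later.

    use strict;
    use warnings;

    # Illustrative only: Present_In is cumulative, Age is exact.
    my $cap_a  = "\x{0041}";   # LATIN CAPITAL LETTER A, present since 1.1
    my $y_loop = "\x{1EFF}";   # LATIN SMALL LETTER Y WITH LOOP, added in 5.1

    my %result = (
        a_in_51  => ($cap_a  =~ /\p{Present_In=5.1}/ ? 1 : 0),
        a_age_51 => ($cap_a  =~ /\p{Age=5.1}/        ? 1 : 0),
        y_in_50  => ($y_loop =~ /\p{Present_In=5.0}/ ? 1 : 0),
        y_in_52  => ($y_loop =~ /\p{Present_In=5.2}/ ? 1 : 0),
        y_age_51 => ($y_loop =~ /\p{Age=5.1}/        ? 1 : 0),
        y_age_52 => ($y_loop =~ /\p{Age=5.2}/        ? 1 : 0),
    );

    printf "%-9s %d\n", $_, $result{$_} for sort keys %result;

Going by the description above, the C<Present_In> tests should succeed
for every version at or after the one that introduced the code point,
while the C<Age> tests should succeed only for that exact version.
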

=item B<C<\p{Print}>>

This matches any character that is graphical or blank, except controls.

=item B<C<\p{SpacePerl}>>

This is the same as C<\s>, including beyond ASCII.

Mnemonic: Space, as modified by Perl.  (It doesn't include the vertical tab
which both the Posix standard and Unicode consider white space.)

=item B<C<\p{Title}>> and  B<C<\p{Titlecase}>>

Under case-sensitive matching, these both match the same code points as
C<\p{General Category=Titlecase_Letter}> (C<\p{gc=lt}>).  The difference
is that under C</i> caseless matching, these match the same as
C<\p{Cased}>, whereas C<\p{gc=lt}> matches C<\p{Cased_Letter}>.

=item B<C<\p{Unicode}>>

This matches any of the 1_114_112 Unicode code points.
It is the same as C<\p{Any}>.

=item B<C<\p{VertSpace}>>

This is the same as C<\v>:  A character that changes the spacing vertically.

=item B<C<\p{Word}>>

This is the same as C<\w>, including over 100_000 characters beyond ASCII.

=item B<C<\p{XPosix...}>>

There are several of these, which are the standard Posix classes
extended to the full Unicode range.  They are described in
L<perlrecharclass/POSIX Character Classes>.

=back


=head2 User-Defined Character Properties

You can define your own binary character properties by defining subroutines
whose names begin with C<"In"> or C<"Is">.  (The experimental feature
L<perlre/(?[ ])> provides an alternative which allows more complex
definitions.)  The subroutines can be defined in any
package.  The user-defined properties can be used in the regular expression
C<\p{}> and C<\P{}> constructs; if you are using a user-defined property from a
package other than the one you are in, you must specify its package in the
C<\p{}> or C<\P{}> construct.

    # assuming property IsForeign defined in Lang::
    package main;  # property package name required
    if ($txt =~ /\p{Lang::IsForeign}+/) { ... }

    package Lang;  # property package name not required
    if ($txt =~ /\p{IsForeign}+/) { ... }


Note that the effect is compile-time and immutable once defined.
However, the subroutines are passed a single parameter, which is 0 if
case-sensitive matching is in effect and non-zero if caseless matching
is in effect.  The subroutine may return different values depending on
the value of the flag, and one set of values will immutably be in effect
for all case-sensitive matches, and the other set for all case-insensitive
matches.

Note that if the regular expression is tainted, then Perl will die rather
than calling the subroutine when the name of the subroutine is
determined by the tainted data.

The subroutines must return a specially-formatted string, with one
or more newline-separated lines.  Each line must be one of the following:

=over 4

=item *

A single hexadecimal number denoting a code point to include.

=item *

Two hexadecimal numbers separated by horizontal whitespace (space or
tab characters) denoting a range of code points to include.

=item *

Something to include, prefixed by C<"+">: a built-in character
property (prefixed by C<"utf8::">) or a fully qualified (including package
name) user-defined character property,
to represent all the characters in that property; two hexadecimal code
points for a range; or a single hexadecimal code point.

=item *

Something to exclude, prefixed by C<"-">: an existing character
property (prefixed by C<"utf8::">) or a fully qualified (including package
name) user-defined character property,
to represent all the characters in that property; two hexadecimal code
points for a range; or a single hexadecimal code point.

=item *

Something to negate, prefixed by C<"!">: an existing character
property (prefixed by C<"utf8::">) or a fully qualified (including package
name) user-defined character property,
to represent all the characters except those in that property; two
hexadecimal code points for a range; or a single hexadecimal code point.

=item *

Something to intersect with, prefixed by C<"&">: an existing character
property (prefixed by C<"utf8::">) or a fully qualified (including package
name) user-defined character property,
to represent all the characters in that property; two hexadecimal code
points for a range; or a single hexadecimal code point.

=back

For example, to define a property that covers both the Japanese
syllabaries (hiragana and katakana), you can define

    sub InKana {
        return <<END;
    3040\t309F
    30A0\t30FF
    END
    }

Imagine that the here-doc end marker is at the beginning of the line.
Now you can use C<\p{InKana}> and C<\P{InKana}>.

You could also have used the existing block property names:

    sub InKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    END
    }

Suppose you wanted to match only the allocated characters,
not the raw block ranges: in other words, you want to remove
the unassigned code points:

    sub InKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    -utf8::IsCn
    END
    }

The negation is useful for defining (surprise!) negated classes.

    sub InNotKana {
        return <<'END';
    !utf8::InHiragana
    -utf8::InKatakana
    +utf8::IsCn
    END
    }

This will match all non-Unicode code points, since every one of them is
not in Kana.
You can use intersection to exclude these, if desired, as
this modified example shows:

    sub InNotKana {
        return <<'END';
    !utf8::InHiragana
    -utf8::InKatakana
    +utf8::IsCn
    &utf8::Any
    END
    }

C<&utf8::Any> must be the last line in the definition.

Intersection is used generally for getting the common characters matched
by two (or more) classes.  It's important to remember not to use C<"&"> for
the first set; that would be intersecting with nothing, resulting in an
empty set.

Unlike non-user-defined C<\p{}> property matches, no warning is ever
generated if these properties are matched against a non-Unicode code
point (see L</Beyond Unicode code points> below).

=head2 User-Defined Case Mappings (for serious hackers only)

B<This feature has been removed as of Perl 5.16.>
The CPAN module C<L<Unicode::Casing>> provides better functionality without
the drawbacks that this feature had.  If you are using a Perl earlier
than 5.16, this feature was most fully documented in the 5.14 version of
this pod:
L<http://perldoc.perl.org/5.14.0/perlunicode.html#User-Defined-Case-Mappings-%28for-serious-hackers-only%29>

=head2 Character Encodings for Input and Output

See L<Encode>.

=head2 Unicode Regular Expression Support Level

The following list of Unicode supported features for regular expressions describes
all features currently directly supported by core Perl.  The references to "Level N"
and the section numbers refer to the Unicode Technical Standard #18,
"Unicode Regular Expressions", version 13, from August 2008.

=over 4

=item *

Level 1 - Basic Unicode Support

 RL1.1   Hex Notation                     - done          [1]
 RL1.2   Properties                       - done          [2][3]
 RL1.2a  Compatibility Properties         - done          [4]
 RL1.3   Subtraction and Intersection     - experimental  [5]
 RL1.4   Simple Word Boundaries           - done          [6]
 RL1.5   Simple Loose Matches             - done          [7]
 RL1.6   Line Boundaries                  - MISSING       [8][9]
 RL1.7   Supplementary Code Points        - done          [10]

=over 4

=item [1]

C<\x{...}>

=item [2]

C<\p{...}> C<\P{...}>

=item [3]

supports not only minimal list, but all Unicode character properties (see Unicode Character Properties above)

=item [4]

C<\d> C<\D> C<\s> C<\S> C<\w> C<\W> C<\X> C<[:I<prop>:]> C<[:^I<prop>:]>

=item [5]

The experimental feature in v5.18 C<"(?[...])"> accomplishes this.  See
L<perlre/(?[ ])>.  If you don't want to use an experimental feature,
you can use one of the following:

=over 4

=item * Regular expression look-ahead

You can mimic class subtraction using lookahead.
For example, what UTS#18 might write as

    [{Block=Greek}-[{UNASSIGNED}]]

in Perl can be written as:

    (?!\p{Unassigned})\p{Block=Greek}
    (?=\p{Assigned})\p{Block=Greek}

But in this particular example, you probably really want

    \p{Greek}

which will match assigned characters known to be part of the Greek script.

=item * CPAN module C<L<Unicode::Regex::Set>>

It does implement the full UTS#18 grouping, intersection, union, and
removal (subtraction) syntax.

=item * L</"User-Defined Character Properties">

C<"+"> for union, C<"-"> for removal (set-difference), C<"&"> for intersection

=back

=item [6]

C<\b> C<\B>

=item [7]

Note that Perl does Full case-folding in matching (but with bugs), not
Simple: for example C<U+1F88> is equivalent to C<U+1F00 U+03B9>, instead of
just C<U+1F80>.  This difference matters mainly for certain Greek capital
letters with certain modifiers: the Full case-folding decomposes the
letter, while the Simple case-folding would map it to a single
character.

=item [8]

Should do C<^> and C<$> also on C<U+000B> (C<\v> in C), C<FF> (C<\f>),
C<CR> (C<\r>), C<CRLF> (C<\r\n>), C<NEL> (C<U+0085>), C<LS> (C<U+2028>),
and C<PS> (C<U+2029>); should also affect C<E<lt>E<gt>>, C<$.>, and
script line numbers; should not split lines within C<CRLF> (i.e. there
is no empty line between C<\r> and C<\n>).  For C<CRLF>, try the
C<:crlf> layer (see L<PerlIO>).

=item [9]

Linebreaking conformant with L<UAX#14 "Unicode Line Breaking
Algorithm"|http://www.unicode.org/reports/tr14>
is available through the C<L<Unicode::LineBreak>> module.

=item [10]

UTF-8/UTF-EBCDIC used in Perl allows not only C<U+10000> to
C<U+10FFFF> but also beyond C<U+10FFFF>

=back

=item *

Level 2 - Extended Unicode Support

 RL2.1   Canonical Equivalents           - MISSING  [10][11]
 RL2.2   Default Grapheme Clusters       - MISSING  [12]
 RL2.3   Default Word Boundaries         - MISSING  [14]
 RL2.4   Default Loose Matches           - MISSING  [15]
 RL2.5   Name Properties                 - DONE
 RL2.6   Wildcard Properties             - MISSING

 [10] see UAX#15 "Unicode Normalization Forms"
 [11] have Unicode::Normalize but not integrated to regexes
 [12] have \X but we don't have a "Grapheme Cluster Mode"
 [14] see UAX#29, Word Boundaries
 [15] This is covered in Chapter 3.13 (in Unicode 6.0)

=item *

Level 3 - Tailored Support

 RL3.1   Tailored Punctuation            - MISSING
 RL3.2   Tailored Grapheme Clusters      - MISSING  [17][18]
 RL3.3   Tailored Word Boundaries        - MISSING
 RL3.4   Tailored Loose Matches          - MISSING
 RL3.5   Tailored Ranges                 - MISSING
 RL3.6   Context Matching                - MISSING  [19]
 RL3.7   Incremental Matches             - MISSING
      ( RL3.8   Unicode Set Sharing )
 RL3.9   Possible Match Sets             - MISSING
 RL3.10  Folded Matching                 - MISSING  [20]
 RL3.11  Submatchers                     - MISSING

 [17] see UTS#10 "Unicode Collation Algorithm"
 [18] have Unicode::Collate but not integrated to regexes
 [19] have (?<=x) and (?=x), but look-aheads or look-behinds
      should see outside of the target substring
 [20] need insensitive matching for linguistic features other
      than case; for example, hiragana to katakana, wide and
      narrow, simplified Han to traditional Han (see UTR#30
      "Character Foldings")

=back

=head2 Unicode Encodings

Unicode characters are assigned to I<code points>, which are abstract
numbers.  To use these numbers, various encodings are needed.
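
To make the distinction between a code point and its encodings concrete,
here is a small sketch using the standard L<Encode> module.  The choice
of C<U+263A> and the variable names are arbitrary; the point is only
that one abstract number yields different byte sequences under different
encodings.

    use strict;
    use warnings;
    use Encode qw(encode);

    my $char = "\x{263A}";    # one code point: WHITE SMILING FACE

    # Map each encoding name to the hex bytes it produces for $char.
    my %hex;
    for my $enc (qw(UTF-8 UTF-16BE UTF-16LE UTF-32BE)) {
        my $bytes = encode($enc, $char);
        $hex{$enc} = join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
        printf "%-8s %s\n", $enc, $hex{$enc};
    }

The same character occupies three bytes in UTF-8, two in either UTF-16
byte order, and four in UTF-32.
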

=over 4

=item *

UTF-8

UTF-8 is a variable-length (1 to 4 bytes), byte-order independent
encoding.  For ASCII (and we really do mean 7-bit ASCII, not another
8-bit encoding), UTF-8 is transparent.

The following table is from Unicode 3.2.

 Code Points            1st Byte  2nd Byte  3rd Byte 4th Byte

   U+0000..U+007F       00..7F
   U+0080..U+07FF     * C2..DF    80..BF
   U+0800..U+0FFF       E0      * A0..BF    80..BF
   U+1000..U+CFFF       E1..EC    80..BF    80..BF
   U+D000..U+D7FF       ED        80..9F    80..BF
   U+D800..U+DFFF       +++++ utf16 surrogates, not legal utf8 +++++
   U+E000..U+FFFF       EE..EF    80..BF    80..BF
  U+10000..U+3FFFF      F0      * 90..BF    80..BF    80..BF
  U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
 U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF

Note the gaps marked by "*" before several of the byte entries above.  These are
caused by legal UTF-8 avoiding non-shortest encodings: it is technically
possible to UTF-8-encode a single code point in different ways, but that is
explicitly forbidden, and the shortest possible encoding should always be used
(and that is what Perl does).

Another way to look at it is via bits:

                Code Points  1st Byte  2nd Byte  3rd Byte  4th Byte

                   0aaaaaaa  0aaaaaaa
           00000bbbbbaaaaaa  110bbbbb  10aaaaaa
           ccccbbbbbbaaaaaa  1110cccc  10bbbbbb  10aaaaaa
 00000dddccccccbbbbbbaaaaaa  11110ddd  10cccccc  10bbbbbb  10aaaaaa

As you can see, the continuation bytes all begin with C<"10">, and the
leading bits of the start byte tell how many bytes there are in the
encoded character.

The original UTF-8 specification allowed up to 6 bytes, to allow
encoding of numbers up to C<0x7FFF_FFFF>.  Perl continues to allow those,
and has extended that up to 13 bytes to encode code points up to what
can fit in a 64-bit word.
However, Perl will warn if you output any of
these as being non-portable; and under strict UTF-8 input protocols,
they are forbidden.

The Unicode non-character code points are also disallowed in UTF-8 in
"open interchange".  See L</Non-character code points>.

=item *

UTF-EBCDIC

Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.

=item *

UTF-16, UTF-16BE, UTF-16LE, Surrogates, and C<BOM>s (Byte Order Marks)

The following items are mostly for reference and general Unicode
knowledge; Perl doesn't use these constructs internally.

Like UTF-8, UTF-16 is a variable-width encoding, but where
UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units.
All code points occupy either 2 or 4 bytes in UTF-16: code points
C<U+0000..U+FFFF> are stored in a single 16-bit unit, and code
points C<U+10000..U+10FFFF> in two 16-bit units.  The latter case is
using I<surrogates>, the first 16-bit unit being the I<high
surrogate>, and the second being the I<low surrogate>.

Surrogates are code points set aside to encode the C<U+10000..U+10FFFF>
range of Unicode code points in pairs of 16-bit units.  The I<high
surrogates> are the range C<U+D800..U+DBFF> and the I<low surrogates>
are the range C<U+DC00..U+DFFF>.  The surrogate encoding is

    # int() is needed because Perl's "/" is not integer division
    $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
    $lo = ($uni - 0x10000) % 0x400 + 0xDC00;

and the decoding is

    $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);

Because of the 16-bitness, UTF-16 is byte-order dependent.  UTF-16
itself can be used for in-memory computations, but if storage or
transfer is required either UTF-16BE (big-endian) or UTF-16LE
(little-endian) encodings must be chosen.

This introduces another problem: what if you just know that your data
is UTF-16, but you don't know which endianness?
Byte Order Marks, or
C<BOM>s, are a solution to this.  A special character has been reserved
in Unicode to function as a byte order marker: the character with the
code point C<U+FEFF> is the C<BOM>.

The trick is that if you read a C<BOM>, you will know the byte order,
since if it was written on a big-endian platform, you will read the
bytes C<0xFE 0xFF>, but if it was written on a little-endian platform,
you will read the bytes C<0xFF 0xFE>.  (And if the originating platform
was writing in UTF-8, you will read the bytes C<0xEF 0xBB 0xBF>.)

The way this trick works is that the character with the code point
C<U+FFFE> is not supposed to be in input streams, so the
sequence of bytes C<0xFF 0xFE> is unambiguously "C<BOM>, represented in
little-endian format" and cannot be "C<U+FFFE>, represented in big-endian
format".

Surrogates have no meaning in Unicode outside their use in pairs to
represent other code points.  However, Perl allows them to be
represented individually internally, for example by saying
C<chr(0xD801)>, so that all code points, not just those valid for open
interchange, are
representable.  Unicode does define semantics for them, such as their
C<L</General_Category>> being C<"Cs">.  But because their use is somewhat dangerous,
Perl will warn (using the warning category C<"surrogate">, which is a
sub-category of C<"utf8">) if an attempt is made
to do things like take the lower case of one, or match
case-insensitively, or to output them.  (But don't try this on Perls
before 5.14.)

=item *

UTF-32, UTF-32BE, UTF-32LE

The UTF-32 family is pretty much like the UTF-16 family, except that
the units are 32-bit, and therefore the surrogate scheme is not
needed.  UTF-32 is a fixed-width encoding.  The C<BOM> signatures are
C<0x00 0x00 0xFE 0xFF> for BE and C<0xFF 0xFE 0x00 0x00> for LE.

=item *

UCS-2, UCS-4

Legacy, fixed-width encodings defined by the ISO 10646 standard.  UCS-2 is a 16-bit
encoding.  Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>,
because it does not use surrogates.  UCS-4 is a 32-bit encoding,
functionally identical to UTF-32 (the difference being that
UCS-4 forbids neither surrogates nor code points larger than C<0x10_FFFF>).

=item *

UTF-7

A seven-bit safe (non-eight-bit) encoding, which is useful if the
transport or storage is not eight-bit safe.  Defined by RFC 2152.

=back

=head2 Non-character code points

66 code points are set aside in Unicode as "non-character code points".
These all have the C<Unassigned> (C<Cn>) C<L</General_Category>>, and
they never will
be assigned.  These are never supposed to be in legal Unicode input
streams, so that code can use them as sentinels that can be mixed in
with character data, and they always will be distinguishable from that data.
To keep them out of Perl input streams, strict UTF-8 should be
specified, such as by using the layer C<:encoding('UTF-8')>.  The
non-character code points are the 32 between C<U+FDD0> and C<U+FDEF>, and the
34 code points C<U+FFFE>, C<U+FFFF>, C<U+1FFFE>, C<U+1FFFF>, ... C<U+10FFFE>, C<U+10FFFF>.
Some people are under the mistaken impression that these are "illegal",
but that is not true.  An application or cooperating set of applications
can legally use them at will internally; but these code points are
"illegal for open interchange".  Therefore, Perl will not accept these
from input streams unless lax rules are being used, and will warn
(using the warning category C<"nonchar">, which is a sub-category of C<"utf8">) if
an attempt is made to output them.
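
The two ranges just listed can be checked arithmetically.  The following
is a minimal sketch; C<is_nonchar()> is a hypothetical helper written for
this example, not a function that Perl provides.

    use strict;
    use warnings;

    # Hypothetical helper: true if $cp is one of the non-character
    # code points described above.
    sub is_nonchar {
        my ($cp) = @_;
        return 1 if $cp >= 0xFDD0 && $cp <= 0xFDEF;    # the block of 32
        return 1 if $cp <= 0x10FFFF
                 && ($cp & 0xFFFE) == 0xFFFE;          # last two of each plane
        return 0;
    }

    # Count them across the whole Unicode range.
    my $count = 0;
    $count += is_nonchar($_) for 0 .. 0x10FFFF;
    print "non-character code points: $count\n";

The count comes out to the 66 mentioned above: 32 in the contiguous
block plus the final two code points of each of the 17 planes.
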

=head2 Beyond Unicode code points

The maximum Unicode code point is C<U+10FFFF>, and Unicode only defines
operations on code points up through that.  But Perl works on code
points up to the maximum permissible unsigned number available on the
platform.  However, Perl will not accept these from input streams unless
lax rules are being used, and will warn (using the warning category
C<"non_unicode">, which is a sub-category of C<"utf8">) if any are output.

Since Unicode rules are not defined on these code points, if a
Unicode-defined operation is done on them, Perl uses what we believe are
sensible rules, while generally warning, using the C<"non_unicode">
category.  For example, C<uc("\x{11_0000}")> will generate such a
warning, returning the input parameter as its result, since Perl defines
the uppercase of every non-Unicode code point to be the code point
itself.  In fact, all the case changing operations, not just
uppercasing, work this way.

The situation with matching Unicode properties in regular expressions,
the C<\p{}> and C<\P{}> constructs, against these code points is not as
clear cut, and how these are handled has changed as we've gained
experience.

One possibility is to treat any match against these code points as
undefined.  But since Perl doesn't have the concept of a match being
undefined, it converts this to failing or C<FALSE>.  This is almost, but
not quite, what Perl did from v5.14 (when use of these code points
became generally reliable) through v5.18.  The difference is that Perl
treated all C<\p{}> matches as failing, but all C<\P{}> matches as
succeeding.

One problem with this is that it leads to unexpected and confusing
results in some cases:

    chr(0x110000) =~ /\p{ASCII_Hex_Digit=True}/    # Failed on <= v5.18
    chr(0x110000) =~ /\p{ASCII_Hex_Digit=False}/   # Failed! on <= v5.18

That is, it treated both matches as undefined, and converted that to
false (raising a warning on each).  The first case is the expected
result, but the second is likely counterintuitive: "How could both be
false when they are complements?"  Another problem was that the
implementation optimized many Unicode property matches down to already
existing simpler, faster operations, which don't raise the warning.  We
chose to not forgo those optimizations, which help the vast majority of
matches, just to generate a warning for the unlikely event that an
above-Unicode code point is being matched against.

As a result of these problems, starting in v5.20, what Perl does is
to treat non-Unicode code points as just typical unassigned Unicode
characters, and matches accordingly.  (Note: Unicode has atypical
unassigned code points.  For example, it has non-character code points,
and ones that, when they do get assigned, are destined to be written
Right-to-left, as Arabic and Hebrew are.  Perl assumes that no
non-Unicode code point has any atypical properties.)

Perl, in most cases, will raise a warning when matching an above-Unicode
code point against a Unicode property when the result is C<TRUE> for
C<\p{}>, and C<FALSE> for C<\P{}>.  For example:

    chr(0x110000) =~ /\p{ASCII_Hex_Digit=True}/    # Fails, no warning
    chr(0x110000) =~ /\p{ASCII_Hex_Digit=False}/   # Succeeds, with warning

In both these examples, the character being matched is non-Unicode, so
Unicode doesn't define how it should match.  It clearly isn't an ASCII
hex digit, so the first example clearly should fail, and so it does,
with no warning.  But it is arguable that the second example should have
an undefined, hence C<FALSE>, result.  So a warning is raised for it.
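
The two matches above can be tried out with a short script such as the
following sketch.  It assumes Perl v5.20 or later; the C<@warnings>
array and the C<$SIG{__WARN__}> handler are just one way to observe
whether a warning was raised, and the number of warnings seen may vary
between releases.

    use strict;
    use warnings;

    # Collect warnings instead of printing them, so they can be counted.
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, @_ };

    my $above = chr(0x110000);    # first code point beyond Unicode

    my $is_hex  = $above =~ /\p{ASCII_Hex_Digit=True}/  ? 1 : 0;
    my $not_hex = $above =~ /\p{ASCII_Hex_Digit=False}/ ? 1 : 0;

    print "ASCII_Hex_Digit=True:  $is_hex\n";
    print "ASCII_Hex_Digit=False: $not_hex\n";
    print "warnings seen: ", scalar(@warnings), "\n";
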

Thus the warning is raised for many fewer cases than in earlier Perls,
and only when what the result is could be arguable.  It turns out that
none of the optimizations made by Perl (or ever likely to be made)
cause the warning to be skipped, so it solves both problems of Perl's
earlier approach.  The most commonly used property that is affected by
this change is C<\p{Unassigned}> which is a short form for
C<\p{General_Category=Unassigned}>.  Starting in v5.20, all non-Unicode
code points are considered C<Unassigned>.  In earlier releases the
matches failed because the result was considered undefined.

The only place where the warning is not raised when it arguably should
have been is if optimizations cause the whole pattern match to not even
be attempted.  For example, Perl may figure out that for a string to
match a certain regular expression pattern, the string has to contain
the substring C<"foobar">.  Before attempting the match, Perl may look
for that substring, and if not found, immediately fail the match without
actually trying it; so no warning gets generated even if the string
contains an above-Unicode code point.

This behavior is more "Do what I mean" than in earlier Perls for most
applications.  But it catches fewer issues for code that needs to be
strictly Unicode compliant.  Therefore there is an additional mode of
operation available to accommodate such code.  This mode is enabled if a
regular expression pattern is compiled within the lexical scope where
the C<"non_unicode"> warning class has been made fatal, say by:

    use warnings FATAL => "non_unicode";

(see L<warnings>).  In this mode of operation, Perl will raise the
warning for all matches against a non-Unicode code point (not just the
arguable ones), and it skips the optimizations that might cause the
warning to not be output.
(It currently still won't warn if the match
isn't even attempted, like in the C<"foobar"> example above.)

In summary, Perl now normally treats non-Unicode code points as typical
Unicode unassigned code points for regular expression matches, raising a
warning only when it is arguable what the result should be.  However, if
this warning has been made fatal, it isn't skipped.

There is one exception to all this.  C<\p{All}> looks like a Unicode
property, but it is a Perl extension that is defined to be true for all
possible code points, Unicode or not, so no warning is ever generated
when matching this against a non-Unicode code point.  (Prior to v5.20,
it was an exact synonym for C<\p{Any}>, matching code points C<0>
through C<0x10FFFF>.)

=head2 Security Implications of Unicode

Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.
Also, note the following:

=over 4

=item *

Malformed UTF-8

Unfortunately, the original specification of UTF-8 leaves some room for
interpretation of how many bytes of encoded output one should generate
from one input Unicode character.  Strictly speaking, the shortest
possible sequence of UTF-8 bytes should be generated,
because otherwise there is potential for an input buffer overflow at
the receiving end of a UTF-8 connection.  Perl always generates the
shortest length UTF-8, and with warnings on, Perl will warn about
non-shortest length UTF-8 along with other malformations, such as the
surrogates, which are not Unicode code points valid for interchange.

=item *

Regular expression pattern matching may surprise you if you're not
accustomed to Unicode.  Starting in Perl 5.14, several pattern
modifiers are available to control this, called the character set
modifiers.  Details are given in L<perlre/Character set modifiers>.

=back

As discussed elsewhere, Perl has one foot (two hooves?) planted in
each of two worlds: the old world of bytes and the new world of
characters, upgrading from bytes to characters when necessary.
If your legacy code does not explicitly use Unicode, no automatic
switch-over to characters should happen.  Characters shouldn't get
downgraded to bytes, either.  It is possible to accidentally mix bytes
and characters, however (see L<perluniintro>), in which case C<\w> in
regular expressions might start behaving differently (unless the C</a>
modifier is in effect).  Review your code.  Use warnings and the C<strict> pragma.

=head2 Unicode in Perl on EBCDIC

The way Unicode is handled on EBCDIC platforms is still
experimental.  On such platforms, references to UTF-8 encoding in this
document and elsewhere should be read as meaning the UTF-EBCDIC
specified in Unicode Technical Report 16, unless ASCII vs. EBCDIC issues
are specifically discussed.  There is no C<utfebcdic> pragma or
C<":utfebcdic"> layer; rather, C<"utf8"> and C<":utf8"> are reused to mean
the platform's "natural" 8-bit encoding of Unicode.  See L<perlebcdic>
for more discussion of the issues.

=head2 Locales

See L<perllocale/Unicode and UTF-8>

=head2 When Unicode Does Not Happen

While Perl does have extensive ways to input and output in Unicode,
and a few other "entry points" like the C<@ARGV> array (which can sometimes be
interpreted as UTF-8), there are still many places where Unicode
(in some encoding or another) could be given as arguments or received as
results, or both, but it is not.

The following are such interfaces.  Also, see L</The "Unicode Bug">.
For all of these interfaces Perl
currently (as of v5.16.0) simply assumes byte strings both as arguments
and results, or UTF-8 strings if the (problematic) C<encoding> pragma has been used.
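
Because these interfaces deal in byte strings, one common approach is to
encode and decode names explicitly at the boundary.  The sketch below
shows the idea for a file name.  It rests on an assumption that does not
hold everywhere: that the file system stores names as UTF-8 bytes and
returns them unchanged.  The file name and variable names are invented
for the example.

    use strict;
    use warnings;
    use Encode     qw(encode decode);
    use File::Temp qw(tempdir);

    # Assumption: file names on this system are UTF-8 byte strings.
    my $dir  = tempdir(CLEANUP => 1);
    my $name = "smile-\x{263A}.txt";        # a character string

    # Encode explicitly before handing the name to open().
    my $path = "$dir/" . encode('UTF-8', $name);
    open(my $fh, '>', $path) or die "open: $!";
    close($fh);

    # readdir() returns byte strings; decode them to get characters back.
    opendir(my $dh, $dir) or die "opendir: $!";
    my @names = map  { decode('UTF-8', $_) }
                grep { !/^\.\.?\z/ } readdir($dh);
    closedir($dh);

    my $found = grep { $_ eq $name } @names;
    print $found ? "round trip ok\n" : "name not found\n";
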

One reason that Perl does not attempt to resolve the role of Unicode in
these situations is that the answers are highly dependent on the operating
system and the file system(s).  For example, whether filenames can be
in Unicode and in exactly what kind of encoding, is not exactly a
portable concept.  Similarly for C<qx> and C<system>: how well will the
"command-line interface" (and which of them?) handle Unicode?

=over 4

=item *

C<chdir>, C<chmod>, C<chown>, C<chroot>, C<exec>, C<link>, C<lstat>, C<mkdir>,
C<rename>, C<rmdir>, C<stat>, C<symlink>, C<truncate>, C<unlink>, C<utime>, C<-X>

=item *

C<%ENV>

=item *

C<glob> (aka the C<E<lt>*E<gt>>)

=item *

C<open>, C<opendir>, C<sysopen>

=item *

C<qx> (aka the backtick operator), C<system>

=item *

C<readdir>, C<readlink>

=back

=head2 The "Unicode Bug"

The term "Unicode bug" has been applied to an inconsistency
on ASCII platforms with the
Unicode code points in the C<Latin-1 Supplement> block, that
is, between 128 and 255.  Without a locale specified, unlike all other
characters or code points, these characters have very different semantics in
byte semantics versus character semantics, unless
C<use feature 'unicode_strings'> is specified, directly or indirectly.
(It is indirectly specified by a C<use v5.12> or higher.)

In character semantics these upper-Latin1 characters are interpreted as
Unicode code points, which means
they have the same semantics as Latin-1 (ISO-8859-1).

In byte semantics (without C<unicode_strings>), they are considered to
be unassigned characters, meaning that the only semantics they have is
their ordinal numbers, and that they are
not members of various character classes.  None are considered to match C<\w>
for example, but all match C<\W>.
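
The contrast between the two semantics can be seen with a single
upper-Latin1 character.  The following sketch assumes an ASCII platform,
no locale in effect, and Perl 5.14 or later; the variable names are
illustrative.

    use strict;
    use warnings;

    my $e_acute = "\xE9";   # LATIN SMALL LETTER E WITH ACUTE, held as one byte

    my ($word_bytes, $uc_bytes, $word_chars, $uc_chars);
    {
        no feature 'unicode_strings';      # byte semantics for 128..255
        $word_bytes = $e_acute =~ /\w/ ? 1 : 0;
        $uc_bytes   = uc($e_acute);
    }
    {
        use feature 'unicode_strings';     # character semantics
        $word_chars = $e_acute =~ /\w/ ? 1 : 0;
        $uc_chars   = uc($e_acute);
    }

    printf "without unicode_strings: \\w=%d uc=U+%04X\n",
        $word_bytes, ord $uc_bytes;
    printf "with unicode_strings:    \\w=%d uc=U+%04X\n",
        $word_chars, ord $uc_chars;

Without the feature, the character fails C<\w> and C<uc()> leaves it
alone; with it, the character is a word character whose uppercase is
C<U+00C9>.
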

Perl 5.12.0 added C<unicode_strings> to force character semantics on
these code points in some circumstances, which fixed portions of the
bug; Perl 5.14.0 fixed almost all of it; and Perl 5.16.0 fixed the
remainder (so far as we know, anyway). The lesson here is to enable
C<unicode_strings> to avoid the headaches described below.

The old, problematic behavior affects these areas:

=over 4

=item *

Changing the case of a scalar, that is, using C<uc()>, C<ucfirst()>, C<lc()>,
and C<lcfirst()>, or C<\L>, C<\U>, C<\u> and C<\l> in double-quotish
contexts, such as regular expression substitutions.
Under C<unicode_strings> starting in Perl 5.12.0, character semantics are
generally used. See L<perlfunc/lc> for details on how this works
in combination with various other pragmas.

=item *

Using caseless (C</i>) regular expression matching.
Starting in Perl 5.14.0, regular expressions compiled within
the scope of C<unicode_strings> use character semantics
even when executed or compiled into larger
regular expressions outside the scope.

=item *

Matching any of several properties in regular expressions, namely C<\b>,
C<\B>, C<\s>, C<\S>, C<\w>, C<\W>, and all the Posix character classes
I<except> C<[[:ascii:]]>.
Starting in Perl 5.14.0, regular expressions compiled within
the scope of C<unicode_strings> use character semantics
even when executed or compiled into larger
regular expressions outside the scope.

=item *

In C<quotemeta> or its inline equivalent C<\Q>, no code points above 127
are quoted in UTF-8 encoded strings, but in byte encoded strings, code
points between 128-255 are always quoted.
Starting in Perl 5.16.0, consistent quoting rules are used within the
scope of C<unicode_strings>, as described in L<perlfunc/quotemeta>.

=back

This behavior can lead to unexpected results in which a string's semantics
suddenly change if a code point above 255 is appended to or removed from it,
which changes the string's semantics from byte to character or vice versa. As
an example, consider the following program and its output:

    $ perl -le'
        no feature "unicode_strings";
        $s1 = "\xC2";
        $s2 = "\x{2660}";
        for ($s1, $s2, $s1.$s2) {
            print /\w/ || 0;
        }
    '
    0
    0
    1

If there's no C<\w> in C<$s1> or in C<$s2>, why does their concatenation have one?

This anomaly stems from Perl's attempt to not disturb older programs that
didn't use Unicode, and hence had no semantics for characters outside of the
ASCII range (except in a locale), along with Perl's desire to add Unicode
support seamlessly. The result wasn't seamless: these characters were
orphaned.

For Perls earlier than those described above, or when a string is passed
to a function outside the subpragma's scope, a workaround is to always
call L<C<utf8::upgrade($string)>|utf8/Utility functions>,
or to use the standard module L<Encode>. Also, a scalar that has any characters
whose ordinal is C<0x100> or above, or which were specified using either of the
C<\N{...}> notations, will automatically have character semantics.

=head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)

Sometimes (see L</"When Unicode Does Not Happen"> or L</The "Unicode Bug">)
there are situations where you simply need to force a byte
string into UTF-8, or vice versa. The low-level calls
L<C<utf8::upgrade($bytestring)>|utf8/Utility functions> and
L<C<utf8::downgrade($utf8string[, FAIL_OK])>|utf8/Utility functions> are
the answers.

Note that C<utf8::downgrade()> can fail if the string contains characters
that don't fit into a byte.
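
A brief sketch of both calls follows. The variable names are illustrative,
and C<utf8::is_utf8()> is used here only to report the internal state.

    use strict;
    use warnings;

    my $string = "caf\xE9";             # four characters, stored as bytes
    utf8::upgrade($string);             # now stored as UTF-8 internally
    print "flag on\n" if utf8::is_utf8($string);
    print length($string), "\n";        # still four characters

    utf8::downgrade($string);           # back to one byte per character
    print "flag off\n" unless utf8::is_utf8($string);

    my $wide = "\x{263A}";              # code point above 255
    # With a true FAIL_OK argument, failure is reported by a false
    # return value instead of an exception.
    my $ok = utf8::downgrade($wide, 1);
    print "cannot downgrade\n" unless $ok;

Neither call changes which characters the string contains, only how they are
stored.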

Calling either function on a string that already is in the desired state is a
no-op.

=head2 Using Unicode in XS

If you want to handle Perl Unicode in XS extensions, you may find the
following C APIs useful. See also L<perlguts/"Unicode Support"> for an
explanation about Unicode at the XS level, and L<perlapi> for the API
details.

=over 4

=item *

C<DO_UTF8(sv)> returns true if the C<UTF8> flag is on and the C<bytes>
pragma is not in effect. C<SvUTF8(sv)> returns true if the C<UTF8>
flag is on; the C<bytes> pragma is ignored. The C<UTF8> flag being on
does B<not> mean that there are any characters of code points greater
than 255 (or 127) in the scalar or that there are even any characters
in the scalar. What the C<UTF8> flag means is that the sequence of
octets in the representation of the scalar is the sequence of UTF-8
encoded code points of the characters of a string. The C<UTF8> flag
being off means that each octet in this representation encodes a
single character with code point 0..255 within the string. Perl's
Unicode model is not to use UTF-8 until it is absolutely necessary.

=item *

C<uvchr_to_utf8(buf, chr)> writes a Unicode character code point into
a buffer encoding the code point as UTF-8, and returns a pointer
pointing after the UTF-8 bytes. It works appropriately on EBCDIC machines.

=item *

C<utf8_to_uvchr_buf(buf, bufend, lenp)> reads UTF-8 encoded bytes from a
buffer and
returns the Unicode character code point and, optionally, the length of
the UTF-8 byte sequence. It works appropriately on EBCDIC machines.

=item *

C<utf8_length(start, end)> returns the length of the UTF-8 encoded buffer
in characters. C<sv_len_utf8(sv)> returns the length of the UTF-8 encoded
scalar in characters.

=item *

C<sv_utf8_upgrade(sv)> converts the string of the scalar to its UTF-8
encoded form. C<sv_utf8_downgrade(sv)> does the opposite, if
possible. C<sv_utf8_encode(sv)> is like C<sv_utf8_upgrade()> except that
it does not set the C<UTF8> flag. C<sv_utf8_decode()> does the
opposite of C<sv_utf8_encode()>. Note that none of these are to be
used as general-purpose encoding or decoding interfaces: C<use Encode>
for that. C<sv_utf8_upgrade()> is affected by the encoding pragma
but C<sv_utf8_downgrade()> is not (since the encoding pragma is
designed to be a one-way street).

=item *

C<is_utf8_string(buf, len)> returns true if C<len> bytes of the buffer
are valid UTF-8.

=item *

C<is_utf8_char_buf(buf, buf_end)> returns true if the pointer points to
a valid UTF-8 character.

=item *

C<UTF8SKIP(buf)> will return the number of bytes in the UTF-8 encoded
character in the buffer. C<UNISKIP(chr)> will return the number of bytes
required to UTF-8-encode the Unicode character code point. C<UTF8SKIP()>
is useful, for example, for iterating over the characters of a UTF-8
encoded buffer; C<UNISKIP()> is useful, for example, in computing
the size required for a UTF-8 encoded buffer.

=item *

C<utf8_distance(a, b)> will tell the distance in characters between the
two pointers pointing to the same UTF-8 encoded buffer.

=item *

C<utf8_hop(s, off)> will return a pointer to a UTF-8 encoded buffer
that is C<off> (positive or negative) Unicode characters displaced
from the UTF-8 buffer C<s>. Be careful not to overstep the buffer:
C<utf8_hop()> will merrily run off the end or the beginning of the
buffer if told to do so.

=item *

C<pv_uni_display(dsv, spv, len, pvlim, flags)> and
C<sv_uni_display(dsv, ssv, pvlim, flags)> are useful for debugging the
output of Unicode strings and scalars.
By default they are useful
only for debugging--they display B<all> characters as hexadecimal code
points--but with the flags C<UNI_DISPLAY_ISPRINT>,
C<UNI_DISPLAY_BACKSLASH>, and C<UNI_DISPLAY_QQ> you can make the
output more readable.

=item *

C<foldEQ_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)> can be used to
compare two strings case-insensitively in Unicode. For case-sensitive
comparisons you can just use C<memEQ()> and C<memNE()> as usual, except
if one string is in utf8 and the other isn't.

=back

For more information, see L<perlapi>, and F<utf8.c> and F<utf8.h>
in the Perl source code distribution.

=head2 Hacking Perl to work on earlier Unicode versions (for very serious hackers only)

Perl by default comes with the latest supported Unicode version built in, but
you can change to use any earlier one.

Download the files in the desired version of Unicode from the Unicode web
site (L<http://www.unicode.org>). These should replace the existing files in
F<lib/unicore> in the Perl source tree. Follow the instructions in
F<README.perl> in that directory to change some of their names, and then build
perl (see L<INSTALL>).

=head1 BUGS

=head2 Interaction with Locales

See L<perllocale/Unicode and UTF-8>

=head2 Problems with characters in the Latin-1 Supplement range

See L</The "Unicode Bug">

=head2 Interaction with Extensions

When Perl exchanges data with an extension, the extension should be
able to understand the UTF8 flag and act accordingly. If the
extension doesn't recognize that flag, it's likely that the extension
will return incorrectly-flagged data.

So if you're working with Unicode data, consult the documentation of
every module you're using to see whether there are any issues with
Unicode data exchange.
If the documentation does not talk about Unicode at all,
suspect the worst and probably look at the source to learn how the
module is implemented. Modules written completely in Perl shouldn't
cause problems. Modules that directly or indirectly access code written
in other programming languages are at risk.

For affected functions, the simple strategy to avoid data corruption is
to always make the encoding of the exchanged data explicit. Choose an
encoding that you know the extension can handle. Convert arguments passed
to the extensions to that encoding and convert results back from that
encoding. Write wrapper functions that do the conversions for you, so
you can later change the functions when the extension catches up.

To provide an example, let's say the popular C<Foo::Bar::escape_html>
function doesn't deal with Unicode data yet. The wrapper function
would convert the argument to raw UTF-8 and convert the result back to
Perl's internal representation like so:

    use Encode ();    # provides encode_utf8() and decode_utf8()

    sub my_escape_html ($) {
        my($what) = shift;
        return unless defined $what;
        Encode::decode_utf8(Foo::Bar::escape_html(
                                     Encode::encode_utf8($what)));
    }

Sometimes, when the extension does not convert data but just stores
and retrieves them, you will be able to use the otherwise
dangerous L<C<Encode::_utf8_on()>|Encode/_utf8_on> function.
Let's say
the popular C<Foo::Bar> extension, written in C, provides a C<param>
method that lets you store and retrieve data according to these prototypes:

    $self->param($name, $value);            # set a scalar
    $value = $self->param($name);           # retrieve a scalar

If it does not yet provide support for any encoding, one could write a
derived class with such a C<param> method:

    sub param {
      my($self,$name,$value) = @_;
      utf8::upgrade($name);     # make sure it is UTF-8 encoded
      if (defined $value) {
        utf8::upgrade($value);  # make sure it is UTF-8 encoded
        return $self->SUPER::param($name,$value);
      } else {
        my $ret = $self->SUPER::param($name);
        Encode::_utf8_on($ret); # we know, it is UTF-8 encoded
        return $ret;
      }
    }

Some extensions provide filters on data entry/exit points, such as
C<DB_File::filter_store_key> and family. Look out for such filters in
the documentation of your extensions; they can make the transition to
Unicode data much easier.

=head2 Speed

Some functions are slower when working on UTF-8 encoded strings than
on byte encoded strings. All functions that need to hop over
characters such as C<length()>, C<substr()> or C<index()>, or matching
regular expressions can work B<much> faster when the underlying data are
byte-encoded.

In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1
a caching scheme was introduced which will hopefully make the slowness
somewhat less spectacular, at least for some operations. In general,
operations with UTF-8 encoded strings are still slower. As an example,
the Unicode properties (character classes) like C<\p{Nd}> are known to
be quite a bit slower (5-20 times) than their simpler counterparts
like C<\d> (then again, there are hundreds of Unicode characters matching C<Nd>
compared with the 10 ASCII characters matching C<\d>).
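
One way to check how much this matters for a given workload is to measure it.
The following is a rough sketch using the core L<Benchmark> module; the
string size, the operation being timed, and the labels are arbitrary choices
made for illustration, and no particular result is implied.

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    my $bytes = "0123456789" x 10_000;   # byte-encoded
    my $chars = $bytes;
    utf8::upgrade($chars);               # same characters, UTF-8 encoded

    # Run each test for at least one CPU second and compare the rates.
    cmpthese(-1, {
        byte_encoded => sub { my $n = () = $bytes =~ /\d/g },
        utf8_encoded => sub { my $n = () = $chars =~ /\d/g },
    });

Results depend heavily on the Perl version, the platform, and the operation
being measured.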

=head2 Problems on EBCDIC platforms

There are several known problems with Perl on EBCDIC platforms. If you
want to use Perl there, send email to perlbug@perl.org.

In earlier versions, when byte and character data were concatenated,
the new string was sometimes created by
decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
old Unicode string used EBCDIC.

If you find any of these, please report them as bugs.

=head2 Porting code from perl-5.6.X

Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer
was required to use the C<utf8> pragma to declare that a given scope
expected to deal with Unicode data and had to make sure that only
Unicode data were reaching that scope. If you have code that is
working with 5.6, you will need some of the following adjustments to
your code. The examples are written such that the code will continue
to work under 5.6, so you should be safe to try them out.

=over 3

=item *

A filehandle that should read or write UTF-8

  if ($] > 5.008) {
    binmode $fh, ":encoding(utf8)";
  }

=item *

A scalar that is going to be passed to some extension

Be it C<Compress::Zlib>, C<Apache::Request> or any extension that has no
mention of Unicode in the manpage, you need to make sure that the
UTF8 flag is stripped off. Note that at the time of this writing
(January 2012) the mentioned modules are not UTF-8-aware. Please
check the documentation to verify if this is still true.

  if ($] > 5.008) {
    require Encode;
    $val = Encode::encode_utf8($val); # make octets
  }

=item *

A scalar we got back from an extension

If you believe the scalar comes back as UTF-8, you will most likely
want the UTF8 flag restored:

  if ($] > 5.008) {
    require Encode;
    $val = Encode::decode_utf8($val);
  }

=item *

Same thing, if you are really sure it is UTF-8

  if ($] > 5.008) {
    require Encode;
    Encode::_utf8_on($val);
  }

=item *

A wrapper for L<DBI> C<fetchrow_array> and C<fetchrow_hashref>

When the database contains only UTF-8, a wrapper function or method is
a convenient way to replace all your C<fetchrow_array> and
C<fetchrow_hashref> calls. A wrapper function will also make it easier to
adapt to future enhancements in your database driver. Note that at the
time of this writing (January 2012), the DBI has no standardized way
to deal with UTF-8 data. Please check the L<DBI documentation|DBI> to verify if
that is still true.

  sub fetchrow {
    # $what is one of fetchrow_{array,hashref}
    my($self, $sth, $what) = @_;
    if ($] < 5.008) {
      return $sth->$what;
    } else {
      require Encode;
      if (wantarray) {
        my @arr = $sth->$what;
        for (@arr) {
          defined && /[^\000-\177]/ && Encode::_utf8_on($_);
        }
        return @arr;
      } else {
        my $ret = $sth->$what;
        if (ref $ret) {
          for my $k (keys %$ret) {
            defined
            && /[^\000-\177]/
            && Encode::_utf8_on($_) for $ret->{$k};
          }
          return $ret;
        } else {
          defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret;
          return $ret;
        }
      }
    }
  }


=item *

A large scalar that you know can only contain ASCII

Scalars that contain only ASCII and are marked as UTF-8 are sometimes
a drag to your program.
If you recognize such a situation, just remove
the UTF8 flag:

  utf8::downgrade($val) if $] > 5.008;

=back

=head1 SEE ALSO

L<perlunitut>, L<perluniintro>, L<perluniprops>, L<Encode>, L<open>, L<utf8>, L<bytes>,
L<perlretut>, L<perlvar/"${^UNICODE}">,
L<http://www.unicode.org/reports/tr44>.

=cut
