=head1 NAME

perlunicode - Unicode support in Perl

=head1 DESCRIPTION

=head2 Important Caveats

Unicode support is an extensive requirement. While Perl does not
implement the Unicode standard or the accompanying technical reports
from cover to cover, Perl does support many Unicode features.

People who want to learn to use Unicode in Perl should probably read
the L<Perl Unicode tutorial, perlunitut|perlunitut> and
L<perluniintro> before reading
this reference document.

Also, the use of Unicode may present security issues that aren't obvious.
Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.

=over 4

=item Safest if you C<use feature 'unicode_strings'>

In order to preserve backward compatibility, Perl does not turn
on full internal Unicode support unless the pragma
C<use feature 'unicode_strings'> is specified.  (This is automatically
selected if you use C<use 5.012> or higher.)  Failure to do this can
cause surprises.  See L</The "Unicode Bug"> below.

This pragma doesn't affect I/O.  Nor does it change the internal
representation of strings, only their interpretation.  There are still
several places where Unicode isn't fully supported, such as in
filenames.

=item Input and Output Layers

Perl knows when a filehandle uses Perl's internal Unicode encodings
(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
the C<:encoding(utf8)> layer.  Other encodings can be converted to Perl's
encoding on input or from Perl's encoding on output by use of the
C<:encoding(...)>  layer.  See L<open>.

To indicate that Perl source itself is in UTF-8, use C<use utf8;>.

=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts

As a compatibility measure, the C<use utf8> pragma must be explicitly
included to enable recognition of UTF-8 in the Perl scripts themselves
(in string or regular expression literals, or in identifier names) on
ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
machines.  B<These are the only times when an explicit C<use utf8>
is needed.>  See L<utf8>.

=item C<BOM>-marked scripts and UTF-16 scripts autodetected

If a Perl script begins marked with the Unicode C<BOM> (UTF-16LE, UTF-16BE,
or UTF-8), or if the script looks like non-C<BOM>-marked UTF-16 of either
endianness, Perl will correctly read in the script as Unicode.
(C<BOM>less UTF-8 cannot be effectively recognized or differentiated from
ISO 8859-1 or other eight-bit encodings.)

=item C<use encoding> needed to upgrade non-Latin-1 byte strings

By default, there is a fundamental asymmetry in Perl's Unicode model:
implicit upgrading from byte strings to Unicode strings assumes that
they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
downgraded with UTF-8 encoding.  This happens because the first 256
code points in Unicode happen to agree with Latin-1.

See L</"Byte and Character Semantics"> for more details.

=back
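
As a rough illustration of the first caveat, the following sketch compares
how C<\w> and C<uc()> treat the Latin-1 character E WITH ACUTE
(C<"\xE9">) with and without C<unicode_strings>.  The variable names are
illustrative only, and the results assume an ASCII platform with no
C<use locale> in effect.

```perl
use strict;
use warnings;

my $e_acute = "\xE9";    # LATIN SMALL LETTER E WITH ACUTE, below 256

# Outside the feature's scope, native byte semantics apply to code points
# below 256: the character has no case and is not a word character.
my ($is_word_off, $upper_off);
{
    no feature 'unicode_strings';
    $is_word_off = $e_acute =~ /\w/ ? 1 : 0;
    $upper_off   = uc $e_acute;
}

# Inside the feature's scope, Unicode rules apply to the same string.
my ($is_word_on, $upper_on);
{
    use feature 'unicode_strings';
    $is_word_on = $e_acute =~ /\w/ ? 1 : 0;
    $upper_on   = uc $e_acute;
}

printf "without: word=%d uc=%02X\n", $is_word_off, ord $upper_off;
printf "with:    word=%d uc=%02X\n", $is_word_on,  ord $upper_on;
```

Under those assumptions the first line reports C<word=0 uc=E9> and the
second C<word=1 uc=C9>.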

=head2 Byte and Character Semantics

Perl uses logically-wide characters to represent strings internally.

Starting in Perl 5.14, Perl-level operations work with
characters rather than bytes within the scope of a
C<L<use feature 'unicode_strings'|feature>> (or equivalently
C<use 5.012> or higher).  (This is not true if bytes have been
explicitly requested by C<L<use bytes|bytes>>, nor necessarily true
for interactions with the platform's operating system.)

For earlier Perls, and when C<unicode_strings> is not in effect, Perl
provides a fairly safe environment that can handle both types of
semantics in programs.  For operations where Perl can unambiguously
decide that the input data are characters, Perl switches to character
semantics.  For operations where this determination cannot be made
without additional information from the user, Perl decides in favor of
compatibility and chooses to use byte semantics.

When C<use locale> (but not C<use locale ':not_characters'>) is in
effect, Perl uses the rules associated with the current locale.
(C<use locale> overrides C<use feature 'unicode_strings'> in the same scope;
while C<use locale ':not_characters'> effectively also selects
C<use feature 'unicode_strings'> in its scope; see L<perllocale>.)
Otherwise, Perl uses the platform's native
byte semantics for characters whose code points are less than 256, and
Unicode rules for those greater than 255.  That means that non-ASCII
characters are undefined except for their
ordinal numbers.  This means that none have case (upper and lower), nor are any
a member of character classes, like C<[:alpha:]> or C<\w>.  (But all do belong
to the C<\W> class or the Perl regular expression extension C<[:^alpha:]>.)

This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations only if
none of the program's inputs were marked as being a source of Unicode
character data.  Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as C<%ENV>),
or from literals and constants in the source text.

The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
Note that this pragma is only required while Perl defaults to byte
semantics; when character semantics become the default, this pragma
may become a no-op.  See L<utf8>.

If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string will have
character semantics.  This can cause surprises: See L</BUGS>, below.
You can choose to be warned when this happens.  See C<L<encoding::warnings>>.

Under character semantics, many operations that formerly operated on
bytes now operate on characters. A character in Perl is
logically just a number ranging from 0 to 2**31 or so. Larger
characters may encode into longer sequences of bytes internally, but
this internal detail is mostly hidden from Perl code.
See L<perluniintro> for more.
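
To make the distinction concrete, here is a small sketch that measures the
same smiley face once as a character string and once as the bytes of its
UTF-8 encoding.  It uses C<encode()> from the core L<Encode> module; the
variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $smiley = "\x{263A}";                  # one character, WHITE SMILING FACE
my $bytes  = encode('UTF-8', $smiley);    # its UTF-8 encoding as a byte string

my $char_count = length $smiley;          # counts characters
my $byte_count = length $bytes;           # counts encoded bytes

print "characters: $char_count, bytes: $byte_count\n";
```

This prints C<characters: 1, bytes: 3>: the character is a single number to
Perl code, and the three-byte sequence only appears once the string is
explicitly encoded.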

=head2 Effects of Character Semantics

Character semantics have the following effects:

=over 4

=item *

Strings--including hash keys--and regular expression patterns may
contain characters that have an ordinal value larger than 255.

If you use a Unicode editor to edit your program, Unicode characters may
occur directly within the literal strings in UTF-8 encoding, or UTF-16.
(The former requires a C<BOM> or C<use utf8>, the latter requires a C<BOM>.)

Unicode characters can also be added to a string by using the C<\N{U+...}>
notation.  The Unicode code for the desired character, in hexadecimal,
should be placed in the braces, after the C<U>. For instance, a smiley face is
C<\N{U+263A}>.

Alternatively, you can use the C<\x{...}> notation for characters C<0x100> and
above.  For characters below C<0x100> you may get byte semantics instead of
character semantics;  see L</The "Unicode Bug">.  On EBCDIC machines there is
the additional problem that the value for such characters gives the EBCDIC
character rather than the Unicode one, thus it is more portable to use
C<\N{U+...}> instead.

Additionally, you can use the C<\N{...}> notation and put the official
Unicode character name within the braces, such as
C<\N{WHITE SMILING FACE}>.  This automatically loads the L<charnames>
module with the C<:full> and C<:short> options.  If you prefer different
options for this module, you can instead, before the C<\N{...}>,
explicitly load it with your desired options; for example,

   use charnames ':loose';

=item *

If an appropriate L<encoding> is specified, identifiers within the
Perl script may contain Unicode alphanumeric characters, including
ideographs.  Perl does not currently attempt to canonicalize variable
names.

=item *

Regular expressions match characters instead of bytes.  C<"."> matches
a character instead of a byte.

=item *

Bracketed character classes in regular expressions match characters instead of
bytes and match against the character properties specified in the
Unicode properties database.  C<\w> can be used to match a Japanese
ideograph, for instance.

=item *

Named Unicode properties, scripts, and block ranges may be used (like bracketed
character classes) by using the C<\p{}> "matches property" construct and
the C<\P{}> negation, "doesn't match property".
See L</"Unicode Character Properties"> for more details.

You can define your own character properties and use them
in the regular expression with the C<\p{}> or C<\P{}> construct.
See L</"User-Defined Character Properties"> for more details.

=item *

The special pattern C<\X> matches a logical character, an "extended grapheme
cluster" in Standardese.  In Unicode what appears to the user to be a single
character, for example an accented C<G>, may in fact be composed of a sequence
of characters, in this case a C<G> followed by an accent character.  C<\X>
will match the entire sequence.

=item *

The C<tr///> operator translates characters instead of bytes.  Note
that the C<tr///CU> functionality has been removed.  For similar
functionality see C<pack('U0', ...)> and C<pack('C0', ...)>.

=item *

Case translation operators use the Unicode case translation tables
when character input is provided.  Note that C<uc()>, or C<\U> in
interpolated strings, translates to uppercase, while C<ucfirst>,
or C<\u> in interpolated strings, translates to titlecase in languages
that make the distinction (which is equivalent to uppercase in languages
without the distinction).

=item *

Most operators that deal with positions or lengths in a string will
automatically switch to using character positions, including
C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
C<sprintf()>, C<write()>, and C<length()>.  An operator that
specifically does not switch is C<vec()>.  Operators that really don't
care include operators that treat strings as a bucket of bits such as
C<sort()>, and operators dealing with filenames.

=item *

The C<pack()>/C<unpack()> letter C<C> does I<not> change, since it is often
used for byte-oriented formats.  Again, think C<char> in the C language.

There is a new C<U> specifier that converts between Unicode characters
and code points. There is also a C<W> specifier that is the equivalent of
C<chr>/C<ord> and properly handles character values even if they are above 255.

=item *

The C<chr()> and C<ord()> functions work on characters, similar to
C<pack("W")> and C<unpack("W")>, I<not> C<pack("C")> and
C<unpack("C")>.  C<pack("C")> and C<unpack("C")> are methods for
emulating byte-oriented C<chr()> and C<ord()> on Unicode strings.
While these methods reveal the internal encoding of Unicode strings,
that is not something one normally needs to care about at all.

=item *

The bit string operators, C<& | ^ ~>, can operate on character data.
However, for backward compatibility, such as when using bit string
operations when characters are all less than 256 in ordinal value, one
should not use C<~> (the bit complement) with characters of both
values less than 256 and values greater than 255.  Most importantly,
DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>)
will not hold.  The reason for this mathematical I<faux pas> is that
the complement cannot return B<both> the 8-bit (byte-wide) bit
complement B<and> the full character-wide bit complement.

=item *

There is a CPAN module, C<L<Unicode::Casing>>, which allows you to define
your own mappings to be used in C<lc()>, C<lcfirst()>, C<uc()>,
C<ucfirst()>, and C<fc> (or their double-quoted string inlined
versions such as C<\U>).
(Prior to Perl 5.16, this functionality was partially provided
in the Perl core, but suffered from a number of insurmountable
drawbacks, so the CPAN module was written instead.)

=back

=over 4

=item *

And finally, C<scalar reverse()> reverses by character rather than by byte.

=back
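
Several of the effects listed above can be checked with a short script.
What follows is a minimal sketch rather than an exhaustive test; the
variable names are illustrative, and U+01C6 (LATIN SMALL LETTER DZ WITH
CARON) was picked only because its uppercase and titlecase forms differ.

```perl
use strict;
use warnings;

# \N{U+...} and \x{...} name the same character.
my $same_char = "\N{U+263A}" eq "\x{263A}";

# "G" followed by COMBINING ACUTE ACCENT: two characters, one \X match.
my $g_acute       = "G\x{301}";
my $char_count    = length $g_acute;
my $cluster_count = () = $g_acute =~ /\X/g;

# uc() maps to uppercase, ucfirst() maps to titlecase.
my $upper = sprintf "%04X", ord(uc("\x{1C6}"));
my $title = sprintf "%04X", ord(ucfirst("\x{1C6}"));

# chr()/ord() and pack("W") handle values above 255.
my $round_trip = ord(chr(0x263A)) == 0x263A;
my $packed_ok  = pack("W", 0x263A) eq chr(0x263A);

# scalar reverse() reverses characters, not bytes.
my $reversed_ok = scalar(reverse("a\x{263A}b")) eq "b\x{263A}a";

print "chars=$char_count clusters=$cluster_count uc=$upper ucfirst=$title\n";
```

The script prints C<chars=2 clusters=1 uc=01C4 ucfirst=01C5>, and each of
the boolean checks is true.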

=head2 Unicode Character Properties

(The only time that Perl considers a sequence of individual code
points as a single logical character is in the C<\X> construct, already
mentioned above.   Therefore "character" in this discussion means a single
Unicode code point.)

Very nearly all Unicode character properties are accessible through
regular expressions by using the C<\p{}> "matches property" construct
and the C<\P{}> "doesn't match property" for its negation.

For instance, C<\p{Uppercase}> matches any single character with the Unicode
C<"Uppercase"> property, while C<\p{L}> matches any character with a
C<General_Category> of C<"L"> (letter) property (see
L</General_Category> below).  Brackets are not
required for single letter property names, so C<\p{L}> is equivalent to C<\pL>.

More formally, C<\p{Uppercase}> matches any single character whose Unicode
C<Uppercase> property value is C<True>, and C<\P{Uppercase}> matches any character
whose C<Uppercase> property value is C<False>, and they could have been written as
C<\p{Uppercase=True}> and C<\p{Uppercase=False}>, respectively.

This formality is needed when properties are not binary; that is, if they can
take on more values than just C<True> and C<False>.  For example, the
C<Bidi_Class> property (see L</"Bidirectional Character Types"> below),
can take on several different
values, such as C<Left>, C<Right>, C<Whitespace>, and others.  To match these, one needs
to specify both the property name (C<Bidi_Class>), AND the value being
matched against
(C<Left>, C<Right>, etc.).  This is done, as in the examples above, by having the
two components separated by an equal sign (or interchangeably, a colon), like
C<\p{Bidi_Class: Left}>.
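
The single and compound forms can be compared side by side.  The following
is a minimal sketch with illustrative variable names; for C<Bidi_Class> it
uses the short values C<L> and C<R> listed under
L</"Bidirectional Character Types"> rather than long value names.

```perl
use strict;
use warnings;

my $upper_a = "A";
my $alef    = "\x{5D0}";    # HEBREW LETTER ALEF

my %result = (
    single_form   => ($upper_a =~ /\p{Uppercase}/      ? 1 : 0),
    compound_true => ($upper_a =~ /\p{Uppercase=True}/ ? 1 : 0),
    negated       => ($upper_a =~ /\P{Uppercase}/      ? 1 : 0),
    no_braces     => ($upper_a =~ /\pL/                ? 1 : 0),
    bidi_left     => ($upper_a =~ /\p{Bidi_Class: L}/  ? 1 : 0),
    bidi_right    => ($alef    =~ /\p{Bidi_Class=R}/   ? 1 : 0),
);

print "$_: $result{$_}\n" for sort keys %result;
```

Every entry is 1 except C<negated>, which is 0 because C<\P{Uppercase}>
rejects an uppercase letter.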

All Unicode-defined character properties may be written in these compound forms
of C<\p{I<property>=I<value>}> or C<\p{I<property>:I<value>}>, but Perl provides some
additional properties that are written only in the single form, as well as
single-form short-cuts for all binary properties and certain others described
below, in which you may omit the property name and the equals or colon
separator.

Most Unicode character properties have at least two synonyms (or aliases if you
prefer): a short one that is easier to type and a longer one that is more
descriptive and hence easier to understand.  Thus the C<"L"> and
C<"Letter"> properties above are equivalent and can be used
interchangeably.  Likewise, C<"Upper"> is a synonym for C<"Uppercase">,
and we could have written C<\p{Uppercase}> equivalently as C<\p{Upper}>.
Also, there are typically various synonyms for the values the property
can be.   For binary properties, C<"True"> has 3 synonyms: C<"T">,
C<"Yes">, and C<"Y">; and C<"False"> has correspondingly C<"F">,
C<"No">, and C<"N">.  But be careful.  A short form of a value for one
property may not mean the same thing as the same short form for another.
Thus, for the C<L</General_Category>> property, C<"L"> means
C<"Letter">, but for the L<C<Bidi_Class>|/Bidirectional Character Types>
property, C<"L"> means C<"Left">.  A complete list of properties and
synonyms is in L<perluniprops>.

Upper/lower case differences in property names and values are irrelevant;
thus C<\p{Upper}> means the same thing as C<\p{upper}> or even C<\p{UpPeR}>.
Similarly, you can add or subtract underscores anywhere in the middle of a
word, so that these are also equivalent to C<\p{U_p_p_e_r}>.  And white space
is irrelevant adjacent to non-word characters, such as the braces and the equals
or colon separators, so C<\p{   Upper  }> and C<\p{ Upper_case : Y }> are
equivalent to these as well.  In fact, white space and even
hyphens can usually be added or deleted anywhere.  So even C<\p{ Up-per case = Yes}> is
equivalent.  All this is called "loose-matching" by Unicode.  The few places
where stricter matching is used are in the middle of numbers, and in the Perl
extension properties that begin or end with an underscore.  Stricter matching
cares about white space (except adjacent to non-word characters),
hyphens, and non-interior underscores.

You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
(C<^>) between the first brace and the property name: C<\p{^Tamil}> is
equal to C<\P{Tamil}>.
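
A quick way to convince yourself of the loose-matching rules is to compile
several of the spellings mentioned above and run them against the same
characters.  This is only a sketch; the list of patterns and the variable
names are illustrative.

```perl
use strict;
use warnings;

# Spellings that the loose-matching rules treat as the same property.
my @patterns = (
    qr/\p{Upper}/,
    qr/\p{upper}/,
    qr/\p{UpPeR}/,
    qr/\p{U_p_p_e_r}/,
    qr/\p{ Upper_case : Y }/,
);

# Every spelling should accept "A" and reject "a".
my $all_match_upper  = !grep { "A" !~ $_ } @patterns;
my $none_match_lower = !grep { "a" =~ $_ } @patterns;

# \p{^Tamil} behaves like \P{Tamil}.
my $tamil_a       = "\x{B85}";    # TAMIL LETTER A
my $caret_negates = ("x" =~ /\p{^Tamil}/) && ($tamil_a !~ /\p{^Tamil}/);

print "loose spellings agree\n" if $all_match_upper && $none_match_lower;
print "caret negates\n"         if $caret_negates;
```

Both messages are printed when the spellings behave as described.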

Almost all properties are immune to case-insensitive matching.  That is,
adding a C</i> regular expression modifier does not change what they
match.  There are two sets that are affected.
The first set is
C<Uppercase_Letter>,
C<Lowercase_Letter>,
and C<Titlecase_Letter>,
all of which match C<Cased_Letter> under C</i> matching.
And the second set is
C<Uppercase>,
C<Lowercase>,
and C<Titlecase>,
all of which match C<Cased> under C</i> matching.
This set also includes its subsets C<PosixUpper> and C<PosixLower> both
of which under C</i> match C<PosixAlpha>.
(The difference between these sets is that some things, such as Roman
numerals, come in both upper and lower case so they are C<Cased>, but aren't considered
letters, so they aren't C<Cased_Letter>s.)
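
The effect of C</i> on these two sets can be seen with a lowercase letter
as the subject.  This is a minimal sketch with illustrative variable names,
and it assumes a Perl recent enough to implement the behavior described
above.

```perl
use strict;
use warnings;

my $lower = "a";

my %result = (
    lu_plain    => ($lower =~ /\p{Lu}/          ? 1 : 0),  # case-sensitive
    lu_folded   => ($lower =~ /\p{Lu}/i         ? 1 : 0),  # acts like Cased_Letter
    upper_fold  => ($lower =~ /\p{Uppercase}/i  ? 1 : 0),  # acts like Cased
    posix_fold  => ($lower =~ /\p{PosixUpper}/i ? 1 : 0),  # acts like PosixAlpha
    digit_fold  => ($lower =~ /\p{Nd}/i         ? 1 : 0),  # not affected by /i
);

print "$_: $result{$_}\n" for sort keys %result;
```

Only C<lu_plain> and C<digit_fold> are expected to be 0.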

See L</Beyond Unicode code points> for special considerations when
matching Unicode properties against non-Unicode code points.

=head3 B<General_Category>

Every Unicode character is assigned a general category, which is the "most
usual categorization of a character" (from
L<http://www.unicode.org/reports/tr44>).

The compound way of writing these is like C<\p{General_Category=Number}>
(short, C<\p{gc:n}>).  But Perl furnishes shortcuts in which everything up
through the equal or colon separator is omitted.  So you can instead just write
C<\pN>.

Here are the short and long forms of the values the C<General_Category> property
can have:

    Short       Long

    L           Letter
    LC, L&      Cased_Letter (that is: [\p{Ll}\p{Lu}\p{Lt}])
    Lu          Uppercase_Letter
    Ll          Lowercase_Letter
    Lt          Titlecase_Letter
    Lm          Modifier_Letter
    Lo          Other_Letter

    M           Mark
    Mn          Nonspacing_Mark
    Mc          Spacing_Mark
    Me          Enclosing_Mark

    N           Number
    Nd          Decimal_Number (also Digit)
    Nl          Letter_Number
    No          Other_Number

    P           Punctuation (also Punct)
    Pc          Connector_Punctuation
    Pd          Dash_Punctuation
    Ps          Open_Punctuation
    Pe          Close_Punctuation
    Pi          Initial_Punctuation
                (may behave like Ps or Pe depending on usage)
    Pf          Final_Punctuation
                (may behave like Ps or Pe depending on usage)
    Po          Other_Punctuation

    S           Symbol
    Sm          Math_Symbol
    Sc          Currency_Symbol
    Sk          Modifier_Symbol
    So          Other_Symbol

    Z           Separator
    Zs          Space_Separator
    Zl          Line_Separator
    Zp          Paragraph_Separator

    C           Other
    Cc          Control (also Cntrl)
    Cf          Format
    Cs          Surrogate
    Co          Private_Use
    Cn          Unassigned

Single-letter properties match all characters in any of the
two-letter sub-properties starting with the same letter.
C<LC> and C<L&> are special: both are aliases for the set consisting of
everything matched by C<Ll>, C<Lu>, and C<Lt>.
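
The relationship between the compound forms, the shortcuts, and the
one-letter and two-letter categories can be exercised as follows.  This is
a minimal sketch; the sample characters and variable names are chosen for
illustration only.

```perl
use strict;
use warnings;

my $roman_eight = "\x{2167}";    # ROMAN NUMERAL EIGHT, category Nl
my $alef        = "\x{5D0}";     # HEBREW LETTER ALEF, category Lo

my %result = (
    compound_long  => ("5"          =~ /\p{General_Category=Number}/ ? 1 : 0),
    compound_short => ("5"          =~ /\p{gc:n}/                    ? 1 : 0),
    shortcut       => ("5"          =~ /\pN/                         ? 1 : 0),
    nl_is_number   => ($roman_eight =~ /\pN/                         ? 1 : 0),
    nl_is_decimal  => ($roman_eight =~ /\p{Nd}/                      ? 1 : 0),
    cased_lc       => ("a"          =~ /\p{LC}/                      ? 1 : 0),
    cased_l_amp    => ("a"          =~ /\p{L&}/                      ? 1 : 0),
    uncased_letter => ($alef        =~ /\p{LC}/                      ? 1 : 0),
);

print "$_: $result{$_}\n" for sort keys %result;
```

The two entries that come out as 0 are C<nl_is_decimal> (a Roman numeral is
a C<Number> but not a C<Decimal_Number>) and C<uncased_letter> (an
C<Other_Letter> is not a C<Cased_Letter>).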

=head3 B<Bidirectional Character Types>

Because scripts differ in their directionality (Hebrew and Arabic are
written right to left, for example) Unicode supplies a C<Bidi_Class> property.
Some of the values this property can have are:

    Value       Meaning

    L           Left-to-Right
    LRE         Left-to-Right Embedding
    LRO         Left-to-Right Override
    R           Right-to-Left
    AL          Arabic Letter
    RLE         Right-to-Left Embedding
    RLO         Right-to-Left Override
    PDF         Pop Directional Format
    EN          European Number
    ES          European Separator
    ET          European Terminator
    AN          Arabic Number
    CS          Common Separator
    NSM         Non-Spacing Mark
    BN          Boundary Neutral
    B           Paragraph Separator
    S           Segment Separator
    WS          Whitespace
    ON          Other Neutrals

This property is always written in the compound form.
For example, C<\p{Bidi_Class:R}> matches characters that are normally
written right to left.  Unlike the
C<L</General_Category>> property, this
property can have more values added in a future Unicode release.  Those
listed above comprised the complete set for many Unicode releases, but
others were added in Unicode 6.3; you can always find what the
current ones are in L<perluniprops>.  And
L<http://www.unicode.org/reports/tr9/> describes how to use them.

=head3 B<Scripts>

The world's languages are written in many different scripts.  This sentence
(unless you're reading it in translation) is written in Latin, while Russian is
written in Cyrillic, and Greek is written in, well, Greek; Japanese mainly in
Hiragana or Katakana.  There are many more.

The Unicode Script and Script_Extensions properties give what script a
given character is in.  Either property can be specified with the
compound form like
C<\p{Script=Hebrew}> (short: C<\p{sc=hebr}>), or
C<\p{Script_Extensions=Javanese}> (short: C<\p{scx=java}>).
In addition, Perl furnishes shortcuts for all
C<Script> property names.  You can omit everything up through the equals
(or colon), and simply write C<\p{Latin}> or C<\P{Cyrillic}>.
(This is not true for C<Script_Extensions>, which is required to be
written in the compound form.)

The difference between these two properties involves characters that are
used in multiple scripts.  For example the digits '0' through '9' are
used in many parts of the world.  These are placed in a script named
C<Common>.  Other characters are used in just a few scripts.  For
example, the C<"KATAKANA-HIRAGANA DOUBLE HYPHEN"> is used in both Japanese
scripts, Katakana and Hiragana, but nowhere else.  The C<Script>
property places all characters that are used in multiple scripts in the
C<Common> script, while the C<Script_Extensions> property places those
that are used in only a few scripts into each of those scripts; while
still using C<Common> for those used in many scripts.  Thus both these
match:

 "0" =~ /\p{sc=Common}/     # Matches
 "0" =~ /\p{scx=Common}/    # Matches

and only the first of these matches:

 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Common}/  # Matches
 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Common}/ # No match

And only the last two of these match:

 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Hiragana}/  # No match
 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{sc=Katakana}/  # No match
 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Hiragana}/ # Matches
 "\N{KATAKANA-HIRAGANA DOUBLE HYPHEN}" =~ /\p{scx=Katakana}/ # Matches

C<Script_Extensions> is thus an improved C<Script>, in which there are
fewer characters in the C<Common> script, and correspondingly more in
other scripts.  It is new in Unicode version 6.0, and its data are likely
to change significantly in later releases, as things get sorted out.

(Actually, besides C<Common>, the C<Inherited> script contains
characters that are used in multiple scripts.  These are modifier
characters which modify other characters, and inherit the script value
of the controlling character.  Some of these are used in many scripts,
and so go into C<Inherited> in both C<Script> and C<Script_Extensions>.
Others are used in just a few scripts, so are in C<Inherited> in
C<Script>, but not in C<Script_Extensions>.)

It is worth stressing that there are several different sets of digits in
Unicode that are equivalent to 0-9 and are matchable by C<\d> in a
regular expression.  If they are used in a single language only, they
are in that language's C<Script> and C<Script_Extensions>.  If they are
used in more than one script, they will be in C<sc=Common>, but only
if they are used in many scripts should they be in C<scx=Common>.
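
The different ways of naming a script, and the point about digits, can be
tried out directly.  The sketch below sticks to the C<Script> property;
the sample characters and variable names are illustrative.

```perl
use strict;
use warnings;

my $alef      = "\x{5D0}";    # HEBREW LETTER ALEF
my $deva_zero = "\x{966}";    # DEVANAGARI DIGIT ZERO

my %result = (
    compound_long   => ($alef      =~ /\p{Script=Hebrew}/  ? 1 : 0),
    compound_short  => ($alef      =~ /\p{sc=hebr}/        ? 1 : 0),
    shortcut        => ($alef      =~ /\p{Hebrew}/         ? 1 : 0),
    not_cyrillic    => ($alef      =~ /\P{Cyrillic}/       ? 1 : 0),
    ascii_digit     => ("0"        =~ /\p{sc=Common}/      ? 1 : 0),
    deva_is_digit   => ($deva_zero =~ /\d/                 ? 1 : 0),
    deva_in_script  => ($deva_zero =~ /\p{sc=Devanagari}/  ? 1 : 0),
    deva_in_common  => ($deva_zero =~ /\p{sc=Common}/      ? 1 : 0),
);

print "$_: $result{$_}\n" for sort keys %result;
```

All entries are 1 except C<deva_in_common>: the Devanagari digit is matched
by C<\d>, yet its C<Script> value is C<Devanagari> rather than C<Common>.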

A complete list of scripts and their shortcuts is in L<perluniprops>.

=head3 B<Use of the C<"Is"> Prefix>

For backward compatibility (with Perl 5.6), all properties mentioned
so far may have C<Is> or C<Is_> prepended to their name, so C<\P{Is_Lu}>, for
example, is equal to C<\P{Lu}>, and C<\p{IsScript:Arabic}> is equal to
C<\p{Arabic}>.

=head3 B<Blocks>

In addition to B<scripts>, Unicode also defines B<blocks> of
characters.  The difference between scripts and blocks is that the
concept of scripts is closer to natural languages, while the concept
of blocks is more of an artificial grouping based on groups of Unicode
characters with consecutive ordinal values. For example, the C<"Basic Latin">
block is all characters whose ordinals are between 0 and 127, inclusive; in
other words, the ASCII characters.  The C<"Latin"> script contains some letters
from this as well as several other blocks, like C<"Latin-1 Supplement">,
C<"Latin Extended-A">, etc., but it does not contain all the characters from
those blocks. It does not, for example, contain the digits 0-9, because
those digits are shared across many scripts, and hence are in the
C<Common> script.

For more about scripts versus blocks, see UAX#24 "Unicode Script Property":
L<http://www.unicode.org/reports/tr24>

The C<Script> or C<Script_Extensions> properties are likely to be the
ones you want to use when processing
natural language; the C<Block> property may occasionally be useful in working
with the nuts and bolts of Unicode.

Block names are matched in the compound form, like C<\p{Block: Arrows}> or
C<\p{Blk=Hebrew}>.  Unlike most other properties, only a few block names have a
Unicode-defined short name.  But Perl does provide a (slight) shortcut:  You
can say, for example C<\p{In_Arrows}> or C<\p{In_Hebrew}>.  For backwards
compatibility, the C<In> prefix may be omitted if there is no naming conflict
with a script or any other property, and you can even use an C<Is> prefix
instead in those cases.  But it is not a good idea to do this, for a couple
of reasons:

=over 4

=item 1

It is confusing.  There are many naming conflicts, and you may forget some.
For example, C<\p{Hebrew}> means the I<script> Hebrew, and NOT the I<block>
Hebrew.  But would you remember that 6 months from now?

=item 2

It is unstable.  A new version of Unicode may preempt the current meaning by
creating a property with the same name.  There was a time in very early Unicode
releases when C<\p{Hebrew}> would have matched the I<block> Hebrew; now it
doesn't.

=back

Some people prefer to always use C<\p{Block: foo}> and C<\p{Script: bar}>
instead of the shortcuts, whether for clarity, because they can't remember the
difference between 'In' and 'Is' anyway, or they aren't confident that those who
eventually will read their code will know that difference.
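
The difference between a block and a script shows up clearly with the
ASCII digits, which sit in the C<Basic Latin> block without belonging to
the C<Latin> script.  The following is a minimal sketch using the explicit
compound forms; the variable names are illustrative.

```perl
use strict;
use warnings;

my $arrow = "\x{2190}";    # LEFTWARDS ARROW

my %result = (
    block_compound  => ($arrow =~ /\p{Block: Arrows}/      ? 1 : 0),
    block_short     => ($arrow =~ /\p{Blk=Arrows}/         ? 1 : 0),
    in_prefix       => ($arrow =~ /\p{In_Arrows}/          ? 1 : 0),
    digit_in_block  => ("0"    =~ /\p{Block: Basic_Latin}/ ? 1 : 0),
    digit_in_script => ("0"    =~ /\p{Script: Latin}/      ? 1 : 0),
    letter_in_both  => (("a" =~ /\p{Block: Basic_Latin}/
                      && "a" =~ /\p{Script: Latin}/)       ? 1 : 0),
);

print "$_: $result{$_}\n" for sort keys %result;
```

Only C<digit_in_script> is 0, which matches the description of the digits
as belonging to the C<Common> script.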
609
610A complete list of blocks and their shortcuts is in L<perluniprops>.
611
612=head3 B<Other Properties>
613
614There are many more properties than the very basic ones described here.
615A complete list is in L<perluniprops>.
616
617Unicode defines all its properties in the compound form, so all single-form
618properties are Perl extensions.  Most of these are just synonyms for the
619Unicode ones, but some are genuine extensions, including several that are in
620the compound form.  And quite a few of these are actually recommended by Unicode
621(in L<http://www.unicode.org/reports/tr18>).
622
623This section gives some details on all extensions that aren't just
624synonyms for compound-form Unicode properties
625(for those properties, you'll have to refer to the
626L<Unicode Standard|http://www.unicode.org/reports/tr44>.
627
628=over
629
630=item B<C<\p{All}>>
631
632This matches every possible code point.  It is equivalent to C<qr/./s>.
633Unlike all the other non-user-defined C<\p{}> property matches, no
634warning is ever generated if this is property is matched against a
635non-Unicode code point (see L</Beyond Unicode code points> below).
636
637=item B<C<\p{Alnum}>>
638
639This matches any C<\p{Alphabetic}> or C<\p{Decimal_Number}> character.
640
641=item B<C<\p{Any}>>
642
643This matches any of the 1_114_112 Unicode code points.  It is a synonym
644for C<\p{Unicode}>.
645
646=item B<C<\p{ASCII}>>
647
648This matches any of the 128 characters in the US-ASCII character set,
649which is a subset of Unicode.
650
651=item B<C<\p{Assigned}>>
652
653This matches any assigned code point; that is, any code point whose L<general
654category|/General_Category> is not C<Unassigned> (or equivalently, not C<Cn>).
655
656=item B<C<\p{Blank}>>
657
658This is the same as C<\h> and C<\p{HorizSpace}>:  A character that changes the
659spacing horizontally.
660
661=item B<C<\p{Decomposition_Type: Non_Canonical}>>    (Short: C<\p{Dt=NonCanon}>)
662
663Matches a character that has a non-canonical decomposition.
664
665To understand the use of this rarely used I<property=value> combination, it is
666necessary to know some basics about decomposition.
667Consider a character, say H.  It could appear with various marks around it,
668such as an acute accent, or a circumflex, or various hooks, circles, arrows,
669I<etc.>, above, below, to one side or the other, etc.  There are many
670possibilities among the world's languages.  The number of combinations is
671astronomical, and if there were a character for each combination, it would
672soon exhaust Unicode's more than a million possible characters.  So Unicode
673took a different approach: there is a character for the base H, and a
674character for each of the possible marks, and these can be variously combined
675to get a final logical character.  So a logical character--what appears to be a
676single character--can be a sequence of more than one individual characters.
677This is called an "extended grapheme cluster";  Perl furnishes the C<\X>
678regular expression construct to match such sequences.
679
680But Unicode's intent is to unify the existing character set standards and
681practices, and several pre-existing standards have single characters that
682mean the same thing as some of these combinations.  An example is ISO-8859-1,
683which has quite a few of these in the Latin-1 range, an example being C<"LATIN
684CAPITAL LETTER E WITH ACUTE">.  Because this character was in this pre-existing
685standard, Unicode added it to its repertoire.  But this character is considered
686by Unicode to be equivalent to the sequence consisting of the character
687C<"LATIN CAPITAL LETTER E"> followed by the character C<"COMBINING ACUTE ACCENT">.
688
689C<"LATIN CAPITAL LETTER E WITH ACUTE"> is called a "pre-composed" character, and
690its equivalence with the sequence is called canonical equivalence.  All
691pre-composed characters are said to have a decomposition (into the equivalent
692sequence), and the decomposition type is also called canonical.
693
694However, many more characters have a different type of decomposition, a
695"compatible" or "non-canonical" decomposition.  The sequences that form these
696decompositions are not considered canonically equivalent to the pre-composed
697character.  An example, again in the Latin-1 range, is the C<"SUPERSCRIPT ONE">.
698It is somewhat like a regular digit 1, but not exactly; its decomposition
699into the digit 1 is called a "compatible" decomposition, specifically a
700"super" decomposition.  There are several such compatibility
701decompositions (see L<http://www.unicode.org/reports/tr44>), including one
702called "compat", which means some miscellaneous type of decomposition
703that doesn't fit into the decomposition categories that Unicode has chosen.
704
705Note that most Unicode characters don't have a decomposition, so their
706decomposition type is C<"None">.
707
708For your convenience, Perl has added the C<Non_Canonical> decomposition
709type to mean any of the several compatibility decompositions.
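
As a rough illustration of the distinctions above, the following sketch
checks a few characters against the decomposition types.  The code points
are the ones named in the text; using the core L<Unicode::Normalize>
module to show the canonical decomposition is an addition made here for
illustration only.

    use strict;
    use warnings;
    use Unicode::Normalize qw(NFD);

    my $e_acute = "\x{C9}";    # LATIN CAPITAL LETTER E WITH ACUTE
    my $sup_one = "\x{B9}";    # SUPERSCRIPT ONE

    # A pre-composed character has a canonical decomposition ...
    my $canon  = $e_acute =~ /\p{Decomposition_Type: Canonical}/ ? 1 : 0;
    # ... while SUPERSCRIPT ONE has a compatibility ("super") one.
    my $compat = $sup_one =~ /\p{Dt=NonCanon}/ ? 1 : 0;
    # Most characters, such as a plain "E", have no decomposition.
    my $none   = "E" =~ /\p{Decomposition_Type: None}/ ? 1 : 0;

    # NFD applies canonical decompositions: the result is "E" followed
    # by COMBINING ACUTE ACCENT, which \X treats as a single extended
    # grapheme cluster.
    my $nfd      = NFD($e_acute);
    my @clusters = $nfd =~ /(\X)/g;

    printf "canonical=%d non_canonical=%d none=%d\n",
        $canon, $compat, $none;
    printf "NFD length=%d, clusters=%d\n", length($nfd), scalar @clusters;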
710
711=item B<C<\p{Graph}>>
712
713Matches any character that is graphic.  Theoretically, this means a character
714that on a printer would cause ink to be used.
715
716=item B<C<\p{HorizSpace}>>
717
718This is the same as C<\h> and C<\p{Blank}>:  a character that changes the
719spacing horizontally.
720
721=item B<C<\p{In=*}>>
722
This is a synonym for C<\p{Present_In=*}>.
724
725=item B<C<\p{PerlSpace}>>
726
727This is the same as C<\s>, restricted to ASCII, namely C<S<[ \f\n\r\t]>>
728and starting in Perl v5.18, experimentally, a vertical tab.
729
730Mnemonic: Perl's (original) space
731
732=item B<C<\p{PerlWord}>>
733
This is the same as C<\w>, restricted to ASCII, namely C<[A-Za-z0-9_]>.
735
736Mnemonic: Perl's (original) word.
737
738=item B<C<\p{Posix...}>>
739
740There are several of these, which are equivalents using the C<\p{}>
741notation for Posix classes and are described in
742L<perlrecharclass/POSIX Character Classes>.
743
744=item B<C<\p{Present_In: *}>>    (Short: C<\p{In=*}>)
745
746This property is used when you need to know in what Unicode version(s) a
747character is.
748
749The "*" above stands for some two digit Unicode version number, such as
750C<1.1> or C<4.0>; or the "*" can also be C<Unassigned>.  This property will
751match the code points whose final disposition has been settled as of the
752Unicode release given by the version number; C<\p{Present_In: Unassigned}>
753will match those code points whose meaning has yet to be assigned.
754
755For example, C<U+0041> C<"LATIN CAPITAL LETTER A"> was present in the very first
756Unicode release available, which is C<1.1>, so this property is true for all
757valid "*" versions.  On the other hand, C<U+1EFF> was not assigned until version
5.1 when it became C<"LATIN SMALL LETTER Y WITH LOOP">, so the only "*" values that
match it are 5.1, 5.2, and later.
760
761Unicode furnishes the C<Age> property from which this is derived.  The problem
762with Age is that a strict interpretation of it (which Perl takes) has it
763matching the precise release a code point's meaning is introduced in.  Thus
764C<U+0041> would match only 1.1; and C<U+1EFF> only 5.1.  This is not usually what
765you want.
766
767Some non-Perl implementations of the Age property may change its meaning to be
768the same as the Perl C<Present_In> property; just be aware of that.
769
770Another confusion with both these properties is that the definition is not
771that the code point has been I<assigned>, but that the meaning of the code point
772has been I<determined>.  This is because 66 code points will always be
773unassigned, and so the C<Age> for them is the Unicode version in which the decision
774to make them so was made.  For example, C<U+FDD0> is to be permanently
775unassigned to a character, and the decision to do that was made in version 3.1,
776so C<\p{Age=3.1}> matches this character, as also does C<\p{Present_In: 3.1}> and up.
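
The difference between the two properties can be seen with a short
sketch like the following, which uses the same C<U+1EFF> example as
above.  The variable names are only illustrative.

    use strict;
    use warnings;

    my $y_loop = "\x{1EFF}";  # LATIN SMALL LETTER Y WITH LOOP (new in 5.1)

    # Age matches only the release that introduced the code point.
    my $age_51 = $y_loop =~ /\p{Age=5.1}/ ? 1 : 0;
    my $age_52 = $y_loop =~ /\p{Age=5.2}/ ? 1 : 0;

    # Present_In (short: In) matches that release and every later one.
    my $in_50  = $y_loop =~ /\p{In=5.0}/ ? 1 : 0;
    my $in_52  = $y_loop =~ /\p{In=5.2}/ ? 1 : 0;

    # "A" has been present since 1.1, so it is present in 5.2 as well.
    my $a_in_52 = "A" =~ /\p{Present_In: 5.2}/ ? 1 : 0;

    print "Age=5.1: $age_51  Age=5.2: $age_52\n";
    print "In=5.0:  $in_50  In=5.2:  $in_52  A In=5.2: $a_in_52\n";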
777
778=item B<C<\p{Print}>>
779
780This matches any character that is graphical or blank, except controls.
781
782=item B<C<\p{SpacePerl}>>
783
784This is the same as C<\s>, including beyond ASCII.
785
786Mnemonic: Space, as modified by Perl.  (It doesn't include the vertical tab
787which both the Posix standard and Unicode consider white space.)
788
789=item B<C<\p{Title}>> and  B<C<\p{Titlecase}>>
790
791Under case-sensitive matching, these both match the same code points as
792C<\p{General Category=Titlecase_Letter}> (C<\p{gc=lt}>).  The difference
793is that under C</i> caseless matching, these match the same as
C<\p{Cased}>, whereas C<\p{gc=lt}> matches C<\p{Cased_Letter}>.
795
796=item B<C<\p{Unicode}>>
797
This matches any of the 1_114_112 Unicode code points.
It is a synonym for C<\p{Any}>.
800
801=item B<C<\p{VertSpace}>>
802
803This is the same as C<\v>:  A character that changes the spacing vertically.
804
805=item B<C<\p{Word}>>
806
807This is the same as C<\w>, including over 100_000 characters beyond ASCII.
808
809=item B<C<\p{XPosix...}>>
810
811There are several of these, which are the standard Posix classes
812extended to the full Unicode range.  They are described in
813L<perlrecharclass/POSIX Character Classes>.
814
815=back
816
817
818=head2 User-Defined Character Properties
819
820You can define your own binary character properties by defining subroutines
821whose names begin with C<"In"> or C<"Is">.  (The experimental feature
822L<perlre/(?[ ])> provides an alternative which allows more complex
823definitions.)  The subroutines can be defined in any
824package.  The user-defined properties can be used in the regular expression
825C<\p{}> and C<\P{}> constructs; if you are using a user-defined property from a
826package other than the one you are in, you must specify its package in the
827C<\p{}> or C<\P{}> construct.
828
    # assuming property IsForeign is defined in package Lang
830    package main;  # property package name required
831    if ($txt =~ /\p{Lang::IsForeign}+/) { ... }
832
833    package Lang;  # property package name not required
834    if ($txt =~ /\p{IsForeign}+/) { ... }
835
836
837Note that the effect is compile-time and immutable once defined.
838However, the subroutines are passed a single parameter, which is 0 if
839case-sensitive matching is in effect and non-zero if caseless matching
840is in effect.  The subroutine may return different values depending on
841the value of the flag, and one set of values will immutably be in effect
842for all case-sensitive matches, and the other set for all case-insensitive
843matches.
844
845Note that if the regular expression is tainted, then Perl will die rather
846than calling the subroutine when the name of the subroutine is
847determined by the tainted data.
848
849The subroutines must return a specially-formatted string, with one
850or more newline-separated lines.  Each line must be one of the following:
851
852=over 4
853
854=item *
855
856A single hexadecimal number denoting a code point to include.
857
858=item *
859
Two hexadecimal numbers separated by horizontal whitespace (space or
tab characters) denoting a range of code points to include.
862
863=item *
864
865Something to include, prefixed by C<"+">: a built-in character
866property (prefixed by C<"utf8::">) or a fully qualified (including package
867name) user-defined character property,
868to represent all the characters in that property; two hexadecimal code
869points for a range; or a single hexadecimal code point.
870
871=item *
872
873Something to exclude, prefixed by C<"-">: an existing character
874property (prefixed by C<"utf8::">) or a fully qualified (including package
875name) user-defined character property,
876to represent all the characters in that property; two hexadecimal code
877points for a range; or a single hexadecimal code point.
878
879=item *
880
Something to negate, prefixed by C<"!">: an existing character
882property (prefixed by C<"utf8::">) or a fully qualified (including package
883name) user-defined character property,
884to represent all the characters in that property; two hexadecimal code
885points for a range; or a single hexadecimal code point.
886
887=item *
888
889Something to intersect with, prefixed by C<"&">: an existing character
890property (prefixed by C<"utf8::">) or a fully qualified (including package
891name) user-defined character property,
to keep only the characters that are also in that property; two
hexadecimal code points for a range; or a single hexadecimal code point.
894
895=back
896
897For example, to define a property that covers both the Japanese
898syllabaries (hiragana and katakana), you can define
899
900    sub InKana {
901        return <<END;
902    3040\t309F
903    30A0\t30FF
904    END
905    }
906
907Imagine that the here-doc end marker is at the beginning of the line.
908Now you can use C<\p{InKana}> and C<\P{InKana}>.
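
A minimal, self-contained sketch of using such a property follows.  It
returns the same two ranges as an ordinary double-quoted string, so no
here-doc terminator is involved; the sample text and variable names are
assumptions made for the example.

    use strict;
    use warnings;

    # Two tab-separated hexadecimal ranges, one per line.
    sub InKana { return "3040\t309F\n30A0\t30FF\n" }

    # HIRAGANA LETTER A, KATAKANA LETTER A, then three ASCII letters
    my $text = "\x{3042}\x{30A2}abc";

    my $kana  = () = $text =~ /\p{InKana}/g;   # count the matches
    my $other = () = $text =~ /\P{InKana}/g;   # count the non-matches

    printf "%d kana, %d other\n", $kana, $other;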
909
910You could also have used the existing block property names:
911
912    sub InKana {
913        return <<'END';
914    +utf8::InHiragana
915    +utf8::InKatakana
916    END
917    }
918
919Suppose you wanted to match only the allocated characters,
920not the raw block ranges: in other words, you want to remove
the unassigned code points:
922
923    sub InKana {
924        return <<'END';
925    +utf8::InHiragana
926    +utf8::InKatakana
927    -utf8::IsCn
928    END
929    }
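
To see what the subtraction changes, the sketch below compares the
block-based definition with the one that removes C<Cn>.  The property
names C<InKanaBlocks> and C<InKanaAssigned> are hypothetical, and the
example relies on C<U+3040> being an unassigned code point inside the
Hiragana block, which is true of the Unicode versions we are aware of.

    use strict;
    use warnings;

    sub InKanaBlocks {
        return "+utf8::InHiragana\n+utf8::InKatakana\n";
    }
    sub InKanaAssigned {
        return "+utf8::InHiragana\n+utf8::InKatakana\n-utf8::IsCn\n";
    }

    my $gap  = "\x{3040}";   # inside the block, but unassigned
    my $hira = "\x{3042}";   # HIRAGANA LETTER A

    my $gap_in_blocks    = $gap  =~ /\p{InKanaBlocks}/   ? 1 : 0;
    my $gap_in_assigned  = $gap  =~ /\p{InKanaAssigned}/ ? 1 : 0;
    my $hira_in_assigned = $hira =~ /\p{InKanaAssigned}/ ? 1 : 0;

    print "U+3040: blocks=$gap_in_blocks assigned=$gap_in_assigned\n";
    print "U+3042: assigned=$hira_in_assigned\n";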
930
931The negation is useful for defining (surprise!) negated classes.
932
933    sub InNotKana {
934        return <<'END';
935    !utf8::InHiragana
936    -utf8::InKatakana
937    +utf8::IsCn
938    END
939    }
940
941This will match all non-Unicode code points, since every one of them is
942not in Kana.  You can use intersection to exclude these, if desired, as
943this modified example shows:
944
945    sub InNotKana {
946        return <<'END';
947    !utf8::InHiragana
948    -utf8::InKatakana
949    +utf8::IsCn
950    &utf8::Any
951    END
952    }
953
954C<&utf8::Any> must be the last line in the definition.
955
956Intersection is used generally for getting the common characters matched
957by two (or more) classes.  It's important to remember not to use C<"&"> for
958the first set; that would be intersecting with nothing, resulting in an
959empty set.
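
As a sketch of intersection, the hypothetical property C<IsKanaLetter>
below first builds the union of the two kana blocks and then keeps only
the code points that are also alphabetic.  It assumes that
C<utf8::IsAlpha> is accepted in the same way as the C<utf8::> names used
above.

    use strict;
    use warnings;

    sub IsKanaLetter {
        # The first lines build the set; "&" then intersects with it.
        return "+utf8::InHiragana\n+utf8::InKatakana\n&utf8::IsAlpha\n";
    }

    my $letter = "\x{30A2}";   # KATAKANA LETTER A
    my $dot    = "\x{30FB}";   # KATAKANA MIDDLE DOT (punctuation)

    my $letter_ok = $letter =~ /\p{IsKanaLetter}/ ? 1 : 0;
    my $dot_ok    = $dot    =~ /\p{IsKanaLetter}/ ? 1 : 0;

    print "letter=$letter_ok middle_dot=$dot_ok\n";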
960
961Unlike non-user-defined C<\p{}> property matches, no warning is ever
962generated if these properties are matched against a non-Unicode code
963point (see L</Beyond Unicode code points> below).
964
965=head2 User-Defined Case Mappings (for serious hackers only)
966
967B<This feature has been removed as of Perl 5.16.>
968The CPAN module C<L<Unicode::Casing>> provides better functionality without
969the drawbacks that this feature had.  If you are using a Perl earlier
970than 5.16, this feature was most fully documented in the 5.14 version of
971this pod:
972L<http://perldoc.perl.org/5.14.0/perlunicode.html#User-Defined-Case-Mappings-%28for-serious-hackers-only%29>
973
974=head2 Character Encodings for Input and Output
975
976See L<Encode>.
977
978=head2 Unicode Regular Expression Support Level
979
980The following list of Unicode supported features for regular expressions describes
981all features currently directly supported by core Perl.  The references to "Level N"
982and the section numbers refer to the Unicode Technical Standard #18,
983"Unicode Regular Expressions", version 13, from August 2008.
984
985=over 4
986
987=item *
988
989Level 1 - Basic Unicode Support
990
991 RL1.1   Hex Notation                     - done          [1]
992 RL1.2   Properties                       - done          [2][3]
993 RL1.2a  Compatibility Properties         - done          [4]
994 RL1.3   Subtraction and Intersection     - experimental  [5]
995 RL1.4   Simple Word Boundaries           - done          [6]
996 RL1.5   Simple Loose Matches             - done          [7]
997 RL1.6   Line Boundaries                  - MISSING       [8][9]
998 RL1.7   Supplementary Code Points        - done          [10]
999
1000=over 4
1001
1002=item [1]
1003
1004C<\x{...}>
1005
1006=item [2]
1007
1008C<\p{...}> C<\P{...}>
1009
1010=item [3]
1011
1012supports not only minimal list, but all Unicode character properties (see Unicode Character Properties above)
1013
1014=item [4]
1015
1016C<\d> C<\D> C<\s> C<\S> C<\w> C<\W> C<\X> C<[:I<prop>:]> C<[:^I<prop>:]>
1017
1018=item [5]
1019
1020The experimental feature in v5.18 C<"(?[...])"> accomplishes this.  See
1021L<perlre/(?[ ])>.  If you don't want to use an experimental feature,
1022you can use one of the following:
1023
1024=over 4
1025
1026=item * Regular expression look-ahead
1027
1028You can mimic class subtraction using lookahead.
1029For example, what UTS#18 might write as
1030
1031    [{Block=Greek}-[{UNASSIGNED}]]
1032
1033in Perl can be written as:
1034
1035    (?!\p{Unassigned})\p{Block=Greek}
1036    (?=\p{Assigned})\p{Block=Greek}
1037
1038But in this particular example, you probably really want
1039
1040    \p{Greek}
1041
1042which will match assigned characters known to be part of the Greek script.
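
A small sketch of the lookahead form follows.  It assumes that C<U+03A2>,
a reserved position inside the Greek block, is still unassigned in the
Unicode version your Perl was built with.

    use strict;
    use warnings;

    my $gap   = "\x{3A2}";   # unassigned code point in the Greek block
    my $alpha = "\x{3B1}";   # GREEK SMALL LETTER ALPHA

    # Block membership minus unassigned code points, via lookahead
    my $assigned_greek = qr/(?!\p{Unassigned})\p{Block=Greek}/;

    my $gap_in_block  = $gap   =~ /\p{Block=Greek}/ ? 1 : 0;
    my $gap_assigned  = $gap   =~ $assigned_greek   ? 1 : 0;
    my $alpha_matches = $alpha =~ $assigned_greek   ? 1 : 0;

    print "U+03A2: block=$gap_in_block subtracted=$gap_assigned\n";
    print "U+03B1: subtracted=$alpha_matches\n";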
1043
1044=item * CPAN module C<L<Unicode::Regex::Set>>
1045
1046It does implement the full UTS#18 grouping, intersection, union, and
1047removal (subtraction) syntax.
1048
1049=item * L</"User-Defined Character Properties">
1050
1051C<"+"> for union, C<"-"> for removal (set-difference), C<"&"> for intersection
1052
1053=back
1054
1055=item [6]
1056
1057C<\b> C<\B>
1058
1059=item [7]
1060
1061Note that Perl does Full case-folding in matching (but with bugs), not
1062Simple: for example C<U+1F88> is equivalent to C<U+1F00 U+03B9>, instead of
1063just C<U+1F80>.  This difference matters mainly for certain Greek capital
1064letters with certain modifiers: the Full case-folding decomposes the
1065letter, while the Simple case-folding would map it to a single
1066character.
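
The full fold of C<U+1F88> can be inspected with the C<fc> function,
available under C<use feature 'fc'> from Perl v5.16.  This is only a
sketch; the caseless match at the end is expected to succeed per the
description above, but multi-character folds are the area where the bugs
mentioned above tend to appear.

    use strict;
    use warnings;
    use feature 'fc';

    # GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI
    my $upper = "\x{1F88}";

    # Full case folding yields two characters: U+1F00 U+03B9
    my $folded = fc($upper);
    printf "fc: %vX (length %d)\n", $folded, length $folded;

    # Caseless matching uses the same full fold.
    my $matches = $upper =~ /\A\x{1F00}\x{3B9}\z/i ? 1 : 0;
    print "caseless match against the folded sequence: $matches\n";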
1067
1068=item [8]
1069
1070Should do C<^> and C<$> also on C<U+000B> (C<\v> in C), C<FF> (C<\f>),
1071C<CR> (C<\r>), C<CRLF> (C<\r\n>), C<NEL> (C<U+0085>), C<LS> (C<U+2028>),
1072and C<PS> (C<U+2029>); should also affect C<E<lt>E<gt>>, C<$.>, and
1073script line numbers; should not split lines within C<CRLF> (i.e. there
1074is no empty line between C<\r> and C<\n>).  For C<CRLF>, try the
1075C<:crlf> layer (see L<PerlIO>).
1076
1077=item [9]
1078
1079Linebreaking conformant with L<UAX#14 "Unicode Line Breaking
1080Algorithm"|http://www.unicode.org/reports/tr14>
1081is available through the C<L<Unicode::LineBreak>> module.
1082
1083=item [10]
1084
UTF-8/UTF-EBCDIC used in Perl allows not only C<U+10000> to
C<U+10FFFF> but also beyond C<U+10FFFF>.
1087
1088=back
1089
1090=item *
1091
1092Level 2 - Extended Unicode Support
1093
1094 RL2.1   Canonical Equivalents           - MISSING       [10][11]
1095 RL2.2   Default Grapheme Clusters       - MISSING       [12]
1096 RL2.3   Default Word Boundaries         - MISSING       [14]
1097 RL2.4   Default Loose Matches           - MISSING       [15]
1098 RL2.5   Name Properties                 - DONE
1099 RL2.6   Wildcard Properties             - MISSING
1100
1101 [10] see UAX#15 "Unicode Normalization Forms"
1102 [11] have Unicode::Normalize but not integrated to regexes
1103 [12] have \X but we don't have a "Grapheme Cluster Mode"
1104 [14] see UAX#29, Word Boundaries
1105 [15] This is covered in Chapter 3.13 (in Unicode 6.0)
1106
1107=item *
1108
1109Level 3 - Tailored Support
1110
1111 RL3.1   Tailored Punctuation            - MISSING
1112 RL3.2   Tailored Grapheme Clusters      - MISSING       [17][18]
1113 RL3.3   Tailored Word Boundaries        - MISSING
1114 RL3.4   Tailored Loose Matches          - MISSING
1115 RL3.5   Tailored Ranges                 - MISSING
1116 RL3.6   Context Matching                - MISSING       [19]
1117 RL3.7   Incremental Matches             - MISSING
1118      ( RL3.8   Unicode Set Sharing )
1119 RL3.9   Possible Match Sets             - MISSING
1120 RL3.10  Folded Matching                 - MISSING       [20]
1121 RL3.11  Submatchers                     - MISSING
1122
 [17] see UTS#10 "Unicode Collation Algorithm"
1124 [18] have Unicode::Collate but not integrated to regexes
1125 [19] have (?<=x) and (?=x), but look-aheads or look-behinds
1126      should see outside of the target substring
1127 [20] need insensitive matching for linguistic features other
1128      than case; for example, hiragana to katakana, wide and
1129      narrow, simplified Han to traditional Han (see UTR#30
1130      "Character Foldings")
1131
1132=back
1133
1134=head2 Unicode Encodings
1135
1136Unicode characters are assigned to I<code points>, which are abstract
1137numbers.  To use these numbers, various encodings are needed.
1138
1139=over 4
1140
1141=item *
1142
1143UTF-8
1144
1145UTF-8 is a variable-length (1 to 4 bytes), byte-order independent
1146encoding. For ASCII (and we really do mean 7-bit ASCII, not another
11478-bit encoding), UTF-8 is transparent.
1148
1149The following table is from Unicode 3.2.
1150
1151 Code Points            1st Byte  2nd Byte  3rd Byte 4th Byte
1152
1153   U+0000..U+007F       00..7F
1154   U+0080..U+07FF     * C2..DF    80..BF
1155   U+0800..U+0FFF       E0      * A0..BF    80..BF
1156   U+1000..U+CFFF       E1..EC    80..BF    80..BF
1157   U+D000..U+D7FF       ED        80..9F    80..BF
1158   U+D800..U+DFFF       +++++ utf16 surrogates, not legal utf8 +++++
1159   U+E000..U+FFFF       EE..EF    80..BF    80..BF
1160  U+10000..U+3FFFF      F0      * 90..BF    80..BF    80..BF
1161  U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
1162 U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF
1163
1164Note the gaps marked by "*" before several of the byte entries above.  These are
1165caused by legal UTF-8 avoiding non-shortest encodings: it is technically
1166possible to UTF-8-encode a single code point in different ways, but that is
1167explicitly forbidden, and the shortest possible encoding should always be used
1168(and that is what Perl does).
1169
1170Another way to look at it is via bits:
1171
1172                Code Points  1st Byte  2nd Byte  3rd Byte  4th Byte
1173
1174                   0aaaaaaa  0aaaaaaa
1175           00000bbbbbaaaaaa  110bbbbb  10aaaaaa
1176           ccccbbbbbbaaaaaa  1110cccc  10bbbbbb  10aaaaaa
1177 00000dddccccccbbbbbbaaaaaa  11110ddd  10cccccc  10bbbbbb  10aaaaaa
1178
1179As you can see, the continuation bytes all begin with C<"10">, and the
1180leading bits of the start byte tell how many bytes there are in the
1181encoded character.
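
The bit layout above can be sketched as a small encoder for code points
up to C<U+10FFFF>.  The function name C<utf8_bytes> is hypothetical, and
the sketch does no validation (it does not reject surrogates, for
example); the core L<Encode> module is used only to cross-check the
result.

    use strict;
    use warnings;
    use Encode qw(encode);

    # Return the UTF-8 bytes (as integers) for one code point.
    sub utf8_bytes {
        my ($cp) = @_;
        return ($cp) if $cp < 0x80;
        return (0xC0 | ($cp >> 6),
                0x80 | ($cp & 0x3F)) if $cp < 0x800;
        return (0xE0 | ($cp >> 12),
                0x80 | (($cp >> 6) & 0x3F),
                0x80 | ($cp & 0x3F)) if $cp < 0x10000;
        return (0xF0 | ($cp >> 18),
                0x80 | (($cp >> 12) & 0x3F),
                0x80 | (($cp >> 6) & 0x3F),
                0x80 | ($cp & 0x3F));
    }

    for my $cp (0x41, 0xE9, 0x20AC, 0x1F600) {
        my $manual = join " ", map { sprintf "%02X", $_ } utf8_bytes($cp);
        my $encode = join " ", map { sprintf "%02X", ord }
                               split //, encode("UTF-8", chr $cp);
        printf "U+%04X => %s (Encode: %s)\n", $cp, $manual, $encode;
    }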
1182
1183The original UTF-8 specification allowed up to 6 bytes, to allow
1184encoding of numbers up to C<0x7FFF_FFFF>.  Perl continues to allow those,
1185and has extended that up to 13 bytes to encode code points up to what
1186can fit in a 64-bit word.  However, Perl will warn if you output any of
1187these as being non-portable; and under strict UTF-8 input protocols,
1188they are forbidden.
1189
1190The Unicode non-character code points are also disallowed in UTF-8 in
1191"open interchange".  See L</Non-character code points>.
1192
1193=item *
1194
1195UTF-EBCDIC
1196
1197Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.
1198
1199=item *
1200
1201UTF-16, UTF-16BE, UTF-16LE, Surrogates, and C<BOM>s (Byte Order Marks)
1202
The following items are mostly for reference and general Unicode
knowledge; Perl doesn't use these constructs internally.
1205
1206Like UTF-8, UTF-16 is a variable-width encoding, but where
1207UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units.
1208All code points occupy either 2 or 4 bytes in UTF-16: code points
1209C<U+0000..U+FFFF> are stored in a single 16-bit unit, and code
1210points C<U+10000..U+10FFFF> in two 16-bit units.  The latter case is
1211using I<surrogates>, the first 16-bit unit being the I<high
1212surrogate>, and the second being the I<low surrogate>.
1213
1214Surrogates are code points set aside to encode the C<U+10000..U+10FFFF>
1215range of Unicode code points in pairs of 16-bit units.  The I<high
1216surrogates> are the range C<U+D800..U+DBFF> and the I<low surrogates>
1217are the range C<U+DC00..U+DFFF>.  The surrogate encoding is
1218
    $hi = int(($uni - 0x10000) / 0x400) + 0xD800;  # integer division
1220    $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
1221
1222and the decoding is
1223
1224    $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
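
A round trip through these formulas might look like the sketch below,
which uses shifts and masks in place of division and modulus and
cross-checks the pair against the core L<Encode> module.  The choice of
C<U+1F600> as the sample code point is arbitrary.

    use strict;
    use warnings;
    use Encode qw(encode);

    my $uni = 0x1F600;

    # Split the 20-bit offset into two 10-bit halves.
    my $hi = 0xD800 + (($uni - 0x10000) >> 10);
    my $lo = 0xDC00 + (($uni - 0x10000) & 0x3FF);

    # Recombine the halves.
    my $back = 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);

    # Encode produces the same two 16-bit units, big-endian.
    my $utf16be = encode("UTF-16BE", chr $uni);
    my $same    = $utf16be eq pack("n2", $hi, $lo) ? 1 : 0;

    printf "U+%X => %04X %04X => U+%X (same as Encode: %d)\n",
        $uni, $hi, $lo, $back, $same;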
1225
1226Because of the 16-bitness, UTF-16 is byte-order dependent.  UTF-16
1227itself can be used for in-memory computations, but if storage or
1228transfer is required either UTF-16BE (big-endian) or UTF-16LE
1229(little-endian) encodings must be chosen.
1230
1231This introduces another problem: what if you just know that your data
1232is UTF-16, but you don't know which endianness?  Byte Order Marks, or
1233C<BOM>s, are a solution to this.  A special character has been reserved
1234in Unicode to function as a byte order marker: the character with the
1235code point C<U+FEFF> is the C<BOM>.
1236
1237The trick is that if you read a C<BOM>, you will know the byte order,
1238since if it was written on a big-endian platform, you will read the
1239bytes C<0xFE 0xFF>, but if it was written on a little-endian platform,
1240you will read the bytes C<0xFF 0xFE>.  (And if the originating platform
1241was writing in UTF-8, you will read the bytes C<0xEF 0xBB 0xBF>.)
1242
1243The way this trick works is that the character with the code point
1244C<U+FFFE> is not supposed to be in input streams, so the
1245sequence of bytes C<0xFF 0xFE> is unambiguously "C<BOM>, represented in
little-endian format" and cannot be "C<U+FFFE>, represented in big-endian
1247format".
1248
1249Surrogates have no meaning in Unicode outside their use in pairs to
1250represent other code points.  However, Perl allows them to be
1251represented individually internally, for example by saying
1252C<chr(0xD801)>, so that all code points, not just those valid for open
1253interchange, are
representable.  Unicode does define some semantics for them; for example, their
C<L</General_Category>> is C<"Cs">.  But because their use is somewhat dangerous,
1256Perl will warn (using the warning category C<"surrogate">, which is a
1257sub-category of C<"utf8">) if an attempt is made
1258to do things like take the lower case of one, or match
1259case-insensitively, or to output them.  (But don't try this on Perls
1260before 5.14.)
1261
1262=item *
1263
1264UTF-32, UTF-32BE, UTF-32LE
1265
The UTF-32 family is pretty much like the UTF-16 family, except that
1267the units are 32-bit, and therefore the surrogate scheme is not
1268needed.  UTF-32 is a fixed-width encoding.  The C<BOM> signatures are
1269C<0x00 0x00 0xFE 0xFF> for BE and C<0xFF 0xFE 0x00 0x00> for LE.
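
The C<BOM> signatures given so far can be checked with a sketch like
this one.  The function name C<bom_encoding> is hypothetical.  Note one
assumption: the UTF-32 signatures are tested first, because the UTF-32LE
signature begins with the UTF-16LE one, so a UTF-16LE stream whose first
character is C<U+0000> would be misreported.

    use strict;
    use warnings;

    # Guess an encoding from the leading bytes of a byte string.
    sub bom_encoding {
        my ($bytes) = @_;
        return "UTF-8"    if $bytes =~ /\A\xEF\xBB\xBF/;
        return "UTF-32BE" if $bytes =~ /\A\x00\x00\xFE\xFF/;
        return "UTF-32LE" if $bytes =~ /\A\xFF\xFE\x00\x00/;
        return "UTF-16BE" if $bytes =~ /\A\xFE\xFF/;
        return "UTF-16LE" if $bytes =~ /\A\xFF\xFE/;
        return;    # no recognizable BOM
    }

    for my $sample ("\xFE\xFF\x00A", "\xFF\xFEA\x00", "\xEF\xBB\xBFA", "A") {
        my $guess = bom_encoding($sample);
        print defined $guess ? "$guess\n" : "no BOM\n";
    }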
1270
1271=item *
1272
1273UCS-2, UCS-4
1274
1275Legacy, fixed-width encodings defined by the ISO 10646 standard.  UCS-2 is a 16-bit
1276encoding.  Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>,
1277because it does not use surrogates.  UCS-4 is a 32-bit encoding,
1278functionally identical to UTF-32 (the difference being that
1279UCS-4 forbids neither surrogates nor code points larger than C<0x10_FFFF>).
1280
1281=item *
1282
1283UTF-7
1284
1285A seven-bit safe (non-eight-bit) encoding, which is useful if the
1286transport or storage is not eight-bit safe.  Defined by RFC 2152.
1287
1288=back
1289
1290=head2 Non-character code points
1291
129266 code points are set aside in Unicode as "non-character code points".
1293These all have the C<Unassigned> (C<Cn>) C<L</General_Category>>, and
1294they never will
1295be assigned.  These are never supposed to be in legal Unicode input
1296streams, so that code can use them as sentinels that can be mixed in
1297with character data, and they always will be distinguishable from that data.
1298To keep them out of Perl input streams, strict UTF-8 should be
1299specified, such as by using the layer C<:encoding('UTF-8')>.  The
1300non-character code points are the 32 between C<U+FDD0> and C<U+FDEF>, and the
130134 code points C<U+FFFE>, C<U+FFFF>, C<U+1FFFE>, C<U+1FFFF>, ... C<U+10FFFE>, C<U+10FFFF>.
1302Some people are under the mistaken impression that these are "illegal",
1303but that is not true.  An application or cooperating set of applications
1304can legally use them at will internally; but these code points are
1305"illegal for open interchange".  Therefore, Perl will not accept these
1306from input streams unless lax rules are being used, and will warn
1307(using the warning category C<"nonchar">, which is a sub-category of C<"utf8">) if
1308an attempt is made to output them.
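
The count of 66 can be reproduced with a short sketch that builds the
list from the description above and checks each entry against the
Unicode C<Noncharacter_Code_Point> property.  Nothing is output as
character data, so the C<"nonchar"> warning is not expected to be
triggered here.

    use strict;
    use warnings;

    # The contiguous range of 32 ...
    my @nonchars = (0xFDD0 .. 0xFDEF);

    # ... plus the last two code points of each of the 17 planes.
    for my $plane (0x00 .. 0x10) {
        push @nonchars, ($plane << 16) + 0xFFFE, ($plane << 16) + 0xFFFF;
    }

    my $total      = @nonchars;
    my $nonchar    = grep { chr($_) =~ /\p{Noncharacter_Code_Point}/ }
                     @nonchars;
    my $unassigned = grep { chr($_) =~ /\p{Cn}/ } @nonchars;

    print "$total listed, $nonchar non-characters, $unassigned Cn\n";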
1309
1310=head2 Beyond Unicode code points
1311
1312The maximum Unicode code point is C<U+10FFFF>, and Unicode only defines
1313operations on code points up through that.  But Perl works on code
1314points up to the maximum permissible unsigned number available on the
1315platform.  However, Perl will not accept these from input streams unless
1316lax rules are being used, and will warn (using the warning category
1317C<"non_unicode">, which is a sub-category of C<"utf8">) if any are output.
1318
1319Since Unicode rules are not defined on these code points, if a
1320Unicode-defined operation is done on them, Perl uses what we believe are
1321sensible rules, while generally warning, using the C<"non_unicode">
1322category.  For example, C<uc("\x{11_0000}")> will generate such a
1323warning, returning the input parameter as its result, since Perl defines
1324the uppercase of every non-Unicode code point to be the code point
1325itself.  In fact, all the case changing operations, not just
1326uppercasing, work this way.
1327
1328The situation with matching Unicode properties in regular expressions,
1329the C<\p{}> and C<\P{}> constructs, against these code points is not as
1330clear cut, and how these are handled has changed as we've gained
1331experience.
1332
1333One possibility is to treat any match against these code points as
1334undefined.  But since Perl doesn't have the concept of a match being
1335undefined, it converts this to failing or C<FALSE>.  This is almost, but
1336not quite, what Perl did from v5.14 (when use of these code points
1337became generally reliable) through v5.18.  The difference is that Perl
1338treated all C<\p{}> matches as failing, but all C<\P{}> matches as
1339succeeding.
1340
One problem with this is that it leads to unexpected and confusing
results in some cases:
1343
 chr(0x110000) =~ /\p{ASCII_Hex_Digit=True}/    # Failed on <= v5.18
 chr(0x110000) =~ /\p{ASCII_Hex_Digit=False}/   # Failed! on <= v5.18
1346
1347That is, it treated both matches as undefined, and converted that to
1348false (raising a warning on each).  The first case is the expected
1349result, but the second is likely counterintuitive: "How could both be
1350false when they are complements?"  Another problem was that the
1351implementation optimized many Unicode property matches down to already
1352existing simpler, faster operations, which don't raise the warning.  We
1353chose to not forgo those optimizations, which help the vast majority of
1354matches, just to generate a warning for the unlikely event that an
1355above-Unicode code point is being matched against.
1356
1357As a result of these problems, starting in v5.20, what Perl does is
1358to treat non-Unicode code points as just typical unassigned Unicode
1359characters, and matches accordingly.  (Note: Unicode has atypical
1360unassigned code points.  For example, it has non-character code points,
1361and ones that, when they do get assigned, are destined to be written
1362Right-to-left, as Arabic and Hebrew are.  Perl assumes that no
1363non-Unicode code point has any atypical properties.)
1364
1365Perl, in most cases, will raise a warning when matching an above-Unicode
1366code point against a Unicode property when the result is C<TRUE> for
1367C<\p{}>, and C<FALSE> for C<\P{}>.  For example:
1368
 chr(0x110000) =~ /\p{ASCII_Hex_Digit=True}/    # Fails, no warning
 chr(0x110000) =~ /\p{ASCII_Hex_Digit=False}/   # Succeeds, with warning
1371
1372In both these examples, the character being matched is non-Unicode, so
1373Unicode doesn't define how it should match.  It clearly isn't an ASCII
1374hex digit, so the first example clearly should fail, and so it does,
1375with no warning.  But it is arguable that the second example should have
1376an undefined, hence C<FALSE>, result.  So a warning is raised for it.
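
The following sketch, intended for v5.20 or later, collects the warnings
in an array so that the results and the warnings can be inspected
separately.  The C<$SIG{__WARN__}> handler is just one assumed way of
capturing them, and the number of warnings is printed rather than
asserted because it can depend on the Perl version.

    use strict;
    use warnings;

    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };

    my $above = chr(0x110000);    # first code point above Unicode

    my $is_hex  = $above =~ /\p{ASCII_Hex_Digit=True}/  ? 1 : 0;
    my $not_hex = $above =~ /\p{ASCII_Hex_Digit=False}/ ? 1 : 0;

    # Case changing returns the code point itself.
    my $upper = uc $above;

    print "True: $is_hex  False: $not_hex\n";
    printf "uc unchanged: %s\n", $upper eq $above ? "yes" : "no";
    printf "%d warning(s) collected\n", scalar @warnings;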
1377
1378Thus the warning is raised for many fewer cases than in earlier Perls,
and only when the result is arguable.  It turns out that
none of the optimizations made by Perl (or ever likely to be made)
1381cause the warning to be skipped, so it solves both problems of Perl's
1382earlier approach.  The most commonly used property that is affected by
1383this change is C<\p{Unassigned}> which is a short form for
1384C<\p{General_Category=Unassigned}>.  Starting in v5.20, all non-Unicode
1385code points are considered C<Unassigned>.  In earlier releases the
1386matches failed because the result was considered undefined.
1387
The only place where the warning is not raised when it arguably should
have been is if optimizations cause the whole pattern match to not even
1390be attempted.  For example, Perl may figure out that for a string to
1391match a certain regular expression pattern, the string has to contain
1392the substring C<"foobar">.  Before attempting the match, Perl may look
1393for that substring, and if not found, immediately fail the match without
1394actually trying it; so no warning gets generated even if the string
1395contains an above-Unicode code point.
1396
1397This behavior is more "Do what I mean" than in earlier Perls for most
1398applications.  But it catches fewer issues for code that needs to be
1399strictly Unicode compliant.  Therefore there is an additional mode of
1400operation available to accommodate such code.  This mode is enabled if a
1401regular expression pattern is compiled within the lexical scope where
1402the C<"non_unicode"> warning class has been made fatal, say by:
1403
 use warnings FATAL => "non_unicode";
1405
1406(see L<warnings>).  In this mode of operation, Perl will raise the
1407warning for all matches against a non-Unicode code point (not just the
1408arguable ones), and it skips the optimizations that might cause the
1409warning to not be output.  (It currently still won't warn if the match
1410isn't even attempted, like in the C<"foobar"> example above.)
1411
1412In summary, Perl now normally treats non-Unicode code points as typical
1413Unicode unassigned code points for regular expression matches, raising a
1414warning only when it is arguable what the result should be.  However, if
1415this warning has been made fatal, it isn't skipped.
1416
1417There is one exception to all this.  C<\p{All}> looks like a Unicode
1418property, but it is a Perl extension that is defined to be true for all
1419possible code points, Unicode or not, so no warning is ever generated
1420when matching this against a non-Unicode code point.  (Prior to v5.20,
1421it was an exact synonym for C<\p{Any}>, matching code points C<0>
1422through C<0x10FFFF>.)
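
A brief sketch of the difference, assuming v5.20 or later; the
C<"non_unicode"> warnings are turned off only to keep the output quiet.

    use strict;
    use warnings;
    no warnings 'non_unicode';

    my $above = chr(0x110000);

    my $all = $above =~ /\p{All}/ ? 1 : 0;   # true for every code point
    my $any = $above =~ /\p{Any}/ ? 1 : 0;   # only 0 through 0x10FFFF

    print "All: $all  Any: $any\n";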

=head2 Security Implications of Unicode

Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.
Also, note the following:

=over 4

=item *

Malformed UTF-8

Unfortunately, the original specification of UTF-8 leaves some room for
interpretation of how many bytes of encoded output one should generate
from one input Unicode character.  Strictly speaking, the shortest
possible sequence of UTF-8 bytes should be generated,
because otherwise there is potential for an input buffer overflow at
the receiving end of a UTF-8 connection.  Perl always generates the
shortest length UTF-8, and with warnings on, Perl will warn about
non-shortest length UTF-8 along with other malformations, such as the
surrogates, which are not Unicode code points valid for interchange.

=item *

Regular expression pattern matching may surprise you if you're not
accustomed to Unicode.  Starting in Perl 5.14, several pattern
modifiers are available to control this, called the character set
modifiers.  Details are given in L<perlre/Character set modifiers>.

=back
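
To make the point about non-shortest UTF-8 concrete, the following is a
minimal sketch of rejecting such input at a trust boundary.  It assumes
the standard L<Encode> module; C<decode_strict_utf8> is a hypothetical
helper name of ours, not part of any module:

    use strict;
    use warnings;
    use Encode ();

    # Return the decoded character string, or undef if $octets is not
    # well-formed UTF-8 (for example, a non-shortest form).
    sub decode_strict_utf8 {
        my ($octets) = @_;
        my $copy = $octets;    # decode() with a CHECK value may modify its input
        return eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK()) };
    }

    my $shortest = "\x41";        # "A" encoded in its shortest form
    my $overlong = "\xC1\x81";    # the same code point in a two-byte form

    for my $octets ($shortest, $overlong) {
        print defined decode_strict_utf8($octets) ? "accepted\n" : "rejected\n";
    }

The strict C<"UTF-8"> encoding name is used here on purpose; the lax
C<"utf8"> name follows Perl's more permissive internal rules.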

As discussed elsewhere, Perl has one foot (two hooves?) planted in
each of two worlds: the old world of bytes and the new world of
characters, upgrading from bytes to characters when necessary.
If your legacy code does not explicitly use Unicode, no automatic
switch-over to characters should happen.  Characters shouldn't get
downgraded to bytes, either.  It is possible to accidentally mix bytes
and characters, however (see L<perluniintro>), in which case C<\w> in
regular expressions might start behaving differently (unless the C</a>
modifier is in effect).  Review your code.  Use warnings and the C<strict> pragma.
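
The effect of the character set modifiers on C<\w> can be seen in a
minimal sketch like the following, assuming Perl 5.14 or later (the
variable names are ours):

    use strict;
    use warnings;

    my $e_acute = "\xE9";    # LATIN SMALL LETTER E WITH ACUTE

    # /u forces Unicode rules; /a restricts \w to ASCII word characters.
    my $unicode_rules = $e_acute =~ /\w/u ? 1 : 0;
    my $ascii_rules   = $e_acute =~ /\w/a ? 1 : 0;

    print "/u: $unicode_rules, /a: $ascii_rules\n";

Stating the modifier explicitly makes the result independent of how the
string happens to be stored internally.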

=head2 Unicode in Perl on EBCDIC

The way Unicode is handled on EBCDIC platforms is still
experimental.  On such platforms, references to UTF-8 encoding in this
document and elsewhere should be read as meaning the UTF-EBCDIC
specified in Unicode Technical Report 16, unless ASCII vs. EBCDIC issues
are specifically discussed. There is no C<utfebcdic> pragma or
C<":utfebcdic"> layer; rather, C<"utf8"> and C<":utf8"> are reused to mean
the platform's "natural" 8-bit encoding of Unicode. See L<perlebcdic>
for more discussion of the issues.

=head2 Locales

See L<perllocale/Unicode and UTF-8>.

=head2 When Unicode Does Not Happen

While Perl does have extensive ways to input and output in Unicode,
and a few other "entry points" like the C<@ARGV> array (which can sometimes be
interpreted as UTF-8), there are still many places where Unicode
(in some encoding or another) could be given as arguments or received as
results, or both, but it is not.

The following are such interfaces.  Also, see L</The "Unicode Bug">.
For all of these interfaces Perl
currently (as of v5.16.0) simply assumes byte strings both as arguments
and results, or UTF-8 strings if the (problematic) C<encoding> pragma has been used.

One reason that Perl does not attempt to resolve the role of Unicode in
these situations is that the answers are highly dependent on the operating
system and the file system(s).  For example, whether filenames can be
in Unicode and in exactly what kind of encoding, is not exactly a
portable concept.  Similarly for C<qx> and C<system>: how well will the
"command-line interface" (and which of them?) handle Unicode?

=over 4

=item *

C<chdir>, C<chmod>, C<chown>, C<chroot>, C<exec>, C<link>, C<lstat>, C<mkdir>,
C<rename>, C<rmdir>, C<stat>, C<symlink>, C<truncate>, C<unlink>, C<utime>, C<-X>

=item *

C<%ENV>

=item *

C<glob> (aka the C<E<lt>*E<gt>>)

=item *

C<open>, C<opendir>, C<sysopen>

=item *

C<qx> (aka the backtick operator), C<system>

=item *

C<readdir>, C<readlink>

=back
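
Because these interfaces deal in bytes, one common approach is to encode
and decode names explicitly at the boundary.  The following is only a
sketch under a loud assumption: that the file system stores names as
uninterpreted bytes and that you have chosen UTF-8 for them, as is
typical on many Unix-like systems but by no means universal.  The
variable names are ours:

    use strict;
    use warnings;
    use Encode ();
    use File::Temp ();

    my $dir  = File::Temp::tempdir(CLEANUP => 1);
    my $name = "smiley-\x{263A}.txt";    # a character string

    # Encode explicitly before handing the name to the operating system.
    my $octets = Encode::encode('UTF-8', $name);
    open(my $fh, '>', "$dir/$octets") or die "open: $!";
    close($fh) or die "close: $!";

    # readdir returns bytes; decode them explicitly on the way back in.
    opendir(my $dh, $dir) or die "opendir: $!";
    my @names = map  { Encode::decode('UTF-8', $_) }
                grep { $_ ne '.' && $_ ne '..' } readdir($dh);
    closedir($dh);

    print scalar(@names), " name(s) read back\n";

On file systems that normalize or re-encode names, the name read back
may differ from the one written, so treat this as a starting point only.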

=head2 The "Unicode Bug"

The term "Unicode bug" has been applied to an inconsistency
on ASCII platforms with the
Unicode code points in the C<Latin-1 Supplement> block, that
is, between 128 and 255.  Without a locale specified, unlike all other
characters or code points, these characters have very different semantics in
byte semantics versus character semantics, unless
C<use feature 'unicode_strings'> is specified, directly or indirectly.
(It is indirectly specified by a C<use v5.12> or higher.)

In character semantics these upper-Latin1 characters are interpreted as
Unicode code points, which means
they have the same semantics as Latin-1 (ISO-8859-1).

In byte semantics (without C<unicode_strings>), they are considered to
be unassigned characters, meaning that the only semantics they have is
their ordinal numbers, and that they are
not members of various character classes.  None are considered to match C<\w>
for example, but all match C<\W>.

Perl 5.12.0 added C<unicode_strings> to force character semantics on
these code points in some circumstances, which fixed portions of the
bug; Perl 5.14.0 fixed almost all of it; and Perl 5.16.0 fixed the
remainder (so far as we know, anyway).  The lesson here is to enable
C<unicode_strings> to avoid the headaches described below.

The old, problematic behavior affects these areas:

=over 4

=item *

Changing the case of a scalar, that is, using C<uc()>, C<ucfirst()>, C<lc()>,
and C<lcfirst()>, or C<\L>, C<\U>, C<\u> and C<\l> in double-quotish
contexts, such as regular expression substitutions.
Under C<unicode_strings> starting in Perl 5.12.0, character semantics are
generally used.  See L<perlfunc/lc> for details on how this works
in combination with various other pragmas.

=item *

Using caseless (C</i>) regular expression matching.
Starting in Perl 5.14.0, regular expressions compiled within
the scope of C<unicode_strings> use character semantics
even when executed or compiled into larger
regular expressions outside the scope.

=item *

Matching any of several properties in regular expressions, namely C<\b>,
C<\B>, C<\s>, C<\S>, C<\w>, C<\W>, and all the Posix character classes
I<except> C<[[:ascii:]]>.
Starting in Perl 5.14.0, regular expressions compiled within
the scope of C<unicode_strings> use character semantics
even when executed or compiled into larger
regular expressions outside the scope.

=item *

In C<quotemeta> or its inline equivalent C<\Q>, no code points above 127
are quoted in UTF-8 encoded strings, but in byte encoded strings, code
points between 128-255 are always quoted.
Starting in Perl 5.16.0, consistent quoting rules are used within the
scope of C<unicode_strings>, as described in L<perlfunc/quotemeta>.

=back

This behavior can lead to unexpected results in which a string's semantics
suddenly change if a code point above 255 is appended to or removed from it,
which changes the string's semantics from byte to character or vice versa.  As
an example, consider the following program and its output:

 $ perl -le'
     no feature "unicode_strings";
     $s1 = "\xC2";
     $s2 = "\x{2660}";
     for ($s1, $s2, $s1.$s2) {
         print /\w/ || 0;
     }
 '
 0
 0
 1

If there's no C<\w> in C<$s1> or in C<$s2>, why does their concatenation have one?

This anomaly stems from Perl's attempt to not disturb older programs that
didn't use Unicode, and hence had no semantics for characters outside of the
ASCII range (except in a locale), along with Perl's desire to add Unicode
support seamlessly.  The result wasn't seamless: these characters were
orphaned.

For Perls earlier than those described above, or when a string is passed
to a function outside the subpragma's scope, a workaround is to always
call L<C<utf8::upgrade($string)>|utf8/Utility functions>,
or to use the standard module L<Encode>.  Also, a scalar that has any characters
whose ordinal is C<0x100> or above, or which were specified using either of the
C<\N{...}> notations, will automatically have character semantics.
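
The C<utf8::upgrade> workaround can be sketched as follows.  This is a
minimal illustration, assuming no locale is in effect; the variable
names are ours:

    use strict;
    use warnings;
    no feature 'unicode_strings';

    my $s = "\xC2";                      # stored internally as one byte

    # Byte semantics: the character is treated as unassigned.
    my $before = $s =~ /\w/ ? 1 : 0;

    # Same character, but now stored as UTF-8 internally.
    utf8::upgrade($s);

    # Character semantics: U+00C2 is a letter.
    my $after = $s =~ /\w/ ? 1 : 0;

    print "before: $before, after: $after\n";

The string compares equal to its old self after the upgrade; only the
rules applied to it have changed.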

=head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)

Sometimes (see L</"When Unicode Does Not Happen"> or L</The "Unicode Bug">)
there are situations where you simply need to force a byte
string into UTF-8, or vice versa.  The low-level calls
L<C<utf8::upgrade($bytestring)>|utf8/Utility functions> and
L<C<utf8::downgrade($utf8string[, FAIL_OK])>|utf8/Utility functions> are
the answers.

Note that C<utf8::downgrade()> can fail if the string contains characters
that don't fit into a byte.

Calling either function on a string that already is in the desired state is a
no-op.
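
A minimal sketch of both calls, including the C<FAIL_OK> argument,
might look like this (the variable names are ours; C<utf8::is_utf8> is
used only to peek at the internal state):

    use strict;
    use warnings;

    my $latin1 = "caf\xE9";     # every character fits into a byte
    my $wide   = "\x{263A}";    # WHITE SMILING FACE does not

    utf8::upgrade($latin1);
    my $flag_after_upgrade = utf8::is_utf8($latin1) ? 1 : 0;

    utf8::downgrade($latin1);
    my $flag_after_downgrade = utf8::is_utf8($latin1) ? 1 : 0;

    # With a true FAIL_OK, downgrade returns false instead of dying.
    my $wide_downgraded = utf8::downgrade($wide, 1) ? 1 : 0;

    print "upgrade: $flag_after_upgrade, ",
          "downgrade: $flag_after_downgrade, ",
          "wide downgraded: $wide_downgraded\n";

Neither call changes the characters in the string, only how they are
stored.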

=head2 Using Unicode in XS

If you want to handle Perl Unicode in XS extensions, you may find the
following C APIs useful.  See also L<perlguts/"Unicode Support"> for an
explanation about Unicode at the XS level, and L<perlapi> for the API
details.

=over 4

=item *

C<DO_UTF8(sv)> returns true if the C<UTF8> flag is on and the C<bytes>
pragma is not in effect.  C<SvUTF8(sv)> returns true if the C<UTF8>
flag is on; the C<bytes> pragma is ignored.  The C<UTF8> flag being on
does B<not> mean that there are any characters of code points greater
than 255 (or 127) in the scalar or that there are even any characters
in the scalar.  What the C<UTF8> flag means is that the sequence of
octets in the representation of the scalar is the sequence of UTF-8
encoded code points of the characters of a string.  The C<UTF8> flag
being off means that each octet in this representation encodes a
single character with code point 0..255 within the string.  Perl's
Unicode model is not to use UTF-8 until it is absolutely necessary.

=item *

C<uvchr_to_utf8(buf, chr)> writes a Unicode character code point into
a buffer encoding the code point as UTF-8, and returns a pointer
pointing after the UTF-8 bytes.  It works appropriately on EBCDIC machines.

=item *

C<utf8_to_uvchr_buf(buf, bufend, lenp)> reads UTF-8 encoded bytes from a
buffer and
returns the Unicode character code point and, optionally, the length of
the UTF-8 byte sequence.  It works appropriately on EBCDIC machines.

=item *

C<utf8_length(start, end)> returns the length of the UTF-8 encoded buffer
in characters.  C<sv_len_utf8(sv)> returns the length of the UTF-8 encoded
scalar.

=item *

C<sv_utf8_upgrade(sv)> converts the string of the scalar to its UTF-8
encoded form.  C<sv_utf8_downgrade(sv)> does the opposite, if
possible.  C<sv_utf8_encode(sv)> is like C<sv_utf8_upgrade()> except that
it does not set the C<UTF8> flag.  C<sv_utf8_decode()> does the
opposite of C<sv_utf8_encode()>.  Note that none of these are to be
used as general-purpose encoding or decoding interfaces: C<use Encode>
for that.  C<sv_utf8_upgrade()> is affected by the encoding pragma
but C<sv_utf8_downgrade()> is not (since the encoding pragma is
designed to be a one-way street).

=item *

C<is_utf8_string(buf, len)> returns true if C<len> bytes of the buffer
are valid UTF-8.

=item *

C<is_utf8_char_buf(buf, buf_end)> returns true if the pointer points to
a valid UTF-8 character.

=item *

C<UTF8SKIP(buf)> will return the number of bytes in the UTF-8 encoded
character in the buffer.  C<UNISKIP(chr)> will return the number of bytes
required to UTF-8-encode the Unicode character code point.  C<UTF8SKIP()>
is useful for example for iterating over the characters of a UTF-8
encoded buffer; C<UNISKIP()> is useful, for example, in computing
the size required for a UTF-8 encoded buffer.

=item *

C<utf8_distance(a, b)> will tell the distance in characters between the
two pointers pointing to the same UTF-8 encoded buffer.

=item *

C<utf8_hop(s, off)> will return a pointer to a UTF-8 encoded buffer
that is C<off> (positive or negative) Unicode characters displaced
from the UTF-8 buffer C<s>.  Be careful not to overstep the buffer:
C<utf8_hop()> will merrily run off the end or the beginning of the
buffer if told to do so.

=item *

C<pv_uni_display(dsv, spv, len, pvlim, flags)> and
C<sv_uni_display(dsv, ssv, pvlim, flags)> are useful for debugging the
output of Unicode strings and scalars.  By default they are useful
only for debugging--they display B<all> characters as hexadecimal code
points--but with the flags C<UNI_DISPLAY_ISPRINT>,
C<UNI_DISPLAY_BACKSLASH>, and C<UNI_DISPLAY_QQ> you can make the
output more readable.

=item *

C<foldEQ_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)> can be used to
compare two strings case-insensitively in Unicode.  For case-sensitive
comparisons you can just use C<memEQ()> and C<memNE()> as usual, except
if one string is in utf8 and the other isn't.

=back

For more information, see L<perlapi>, and F<utf8.c> and F<utf8.h>
in the Perl source code distribution.

=head2 Hacking Perl to work on earlier Unicode versions (for very serious hackers only)

Perl by default comes with the latest supported Unicode version built in, but
you can change to use any earlier one.

Download the files in the desired version of Unicode from the Unicode web
site (L<http://www.unicode.org>).  These should replace the existing files in
F<lib/unicore> in the Perl source tree.  Follow the instructions in
F<README.perl> in that directory to change some of their names, and then build
perl (see L<INSTALL>).

=head1 BUGS

=head2 Interaction with Locales

See L<perllocale/Unicode and UTF-8>.

=head2 Problems with characters in the Latin-1 Supplement range

See L</The "Unicode Bug">.

=head2 Interaction with Extensions

When Perl exchanges data with an extension, the extension should be
able to understand the UTF8 flag and act accordingly. If the
extension doesn't recognize that flag, it's likely that the extension
will return incorrectly-flagged data.

So if you're working with Unicode data, consult the documentation of
every module you're using to see whether there are any issues with
Unicode data exchange. If the documentation does not talk about Unicode
at all, suspect the worst and probably look at the source to learn how the
module is implemented. Modules written completely in Perl shouldn't
cause problems. Modules that directly or indirectly access code written
in other programming languages are at risk.

For affected functions, the simple strategy to avoid data corruption is
to always make the encoding of the exchanged data explicit. Choose an
encoding that you know the extension can handle. Convert arguments passed
to the extensions to that encoding and convert results back from that
encoding. Write wrapper functions that do the conversions for you, so
you can later change the functions when the extension catches up.

To provide an example, let's say the popular C<Foo::Bar::escape_html>
function doesn't deal with Unicode data yet. The wrapper function
would convert the argument to raw UTF-8 and convert the result back to
Perl's internal representation like so:

    use Encode ();

    sub my_escape_html ($) {
        my($what) = shift;
        return unless defined $what;
        Encode::decode_utf8(Foo::Bar::escape_html(
                                         Encode::encode_utf8($what)));
    }

Sometimes, when the extension does not convert data but just stores
and retrieves them, you will be able to use the otherwise
dangerous L<C<Encode::_utf8_on()>|Encode/_utf8_on> function. Let's say
the popular C<Foo::Bar> extension, written in C, provides a C<param>
method that lets you store and retrieve data according to these prototypes:

    $self->param($name, $value);            # set a scalar
    $value = $self->param($name);           # retrieve a scalar

If it does not yet provide support for any encoding, one could write a
derived class with such a C<param> method:

    use Encode ();

    sub param {
      my($self,$name,$value) = @_;
      utf8::upgrade($name);     # make sure it is UTF-8 encoded
      if (defined $value) {
        utf8::upgrade($value);  # make sure it is UTF-8 encoded
        return $self->SUPER::param($name,$value);
      } else {
        my $ret = $self->SUPER::param($name);
        Encode::_utf8_on($ret); # we know, it is UTF-8 encoded
        return $ret;
      }
    }

Some extensions provide filters on data entry/exit points, such as
C<DB_File::filter_store_key> and family. Look out for such filters in
the documentation of your extensions; they can make the transition to
Unicode data much easier.

=head2 Speed

Some functions are slower when working on UTF-8 encoded strings than
on byte encoded strings.  All functions that need to hop over
characters such as C<length()>, C<substr()> or C<index()>, or matching
regular expressions can work B<much> faster when the underlying data are
byte-encoded.

In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1
a caching scheme was introduced which will hopefully make the slowness
somewhat less spectacular, at least for some operations.  In general,
operations with UTF-8 encoded strings are still slower. As an example,
the Unicode properties (character classes) like C<\p{Nd}> are known to
be quite a bit slower (5-20 times) than their simpler counterparts
like C<\d> (then again, there are hundreds of Unicode characters matching C<Nd>
compared with the 10 ASCII characters matching C<\d>).
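
The difference in size between the two sets can be checked with a short
sketch like the one below.  It assumes Perl 5.14 or later, uses the
C</a> modifier to hold C<\d> to the ten ASCII digits, and for brevity
scans only the Basic Multilingual Plane; the variable names are ours:

    use strict;
    use warnings;

    my ($ascii_digits, $nd_digits) = (0, 0);

    for my $cp (0 .. 0xFFFF) {
        next if $cp >= 0xD800 && $cp <= 0xDFFF;    # skip the surrogates
        my $ch = chr $cp;
        $ascii_digits++ if $ch =~ /\d/a;           # ASCII-restricted \d
        $nd_digits++    if $ch =~ /\p{Nd}/;        # any decimal digit
    }

    print "ASCII digits: $ascii_digits, Nd in the BMP: $nd_digits\n";

The exact C<Nd> count depends on the Unicode version your perl was
built with.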

=head2 Problems on EBCDIC platforms

There are several known problems with Perl on EBCDIC platforms.  If you
want to use Perl there, send email to perlbug@perl.org.

In earlier versions, when byte and character data were concatenated,
the new string was sometimes created by
decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
old Unicode string used EBCDIC.

If you find any of these, please report them as bugs.

=head2 Porting code from perl-5.6.X

Perl 5.8 has a different Unicode model from 5.6. In 5.6 the programmer
was required to use the C<utf8> pragma to declare that a given scope
expected to deal with Unicode data and had to make sure that only
Unicode data were reaching that scope. If you have code that is
working with 5.6, you will need some of the following adjustments to
your code. The examples are written such that the code will continue
to work under 5.6, so you should be safe to try them out.

=over 3

=item *

A filehandle that should read or write UTF-8

  if ($] >= 5.008) {
    binmode $fh, ":encoding(utf8)";
  }

=item *

A scalar that is going to be passed to some extension

Be it C<Compress::Zlib>, C<Apache::Request> or any extension that has no
mention of Unicode in the manpage, you need to make sure that the
UTF8 flag is stripped off. Note that at the time of this writing
(January 2012) the mentioned modules are not UTF-8-aware. Please
check the documentation to verify if this is still true.

  if ($] >= 5.008) {
    require Encode;
    $val = Encode::encode_utf8($val); # make octets
  }

=item *

A scalar we got back from an extension

If you believe the scalar comes back as UTF-8, you will most likely
want the UTF8 flag restored:

  if ($] >= 5.008) {
    require Encode;
    $val = Encode::decode_utf8($val);
  }

=item *

Same thing, if you are really sure it is UTF-8

  if ($] >= 5.008) {
    require Encode;
    Encode::_utf8_on($val);
  }

=item *

A wrapper for L<DBI> C<fetchrow_array> and C<fetchrow_hashref>

When the database contains only UTF-8, a wrapper function or method is
a convenient way to replace all your C<fetchrow_array> and
C<fetchrow_hashref> calls. A wrapper function will also make it easier to
adapt to future enhancements in your database driver. Note that at the
time of this writing (January 2012), the DBI has no standardized way
to deal with UTF-8 data. Please check the L<DBI documentation|DBI> to verify if
that is still true.

  sub fetchrow {
    # $what is one of fetchrow_{array,hashref}
    my($self, $sth, $what) = @_;
    if ($] < 5.008) {
      return $sth->$what;
    } else {
      require Encode;
      if (wantarray) {
        my @arr = $sth->$what;
        for (@arr) {
          defined && /[^\000-\177]/ && Encode::_utf8_on($_);
        }
        return @arr;
      } else {
        my $ret = $sth->$what;
        if (ref $ret) {
          for my $k (keys %$ret) {
            defined
            && /[^\000-\177]/
            && Encode::_utf8_on($_) for $ret->{$k};
          }
          return $ret;
        } else {
          defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret;
          return $ret;
        }
      }
    }
  }

=item *

A large scalar that you know can only contain ASCII

Scalars that contain only ASCII and are marked as UTF-8 are sometimes
a drag to your program. If you recognize such a situation, just remove
the UTF8 flag:

  utf8::downgrade($val) if $] >= 5.008;

=back

=head1 SEE ALSO

L<perlunitut>, L<perluniintro>, L<perluniprops>, L<Encode>, L<open>, L<utf8>, L<bytes>,
L<perlretut>, L<perlvar/"${^UNICODE}">,
L<http://www.unicode.org/reports/tr44>.

=cut