=head1 NAME

Encode::Supported -- Encodings supported by Encode

=head1 DESCRIPTION

=head2 Encoding Names

Encoding names are case insensitive. White space in names
is ignored. In addition, an encoding may have aliases.
Each encoding has one "canonical" name. The "canonical"
name is chosen from the names of the encoding by picking
the first in the following sequence (with a few exceptions).

=over 2

=item *

The name used by the Perl community. That includes 'utf8' and 'ascii'.
Unlike aliases, canonical names reach the encoding's methods directly,
so frequently used names like 'utf8' need no alias lookup.

=item *

The MIME name as defined in IETF RFCs. This includes all "iso-"s.

=item *

The name in the IANA registry.

=item *

The name used by the organization that defined it.

=back

Where a I<de jure> canonical name differs from the one Encode uses,
it is always provided as an alias whenever the encoding is implemented.
So you can safely tell whether a given encoding is implemented
just by passing its canonical name.

Because of all the alias issues, and because in the general case
encodings have state, "Encode" uses an encoding object internally
once an operation is in progress.

=head1 Supported Encodings

As of Perl 5.8.0, at least the following encodings are recognized.
Note that unless otherwise specified, they are all case insensitive
(via alias) and all occurrences of spaces are replaced with '-'.
In other words, "ISO 8859 1" and "iso-8859-1" are identical.

Encodings are categorized and implemented in several different modules
but in most cases you don't have to C<use Encode::XX> to make them
available. Encode.pm will automatically load those modules on demand.

=head2 Built-in Encodings

The following encodings are always available.

  Canonical     Aliases                      Comments & References
  ----------------------------------------------------------------
  ascii         US-ascii ISO-646-US                         [ECMA]
  ascii-ctrl                                      Special Encoding
  iso-8859-1    latin1                                       [ISO]
  null                                            Special Encoding
  utf8          UTF-8                                    [RFC2279]
  ----------------------------------------------------------------

I<null> and I<ascii-ctrl> are special. "null" fails for all characters,
so when you set the fallback mode to PERLQQ, HTMLCREF or XMLCREF, ALL
CHARACTERS will fall back to character references. Ditto for
"ascii-ctrl" except for control characters. For fallback modes, see
L<Encode>.

=head2 Encode::Unicode -- other Unicode encodings

Unicode coding schemes other than native utf8 are supported by
Encode::Unicode, which will be autoloaded on demand.

  ----------------------------------------------------------------
  UCS-2BE       UCS-2, iso-10646-1                      [IANA, UC]
  UCS-2LE                                                     [UC]
  UTF-16                                                      [UC]
  UTF-16BE                                                    [UC]
  UTF-16LE                                                    [UC]
  UTF-32                                                      [UC]
  UTF-32BE      UCS-4                                         [UC]
  UTF-32LE                                                    [UC]
  UTF-7                                                  [RFC2152]
  ----------------------------------------------------------------

To find how (UCS-2|UTF-(16|32))(LE|BE)? differ from one another,
see L<Encode::Unicode>.

UTF-7 is a special encoding which "re-encodes" UTF-16BE into a 7-bit
encoding. It is implemented separately by Encode::Unicode::UTF7.

=head2 Encode::Byte -- Extended ASCII

Encode::Byte implements most single-byte encodings except for
Symbols and EBCDIC. The following encodings are based on single-byte
encodings implemented as extended ASCII. Most of them map
\x80-\xff (upper half) to non-ASCII characters.

=over 2

=item ISO-8859 and corresponding vendor mappings

Since there are so many, they are presented in table format with
languages and the corresponding encoding names by vendor. Note that
the table is sorted in ISO-8859 order and that the corresponding vendor
mappings differ slightly from those of ISO.
See L<http://czyborra.com/charsets/iso8859.html> for details.

  Lang/Regions  ISO/Other Std.  DOS     Windows Macintosh  Others
  ----------------------------------------------------------------
  N. America    (ASCII)         cp437        AdobeStandardEncoding
                                cp863 (DOSCanadaF)
  W. Europe     iso-8859-1      cp850   cp1252  MacRoman  nextstep
                                                         hp-roman8
                                cp860 (DOSPortuguese)
  Cntrl. Europe iso-8859-2      cp852   cp1250  MacCentralEurRoman
                                                MacCroatian
                                                MacRomanian
                                                MacRumanian
  Latin3[1]     iso-8859-3
  Latin4[2]     iso-8859-4
  Cyrillics     iso-8859-5      cp855   cp1251  MacCyrillic
    (See also next section)     cp866           MacUkrainian
  Arabic        iso-8859-6      cp864   cp1256  MacArabic
                                cp1006          MacFarsi
  Greek         iso-8859-7      cp737   cp1253  MacGreek
                                cp869 (DOSGreek2)
  Hebrew        iso-8859-8      cp862   cp1255  MacHebrew
  Turkish       iso-8859-9      cp857   cp1254  MacTurkish
  Nordics       iso-8859-10     cp865
                                cp861           MacIcelandic
                                                MacSami
  Thai          iso-8859-11[3]  cp874           MacThai
  (iso-8859-12 is nonexistent. Reserved for Indics?)
  Baltics       iso-8859-13     cp775   cp1257
  Celtics       iso-8859-14
  Latin9 [4]    iso-8859-15
  Latin10       iso-8859-16
  Vietnamese    viscii                  cp1258  MacVietnamese
  ----------------------------------------------------------------

  [1] Esperanto, Maltese, and Turkish. Turkish is now on 8859-9.
  [2] Baltics. Now on 8859-10, except for Latvian.
  [3] TIS 620 + Non-Breaking Space (0xA0 / U+00A0)
  [4] Nicknamed Latin0; the Euro sign as well as French and Finnish
      letters that are missing from 8859-1 were added.

All cp* are also available as ibm-*, ms-*, and windows-*. See also
L<http://czyborra.com/charsets/codepages.html>.

Macintosh encodings don't seem to be registered in such entities as
IANA. "Canonical" names in Encode are based upon Apple's Tech Note
1150. See L<http://developer.apple.com/technotes/tn/tn1150.html>
for details.
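
As a minimal sketch of how a vendor mapping can differ from its ISO
counterpart, the following decodes the same octet under several of the
names listed in the table above. It assumes only the C<decode> function
exported by L<Encode>; the variable names are illustrative.

  use strict;
  use warnings;
  use Encode qw(decode);

  # 0x80 is a C1 control in iso-8859-1 but the Euro sign in cp1252.
  my $iso = decode("iso-8859-1", "\x80");
  my $win = decode("cp1252",     "\x80");

  # iso-8859-15 (Latin9) places the Euro sign at 0xA4 instead.
  my $l9  = decode("iso-8859-15", "\xA4");

  printf "iso-8859-1  0x80 => U+%04X\n", ord $iso;
  printf "cp1252      0x80 => U+%04X\n", ord $win;
  printf "iso-8859-15 0xA4 => U+%04X\n", ord $l9;

Because such differences are easy to overlook, it is usually safer to
name the vendor code page explicitly than to assume its ISO counterpart.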

=item KOI8 - De Facto Standard for the Cyrillic world

Though ISO-8859 does have ISO-8859-5, the KOI8 series is far more
popular on the Net. L<Encode> comes with the following KOI charsets.
For gory details, see L<http://czyborra.com/charsets/cyrillic.html>

  ----------------------------------------------------------------
  koi8-f
  koi8-r        cp878                                    [RFC1489]
  koi8-u                                                 [RFC2319]
  ----------------------------------------------------------------

=back

=head2 gsm0338 - Hentai Latin 1

GSM0338 is for GSM handsets. Though it shares alphanumerals with
ASCII, control character ranges and other parts are mapped very
differently, mainly to store Greek characters. There are also escape
sequences (starting with 0x1B) to cover e.g. the Euro sign.

This was once handled by L<Encode::Byte> but because of all those
unusual specifications, Encode 2.20 has relocated the support to
L<Encode::GSM0338>. See L<Encode::GSM0338> for details.

=over 2

=item gsm0338 support before 2.19

Some special cases like a trailing 0x00 byte or a lone 0x1B byte are not
well-defined and decode() will return an empty string for them.
One possible workaround is

  $gsm =~ s/\x00\z/\x00\x00/;
  $uni = decode("gsm0338", $gsm);
  $uni .= "\xA0" if $gsm =~ /\x1B\z/;

Note that the Encode implementation of GSM0338 does not implement the
reuse of Latin capital letters as Greek capital letters (for example,
0x5A is U+005A (LATIN CAPITAL LETTER Z), not U+0396 (GREEK CAPITAL
LETTER ZETA)).

The GSM0338 is also covered in Encode::Byte even though it is not
an "extended ASCII" encoding.

=back

=head2 CJK: Chinese, Japanese, Korean (Multibyte)

Note that Vietnamese is listed above. Also read "Encoding vs Charset"
below.
Also note that these are implemented in distinct modules by
country, due to size concerns (simplified Chinese is mapped
to 'CN', continental China, while traditional Chinese is mapped to
'TW', Taiwan). Please refer to their respective documentation pages.

=over 2

=item Encode::CN -- Continental China

  Standard      DOS/Win Macintosh                Comment/Reference
  ----------------------------------------------------------------
  euc-cn [1]            MacChineseSimp
  (gbk)         cp936 [2]
  gb12345-raw                              { GB12345 without CES }
  gb2312-raw                                { GB2312 without CES }
  hz
  iso-ir-165
  ----------------------------------------------------------------

  [1] GB2312 is aliased to this. See L<Microsoft-related naming mess>
  [2] gbk is aliased to this. See L<Microsoft-related naming mess>

=item Encode::JP -- Japan

  Standard      DOS/Win Macintosh                Comment/Reference
  ----------------------------------------------------------------
  euc-jp
  shiftjis      cp932   macJapanese
  7bit-jis
  iso-2022-jp                                            [RFC1468]
  iso-2022-jp-1                                          [RFC2237]
  jis0201-raw  { JIS X 0201 (roman + halfwidth kana) without CES }
  jis0208-raw  { JIS X 0208 (Kanji + fullwidth kana) without CES }
  jis0212-raw  { JIS X 0212 (Extended Kanji)         without CES }
  ----------------------------------------------------------------

=item Encode::KR -- Korea

  Standard      DOS/Win Macintosh                Comment/Reference
  ----------------------------------------------------------------
  euc-kr                MacKorean                        [RFC1557]
                cp949 [1]
  iso-2022-kr                                            [RFC1557]
  johab                                  [KS X 1001:1998, Annex 3]
  ksc5601-raw                              { KSC5601 without CES }
  ----------------------------------------------------------------

  [1] ks_c_5601-1987, (x-)?windows-949, and uhc are aliased to this.
      See below.
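
The on-demand loading mentioned earlier can be sketched as follows: the
script never says C<use Encode::KR>, yet C<euc-kr> is available. The
sample string and the variable names are illustrative only.

  use strict;
  use warnings;
  use Encode qw(encode decode);

  # Two Hangul syllables, U+D55C and U+AE00.
  my $text  = "\x{D55C}\x{AE00}";

  # Encode loads the module providing euc-kr on first use.
  my $bytes = encode("euc-kr", $text);
  my $back  = decode("euc-kr", $bytes);

  printf "%d characters <-> %d octets\n", length $text, length $bytes;
  print $back eq $text ? "round trip ok\n" : "round trip failed\n";

The same pattern should apply to the other CJK modules listed in this
section, with the encoding name changed accordingly.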

=item Encode::TW -- Taiwan

  Standard      DOS/Win Macintosh                Comment/Reference
  ----------------------------------------------------------------
  big5-eten     cp950   MacChineseTrad {big5 aliased to big5-eten}
  big5-hkscs
  ----------------------------------------------------------------

=item Encode::HanExtra -- More Chinese via CPAN

Due to size concerns, the additional Chinese encodings below are
distributed separately on CPAN, under the name Encode::HanExtra.

  Standard      DOS/Win Macintosh                Comment/Reference
  ----------------------------------------------------------------
  big5ext                                   CMEX's Big5e Extension
  big5plus                                  CMEX's Big5+ Extension
  cccii         Chinese Character Code for Information Interchange
  euc-tw                                  EUC (Extended Unix Code)
  gb18030                          GBK with Traditional Characters
  ----------------------------------------------------------------

=item Encode::JIS2K -- JIS X 0213 encodings via CPAN

Due to size concerns, the additional Japanese encodings below are
distributed separately on CPAN, under the name Encode::JIS2K.

  Standard      DOS/Win Macintosh                Comment/Reference
  ----------------------------------------------------------------
  euc-jisx0213
  shiftjisx0213
  iso-2022-jp-3
  jis0213-1-raw
  jis0213-2-raw
  ----------------------------------------------------------------

=back

=head2 Miscellaneous encodings

=over 2

=item Encode::EBCDIC

See L<perlebcdic> for details.

  ----------------------------------------------------------------
  cp37
  cp500
  cp875
  cp1026
  cp1047
  posix-bc
  ----------------------------------------------------------------

=item Encode::Symbol

For symbols and dingbats.

  ----------------------------------------------------------------
  symbol
  dingbats
  MacDingbats
  AdobeZdingbat
  AdobeSymbol
  ----------------------------------------------------------------

=item Encode::MIME::Header

Strictly speaking, the MIME header encoding documented in RFC 2047 is
more encapsulation than encoding. However, supporting it is imperative
in the modern world, so it is supported.

  ----------------------------------------------------------------
  MIME-Header                                            [RFC2047]
  MIME-B                                                 [RFC2047]
  MIME-Q                                                 [RFC2047]
  ----------------------------------------------------------------

=item Encode::Guess

This one is not the name of an encoding but a utility that lets you
pick the most appropriate encoding for a piece of data out of given
I<suspects>. See L<Encode::Guess> for details.

=back

=head1 Unsupported encodings

The following encodings are not supported as yet; some because they
are rarely used, some because of technical difficulties. They may
be supported by external modules via CPAN in the future, however.

=over 2

=item ISO-2022-JP-2 [RFC1554]

Not very popular yet. Needs the Unicode Database or an equivalent to
implement encode(), because it includes JIS X 0208/0212, KSC5601, and
GB2312 simultaneously, whose code points in Unicode overlap. So you
need to look up the database to determine to which character set a
given Unicode character should belong.

=item ISO-2022-CN [RFC1922]

Not very popular. Needs CNS 11643-1 and -2 which are not available in
this module. CNS 11643 is supported (via euc-tw) in Encode::HanExtra.
Audrey Tang may add support for this encoding in her module in future.

=item Various HP-UX encodings

The following are unsupported due to the lack of mapping data.

  '8'  - arabic8, greek8, hebrew8, kana8, thai8, and turkish8
  '15' - japanese15, korean15, and roi15

=item Cyrillic encoding ISO-IR-111

Anton Tagunov doubts its usefulness.

=item ISO-8859-8-1 [Hebrew]

None of the Encode team knows Hebrew well enough (ISO-8859-8, cp1255
and MacHebrew are supported only because mappings were
available at L<http://www.unicode.org/>). Contributions welcome.

=item ISIRI 3342, Iran System, ISIRI 2900 [Farsi]

Ditto.

=item Thai encoding TCVN

Ditto.

=item Vietnamese encodings VPS

Though Jungshik Shin has reported that Mozilla supports this encoding,
the report came too late for us to add it before 5.8.0. In the future,
it may be available via a separate module. See
L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf>
and
L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut>
if you are interested in helping us.

=item Various Mac encodings

The following are unsupported due to the lack of mapping data.

  MacArmenian,  MacBengali,   MacBurmese,   MacEthiopic
  MacExtArabic, MacGeorgian,  MacKannada,   MacKhmer
  MacLaotian,   MacMalayalam, MacMongolian, MacOriya
  MacSinhalese, MacTamil,     MacTelugu,    MacTibetan
  MacVietnamese

The rest, which are already available, are based upon the vendor
mappings at L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/>.

=item (Mac) Indic encodings

The maps for the following are available at L<http://www.unicode.org/>
but remain unsupported because those encodings need an algorithmic
approach, currently unsupported by F<enc2xs>:

  MacDevanagari
  MacGurmukhi
  MacGujarati

For details, please see C<Unicode mapping issues and notes:> at
L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT>.

I believe this issue is prevalent not only for Mac Indics but also in
other Indic encodings, but the above were the only Indic encoding
maps that I could find at L<http://www.unicode.org/>.

=back

=head1 Encoding vs. Charset -- terminology

We are used to using the terms (character) I<encoding> and I<character
set> interchangeably. But just as confusing the terms byte and
character is dangerous and the terms should be differentiated when
needed, we need to differentiate I<encoding> and I<character set>.

To understand that, here is a description of how we make computers
grok our characters.

=over 2

=item *

First we start with which characters to include. We call this
collection of characters a I<character repertoire>.

=item *

Then we have to give each character a unique ID so your computer can
tell the difference between 'a' and 'A'. This itemized character
repertoire is now a I<character set>.

=item *

If your computer can grok the character set without further
processing, you can go ahead and use it. This is called a I<coded
character set> (CCS) or I<raw character encoding>. ASCII is used this
way in most cases.

=item *

But in many cases, especially multi-byte CJK encodings, you have to
tweak a little more. Your network connection may not accept any data
with the Most Significant Bit set, and your computer may not be able to
tell if a given byte is a whole character or just half of it. So you
have to I<encode> the character set to use it.

A I<character encoding scheme> (CES) determines how to encode a given
character set, or a set of multiple character sets. 7bit ISO-2022 is
an example of a CES. You switch between character sets via I<escape
sequences>.

=back

Technically, or mathematically, speaking, a character set encoded in
such a CES that maps character by character may form a CCS.
EUC is such
an example. The CES of EUC is as follows:

=over 2

=item *

Map ASCII unchanged.

=item *

Map a character set that consists of 94**N or 96**N members
by adding 0x80 to each byte.

=item *

You can also use 0x8e and 0x8f to indicate that the following sequence of
characters belongs to yet another character set. 0x80 is added to each
following byte.

=back

By carefully looking at the encoded byte sequence, you can see that
each byte sequence corresponds to a unique number. In that sense, EUC
is a CCS generated by the CES above from up to four CCS (complicated?).
UTF-8 falls into this category. See L<perlunicode/"UTF-8"> to find out
how UTF-8 maps Unicode to a byte sequence.

You may also have found out by now why 7bit ISO-2022 cannot comprise
a CCS. If you look at a byte sequence \x21\x21, you can't tell if
it is two !'s or IDEOGRAPHIC SPACE. EUC maps the latter to \xA1\xA1
so you have no trouble telling "!!" from an IDEOGRAPHIC SPACE.

=head1 Encoding Classification (by Anton Tagunov and Dan Kogai)

This section tries to classify the supported encodings by their
applicability for information exchange over the Internet and to
choose the most suitable aliases to name them in the context of
such communication.

=over 2

=item *

To (en|de)code encodings marked by C<(**)>, you need
C<Encode::HanExtra>, available from CPAN.

=back

Encoding names

  US-ASCII    UTF-8    ISO-8859-*  KOI8-R
  Shift_JIS   EUC-JP   ISO-2022-JP ISO-2022-JP-1
  EUC-KR      Big5     GB2312

are registered with IANA as preferred MIME names and may
be used over the Internet.

C<Shift_JIS> has been made official by JIS X 0208:1997.
L<Microsoft-related naming mess> gives details.

C<GB2312> is the IANA name for C<EUC-CN>.
See L<Microsoft-related naming mess> for details.
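
To see how some of the preferred MIME names above map onto Encode's
canonical names, one can ask C<Encode::resolve_alias>, which returns the
canonical name for a name it knows and a false value otherwise. This is
only a sketch; the list of names is a sample, not the full registry.

  use strict;
  use warnings;
  use Encode ();

  for my $mime (qw(US-ASCII KOI8-R Shift_JIS EUC-JP EUC-KR Big5 GB2312)) {
      # resolve_alias() yields a false value for unknown names.
      my $canonical = Encode::resolve_alias($mime);
      printf "%-10s => %s\n", $mime, $canonical || "(not available)";
  }

A name taken from a MIME header can therefore be checked this way before
any data is passed to encode() or decode().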

C<GB_2312-80> I<raw> encoding is available as C<gb2312-raw>
with Encode. See L<Encode::CN> for details.

  EUC-CN
  KOI8-U        [RFC2319]

have not been registered with IANA (as of March 2002) but
seem to be supported by major web browsers.
The IANA name for C<EUC-CN> is C<GB2312>.

  KS_C_5601-1987

is heavily misused.
See L<Microsoft-related naming mess> for details.

C<KS_C_5601-1987> I<raw> encoding is available as C<ksc5601-raw>
with Encode. See L<Encode::KR> for details.

  UTF-16 UTF-16BE UTF-16LE

are IANA-registered C<charset>s. See [RFC 2781] for details.
Jungshik Shin reports that UTF-16 with a BOM is well accepted
by MS IE 5/6 and NS 4/6. Beware however that

=over 2

=item *

C<UTF-16> support in any software you're going to be
using/interoperating with has probably been less tested
than C<UTF-8> support

=item *

C<UTF-8> coded data seamlessly passes traditional
command piping (C<cat>, C<more>, etc.) while C<UTF-16> coded
data is likely to cause confusion (with its zero bytes,
for example)

=item *

it is beyond the power of words to describe the way HTML browsers
encode non-C<ASCII> form data. To get a general impression, visit
L<http://www.alanflavell.org.uk/charset/form-i18n.html>.
While encoding of form data has stabilized for C<UTF-8> encoded pages
(at least IE 5/6, NS 6, and Opera 6 behave consistently), be sure to
expect fun (and cross-browser discrepancies) with C<UTF-16> encoded
pages!

=back

The rule of thumb is to use C<UTF-8> unless you know what
you're doing and unless you really benefit from using C<UTF-16>.

  ISO-IR-165    [RFC1345]
  VISCII
  GB 12345
  GB 18030 (**)  (see links below)
  EUC-TW   (**)

are totally valid encodings but not registered at IANA.
The names under which they are listed here are probably the
most widely-known names for these encodings and are the recommended
names.

  BIG5PLUS (**)

is a proprietary name.

=head2 Microsoft-related naming mess

Microsoft products misuse the following names:

=over 2

=item KS_C_5601-1987

Microsoft extension to C<EUC-KR>.

Proper names: C<CP949>, C<UHC>, C<x-windows-949> (as used by Mozilla).

See L<http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html>
for details.

Encode aliases C<KS_C_5601-1987> to C<cp949> to reflect this common
misusage. I<Raw> C<KS_C_5601-1987> encoding is available as
C<ksc5601-raw>.

See L<Encode::KR> for details.

=item GB2312

Microsoft extension to C<EUC-CN>.

Proper names: C<CP936>, C<GBK>.

C<GB2312> has been registered in the C<EUC-CN> meaning at
IANA. This has partially repaired the situation: Microsoft's
C<GB2312> has become a superset of the official C<GB2312>.

Encode aliases C<GB2312> to C<euc-cn> in full agreement with
IANA registration. C<cp936> is supported separately.
I<Raw> C<GB_2312-80> encoding is available as C<gb2312-raw>.

See L<Encode::CN> for details.

=item Big5

Microsoft extension to C<Big5>.

Proper name: C<CP950>.

Encode separately supports C<Big5> and C<cp950>.

=item Shift_JIS

Microsoft's understanding of C<Shift_JIS>.

JIS has not endorsed the full Microsoft standard however.
The official C<Shift_JIS> includes only JIS X 0201 and JIS X 0208
character sets, while Microsoft has always used C<Shift_JIS>
to encode a wider character repertoire. See C<IANA> registration for
C<Windows-31J>.

As a historical predecessor, Microsoft's variant
probably has more rights for the name, though it may be objected
that Microsoft shouldn't have used JIS as part of the name
in the first place.
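
The difference in repertoire can be observed directly. The sketch below
decodes the octets 0x87 0x40, which Microsoft's code page 932 assigns to
U+2460 (CIRCLED DIGIT ONE), a character outside JIS X 0208. How Encode's
C<shiftjis> treats these octets is an assumption here, which is why the
script prints both results rather than presuming them.

  use strict;
  use warnings;
  use Encode qw(decode);

  my $octets = "\x87\x40";

  for my $name (qw(shiftjis cp932)) {
      # Without a CHECK argument, unmappable input is replaced, not fatal.
      my $chars = decode($name, $octets);
      my @cps   = map { sprintf "U+%04X", ord } split //, $chars;
      printf "%-8s => %s\n", $name, "@cps";
  }

If the two lines of output differ, the data at hand needs the Microsoft
variant rather than the JIS one.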

Unambiguous name: C<CP932>. C<IANA> name (also used by Mozilla, and
provided as an alias by Encode): C<Windows-31J>.

Encode separately supports C<Shift_JIS> and C<cp932>.

=back

=head1 Glossary

=over 2

=item character repertoire

A collection of unique characters. A I<character> set in the strictest
sense. At this stage, characters are not numbered.

=item coded character set (CCS)

A character set that is mapped in a way computers can use directly.
Many character encodings, including EUC, fall in this category.

=item character encoding scheme (CES)

An algorithm to map a character set to a byte sequence. You don't
have to be able to tell which character set a given byte sequence
belongs to. 7-bit ISO-2022 is a CES but it cannot be a CCS. EUC is an
example of being both a CCS and CES.

=item charset (in MIME context)

Has long been used to mean C<encoding>, that is, a CES.

While the word combination C<character set> has lost this meaning
in MIME context since [RFC 2130], the C<charset> abbreviation has
retained it. This is how [RFC 2277] and [RFC 2278] bless C<charset>:

  This document uses the term "charset" to mean a set of rules for
  mapping from a sequence of octets to a sequence of characters, such
  as the combination of a coded character set and a character encoding
  scheme; this is also what is used as an identifier in MIME "charset="
  parameters, and registered in the IANA charset registry ...  (Note
  that this is NOT a term used by other standards bodies, such as ISO).
  [RFC 2277]

=item EUC

Extended Unix Code. See ISO-2022.

=item ISO-2022

A CES that was carefully designed to coexist with ASCII. There is a
7-bit version and an 8-bit version.

The 7-bit version switches character sets via escape sequences so it
cannot form a CCS.
Since this is more difficult to handle in programs
than the 8-bit version, the 7-bit version is not very popular except for
iso-2022-jp, the I<de facto> standard CES for e-mails.

The 8-bit version can form a CCS. EUC and ISO-8859 are two examples
thereof. Pre-5.6 perl could use them as string literals.

=item UCS

Short for I<Universal Character Set>. When you say just UCS, it means
I<Unicode>.

=item UCS-2

ISO/IEC 10646 encoding form: Universal Character Set coded in two
octets.

=item Unicode

A character set that aims to include all character repertoires of the
world. Many character sets in various national as well as industrial
standards have become, in a way, just subsets of Unicode.

=item UTF

Short for I<Unicode Transformation Format>. Determines how to map a
Unicode character into a byte sequence.

=item UTF-16

A UTF in 16-bit encoding. Can be either big endian or little
endian. The big endian version is called UTF-16BE (equal to UCS-2 +
surrogate support) and the little endian version is called UTF-16LE.

=back

=head1 See Also

L<Encode>,
L<Encode::Byte>,
L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>,
L<Encode::EBCDIC>, L<Encode::Symbol>,
L<Encode::MIME::Header>, L<Encode::Guess>

=head1 References

=over 2

=item ECMA

European Computer Manufacturers Association
L<http://www.ecma.ch>

=over 2

=item ECMA-035 (eq C<ISO-2022>)

L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM>

The specification of ISO-2022 is available from the link above.

=back

=item IANA

Internet Assigned Numbers Authority
L<http://www.iana.org/>

=over 2

=item Assigned Charset Names by IANA

L<http://www.iana.org/assignments/character-sets>

Most of the C<canonical names> in Encode derive from this list
so you can directly apply the string you have extracted from MIME
header of mails and web pages.

=back

=item ISO

International Organization for Standardization
L<http://www.iso.ch/>

=item RFC

Request For Comments -- need I say more?
L<http://www.rfc-editor.org/>, L<http://www.ietf.org/rfc.html>,
L<http://www.faqs.org/rfcs/>

=item UC

Unicode Consortium
L<http://www.unicode.org/>

=over 2

=item Unicode Glossary

L<http://www.unicode.org/glossary/>

The glossary of this document is based upon this site.

=back

=back

=head2 Other Notable Sites

=over 2

=item czyborra.com

L<http://czyborra.com/>

Contains a lot of useful information, especially gory details of ISO
vs. vendor mappings.

=item CJK.inf

L<http://examples.oreilly.com/cjkvinfo/doc/cjk.inf>

Somewhat obsolete (last update in 1996), but still useful. Also try

L<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>

You will find brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030>.

=item Jungshik Shin's Hangul FAQ

L<http://jshin.net/faq>

And especially its subject 8.

L<http://jshin.net/faq/qa8.html>

A comprehensive overview of the Korean (C<KS *>) standards.

=item debian.org: "Introduction to i18n"

A brief description for most of the mentioned CJK encodings is
contained in
L<http://www.debian.org/doc/manuals/intro-i18n/ch-codes.en.html>

=back

=head2 Offline sources

=over 2

=item C<CJKV Information Processing> by Ken Lunde

CJKV Information Processing
1999 O'Reilly & Associates, ISBN : 1-56592-224-7

The modern successor of C<CJK.inf>.

Features a comprehensive coverage of CJKV character sets and
encodings along with many other issues faced by anyone trying
to better support CJKV languages/scripts in all the areas of
information processing.

To purchase this book, visit
L<http://oreilly.com/catalog/9780596514471/>
or your favourite bookstore.

=back

=cut