1=head1 NAME
2
3Encode::Supported -- Encodings supported by Encode
4
5=head1 DESCRIPTION
6
7=head2 Encoding Names
8
9Encoding names are case insensitive. White space in names
10is ignored.  In addition, an encoding may have aliases.
11Each encoding has one "canonical" name.  The "canonical"
12name is chosen from the names of the encoding by picking
13the first in the following sequence (with a few exceptions).
14
15=over 2
16
17=item *
18
The name used by the Perl community.  That includes 'utf8' and 'ascii'.
Unlike aliases, canonical names reach the encoding method directly, so
frequently used names such as 'utf8' need no alias lookup.
22
23=item *
24
25The MIME name as defined in IETF RFCs.  This includes all "iso-"s.
26
27=item *
28
29The name in the IANA registry.
30
31=item *
32
33The name used by the organization that defined it.
34
35=back
36
Where a I<de jure> canonical name differs from the one used by the
Encode module, it is always provided as an alias, as long as the
encoding is implemented at all.  So you can safely tell whether a given
encoding is implemented just by passing its canonical name.
41
42Because of all the alias issues, and because in the general case
43encodings have state, "Encode" uses an encoding object internally
44once an operation is in progress.
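
As a rough illustration of the naming rules above, the sketch below
resolves a few names and then works with the encoding object directly.
C<Encode::resolve_alias> and C<find_encoding> are part of the Encode
API; the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# resolve_alias() returns the canonical name, or a false value
# when the encoding is not implemented.
my $canon   = Encode::resolve_alias("latin1");        # an alias
my $upper   = Encode::resolve_alias("ISO-8859-1");    # case does not matter
my $missing = Encode::resolve_alias("iso-8859-12");   # nonexistent encoding

# find_encoding() hands back the encoding object used internally.
my $enc   = find_encoding("latin1");
my $name  = $enc->name;            # canonical name of the object
my $raw   = "caf\xE9";             # Latin-1 bytes
my $text  = $enc->decode($raw);    # bytes -> characters
my $bytes = $enc->encode($text);   # characters -> bytes

print "$canon $name\n";
```

Holding on to the object avoids repeating the name lookup when many
strings are converted with the same encoding.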
45
46=head1 Supported Encodings
47
48As of Perl 5.8.0, at least the following encodings are recognized.
49Note that unless otherwise specified, they are all case insensitive
(via alias) and all occurrences of spaces are replaced with '-'.
51In other words, "ISO 8859 1" and "iso-8859-1" are identical.
52
53Encodings are categorized and implemented in several different modules
54but you don't have to C<use Encode::XX> to make them available for
55most cases.  Encode.pm will automatically load those modules on demand.
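
A minimal sketch of the on-demand loading described above: only
C<Encode> itself is loaded explicitly, yet a Japanese encoding from
Encode::JP can be used.  The sample bytes are chosen for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);    # note: no "use Encode::JP" here

# 0x82 0xA0 is HIRAGANA LETTER A (U+3042) in Shift_JIS.
my $sjis = "\x82\xA0";
my $text = decode("shiftjis", $sjis);    # the module is loaded behind the scenes

# Convert the same character to EUC-JP.
my $euc = encode("euc-jp", $text);

printf "U+%04X -> %s\n", ord($text),
    join(" ", map { sprintf "%02X", ord($_) } split //, $euc);
```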
56
57=head2 Built-in Encodings
58
59The following encodings are always available.
60
61  Canonical     Aliases                      Comments & References
62  ----------------------------------------------------------------
63  ascii         US-ascii ISO-646-US                         [ECMA]
  ascii-ctrl                                      Special Encoding
65  iso-8859-1    latin1                                       [ISO]
  null                                            Special Encoding
67  utf8          UTF-8                                    [RFC2279]
68  ----------------------------------------------------------------
69
I<null> and I<ascii-ctrl> are special.  "null" fails for every character,
so when you set the fallback mode to PERLQQ, HTMLCREF or XMLCREF, ALL
72CHARACTERS will fall back to character references.  Ditto for
73"ascii-ctrl" except for control characters.  For fallback modes, see
74L<Encode>.
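
The following sketch shows what the three fallback modes produce.  It
uses "ascii" with one non-ASCII character so that the effect is easy to
see, and then "null", where every character is expected to fall back.
The sample string is an assumption made for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# With CHECK set, encode() may modify its source argument,
# so each call works on a fresh copy held in a variable.
my $src;

$src = "caf\x{E9}";
my $perlqq = encode("ascii", $src, Encode::FB_PERLQQ());     # \x{HHHH} escapes

$src = "caf\x{E9}";
my $html = encode("ascii", $src, Encode::FB_HTMLCREF());     # decimal references

$src = "caf\x{E9}";
my $xml = encode("ascii", $src, Encode::FB_XMLCREF());       # hexadecimal references

# "null" rejects every character, so all of them become references.
$src = "caf\x{E9}";
my $all = encode("null", $src, Encode::FB_XMLCREF());

print "$perlqq\n$html\n$xml\n$all\n";
```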
75
76=head2 Encode::Unicode -- other Unicode encodings
77
78Unicode coding schemes other than native utf8 are supported by
79Encode::Unicode, which will be autoloaded on demand.
80
81  ----------------------------------------------------------------
82  UCS-2BE       UCS-2, iso-10646-1                      [IANA, UC]
83  UCS-2LE                                                     [UC]
84  UTF-16                                                      [UC]
85  UTF-16BE                                                    [UC]
86  UTF-16LE                                                    [UC]
87  UTF-32                                                      [UC]
  UTF-32BE      UCS-4                                         [UC]
89  UTF-32LE                                                    [UC]
90  UTF-7                                                  [RFC2152]
91  ----------------------------------------------------------------
92
93To find how (UCS-2|UTF-(16|32))(LE|BE)? differ from one another,
94see L<Encode::Unicode>.
95
96UTF-7 is a special encoding which "re-encodes" UTF-16BE into a 7-bit
97encoding.  It is implemented separately by Encode::Unicode::UTF7.
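
To make the differences concrete, here is a small sketch that encodes
the single character "A" with several of the schemes listed above and
prints the resulting octets.  The choice of character is arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $char = "A";    # U+0041

my $be  = encode("UTF-16BE", $char);    # 00 41
my $le  = encode("UTF-16LE", $char);    # 41 00
my $bom = encode("UTF-16",   $char);    # FE FF 00 41 (BOM prepended, big endian)
my $u32 = encode("UTF-32BE", $char);    # 00 00 00 41

# Decoding plain "UTF-16" relies on the BOM to pick the byte order.
my $back = decode("UTF-16", "\xFF\xFE\x41\x00");    # little-endian input

for my $octets ($be, $le, $bom, $u32) {
    print join(" ", map { sprintf "%02X", ord($_) } split //, $octets), "\n";
}
```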
98
99=head2 Encode::Byte -- Extended ASCII
100
101Encode::Byte implements most single-byte encodings except for
102Symbols and EBCDIC. The following encodings are based on single-byte
103encodings implemented as extended ASCII.  Most of them map
104\x80-\xff (upper half) to non-ASCII characters.
105
106=over 2
107
108=item ISO-8859 and corresponding vendor mappings
109
110Since there are so many, they are presented in table format with
languages and corresponding encoding names by vendors.  Note that
the table is sorted in order of ISO-8859 and that the corresponding
vendor mappings differ slightly from those of ISO.  See
114L<http://czyborra.com/charsets/iso8859.html> for details.
115
116  Lang/Regions  ISO/Other Std.  DOS     Windows Macintosh  Others
117  ----------------------------------------------------------------
118  N. America    (ASCII)         cp437        AdobeStandardEncoding
119                                cp863 (DOSCanadaF)
120  W. Europe     iso-8859-1      cp850   cp1252  MacRoman  nextstep
121                                                         hp-roman8
122                                cp860 (DOSPortuguese)
123  Cntrl. Europe iso-8859-2      cp852   cp1250  MacCentralEurRoman
124                                                MacCroatian
125                                                MacRomanian
126                                                MacRumanian
127  Latin3[1]     iso-8859-3
128  Latin4[2]     iso-8859-4
129  Cyrillics     iso-8859-5      cp855   cp1251  MacCyrillic
130    (See also next section)     cp866           MacUkrainian
131  Arabic        iso-8859-6      cp864   cp1256  MacArabic
132                                cp1006          MacFarsi
133  Greek         iso-8859-7      cp737   cp1253  MacGreek
134                                cp869 (DOSGreek2)
135  Hebrew        iso-8859-8      cp862   cp1255  MacHebrew
136  Turkish       iso-8859-9      cp857   cp1254  MacTurkish
137  Nordics       iso-8859-10     cp865
138                                cp861           MacIcelandic
139                                                MacSami
140  Thai          iso-8859-11[3]  cp874           MacThai
141  (iso-8859-12 is nonexistent. Reserved for Indics?)
  Baltics       iso-8859-13     cp775   cp1257
143  Celtics       iso-8859-14
144  Latin9 [4]    iso-8859-15
145  Latin10       iso-8859-16
146  Vietnamese    viscii                  cp1258  MacVietnamese
147  ----------------------------------------------------------------
148
149  [1] Esperanto, Maltese, and Turkish. Turkish is now on 8859-9.
150  [2] Baltics.  Now on 8859-10, except for Latvian.
151  [3] TIS 620 +  Non-Breaking Space (0xA0 / U+00A0)
152  [4] Nicknamed Latin0; the Euro sign as well as French and Finnish
153      letters that are missing from 8859-1 were added.
154
155All cp* are also available as ibm-*, ms-*, and windows-* .  See also
156L<http://czyborra.com/charsets/codepages.html>.
157
158Macintosh encodings don't seem to be registered in such entities as
159IANA.  "Canonical" names in Encode are based upon Apple's Tech Note
1601150.  See L<http://developer.apple.com/technotes/tn/tn1150.html>
161for details.
162
163=item KOI8 - De Facto Standard for the Cyrillic world
164
Though ISO-8859 does have ISO-8859-5, the KOI8 series is far more
popular on the Net.  L<Encode> comes with the following KOI charsets.
For gory details, see L<http://czyborra.com/charsets/cyrillic.html>.
168
169  ----------------------------------------------------------------
170  koi8-f
171  koi8-r cp878                                           [RFC1489]
172  koi8-u                                                 [RFC2319]
173  ----------------------------------------------------------------
174
175=back
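
As a hedged illustration of how the single-byte encodings above differ
in their upper half, the sketch below decodes a few individual bytes.
The bytes were picked as examples and are not taken from the tables.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Byte 0xA4 is CURRENCY SIGN in iso-8859-1 but EURO SIGN in iso-8859-15.
my $latin1 = decode("iso-8859-1",  "\xA4");    # U+00A4
my $latin9 = decode("iso-8859-15", "\xA4");    # U+20AC

# Vendor code pages place characters elsewhere: cp1252 has the Euro at 0x80.
my $cp1252 = decode("cp1252", "\x80");         # U+20AC

# KOI8-R and ISO-8859-5 put the same Cyrillic letter at different bytes.
my $koi8 = decode("koi8-r",     "\xC1");       # U+0430 CYRILLIC SMALL LETTER A
my $iso5 = decode("iso-8859-5", "\xD0");       # U+0430 as well

printf "U+%04X\n", ord($_) for $latin1, $latin9, $cp1252, $koi8, $iso5;
```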
176
177=head2 gsm0338 - Hentai Latin 1
178
179GSM0338 is for GSM handsets. Though it shares alphanumerals with
180ASCII, control character ranges and other parts are mapped very
181differently, mainly to store Greek characters.  There are also escape
182sequences (starting with 0x1B) to cover e.g. the Euro sign.
183
This was once handled by L<Encode::Byte>, but because of all those
unusual specifications, Encode 2.20 has relocated the support to
L<Encode::GSM0338>.  See L<Encode::GSM0338> for details.
187
188=over 2
189
190=item gsm0338 support before 2.19
191
192Some special cases like a trailing 0x00 byte or a lone 0x1B byte are not
193well-defined and decode() will return an empty string for them.
194One possible workaround is
195
196   $gsm =~ s/\x00\z/\x00\x00/;
197   $uni = decode("gsm0338", $gsm);
198   $uni .= "\xA0" if $gsm =~ /\x1B\z/;
199
Note that the Encode implementation of GSM0338 does not implement the
reuse of Latin capital letters as Greek capital letters (for example,
0x5A is U+005A (LATIN CAPITAL LETTER Z), not U+0396 (GREEK CAPITAL
LETTER ZETA)).
204
205The GSM0338 is also covered in Encode::Byte even though it is not
206an "extended ASCII" encoding.
207
208=back
209
210=head2 CJK: Chinese, Japanese, Korean (Multibyte)
211
Note that Vietnamese is listed above.  Also read "Encoding vs. Charset"
below.  These encodings are implemented in separate modules, one per
country, because of their size (simplified Chinese is mapped
to 'CN', continental China, while traditional Chinese is mapped to
'TW', Taiwan).  Please refer to their respective documentation pages.
217
218=over 2
219
220=item Encode::CN -- Continental China
221
222  Standard      DOS/Win Macintosh                Comment/Reference
223  ----------------------------------------------------------------
224  euc-cn [1]            MacChineseSimp
225  (gbk)         cp936 [2]
226  gb12345-raw                      { GB12345 without CES }
227  gb2312-raw                       { GB2312  without CES }
228  hz
229  iso-ir-165
230  ----------------------------------------------------------------
231
232  [1] GB2312 is aliased to this.  See L<Microsoft-related naming mess>
233  [2] gbk is aliased to this.  See L<Microsoft-related naming mess>
234
235=item Encode::JP -- Japan
236
237  Standard      DOS/Win Macintosh                Comment/Reference
238  ----------------------------------------------------------------
239  euc-jp
240  shiftjis      cp932   macJapanese
241  7bit-jis
242  iso-2022-jp                                            [RFC1468]
243  iso-2022-jp-1                                          [RFC2237]
244  jis0201-raw  { JIS X 0201 (roman + halfwidth kana) without CES }
245  jis0208-raw  { JIS X 0208 (Kanji + fullwidth kana) without CES }
246  jis0212-raw  { JIS X 0212 (Extended Kanji)         without CES }
247  ----------------------------------------------------------------
248
249=item Encode::KR -- Korea
250
251  Standard      DOS/Win Macintosh                Comment/Reference
252  ----------------------------------------------------------------
253  euc-kr                MacKorean                        [RFC1557]
254                cp949 [1]
255  iso-2022-kr                                            [RFC1557]
256  johab                                  [KS X 1001:1998, Annex 3]
257  ksc5601-raw                              { KSC5601 without CES }
258  ----------------------------------------------------------------
259
260  [1] ks_c_5601-1987, (x-)?windows-949, and uhc are aliased to this.
261  See below.
262
263=item Encode::TW -- Taiwan
264
265  Standard      DOS/Win Macintosh                Comment/Reference
266  ----------------------------------------------------------------
267  big5-eten     cp950   MacChineseTrad {big5 aliased to big5-eten}
268  big5-hkscs
269  ----------------------------------------------------------------
270
271=item Encode::HanExtra -- More Chinese via CPAN
272
Due to size concerns, the additional Chinese encodings below are
distributed separately on CPAN, under the name Encode::HanExtra.
275
276  Standard      DOS/Win Macintosh                Comment/Reference
277  ----------------------------------------------------------------
278  big5ext                                   CMEX's Big5e Extension
279  big5plus                                  CMEX's Big5+ Extension
280  cccii         Chinese Character Code for Information Interchange
281  euc-tw                             EUC (Extended Unix Character)
282  gb18030                          GBK with Traditional Characters
283  ----------------------------------------------------------------
284
285=item Encode::JIS2K -- JIS X 0213 encodings via CPAN
286
287Due to size concerns, additional Japanese encodings below are
288distributed separately on CPAN, under the name Encode::JIS2K.
289
290  Standard      DOS/Win Macintosh                Comment/Reference
291  ----------------------------------------------------------------
292  euc-jisx0213
  shiftjisx0213
294  iso-2022-jp-3
295  jis0213-1-raw
296  jis0213-2-raw
297  ----------------------------------------------------------------
298
299=back
300
301=head2 Miscellaneous encodings
302
303=over 2
304
305=item Encode::EBCDIC
306
307See L<perlebcdic> for details.
308
309  ----------------------------------------------------------------
310  cp37
311  cp500
312  cp875
313  cp1026
314  cp1047
315  posix-bc
316  ----------------------------------------------------------------
317
=item Encode::Symbol

For symbols and dingbats.
321
322  ----------------------------------------------------------------
323  symbol
324  dingbats
325  MacDingbats
326  AdobeZdingbat
327  AdobeSymbol
328  ----------------------------------------------------------------
329
330=item Encode::MIME::Header
331
Strictly speaking, the MIME header encoding documented in RFC 2047 is
more of an encapsulation than an encoding.  However, supporting it is
imperative in the modern world, so it is supported.
335
336  ----------------------------------------------------------------
337  MIME-Header                                            [RFC2047]
338  MIME-B                                                 [RFC2047]
339  MIME-Q                                                 [RFC2047]
340  ----------------------------------------------------------------
341
342=item Encode::Guess
343
This is not the name of an encoding but a utility that lets you pick
the most appropriate encoding for some data out of the given
I<suspects>.  See L<Encode::Guess> for details.
347
348=back
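
A minimal sketch of two of the miscellaneous modules above in use:
decoding an RFC 2047 header and guessing the encoding of some octets.
The header value and the sample data are invented for illustration, and
only the default suspects of Encode::Guess are relied upon.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Encode::Guess;    # exports guess_encoding()

# MIME-Header unwraps RFC 2047 encoded-words into a character string.
my $subject = decode("MIME-Header", "=?ISO-8859-1?Q?caf=E9?=");

# guess_encoding() returns an encoding object on success and an
# error message (a plain string) on failure.  ascii and utf8 are
# among the default suspects; more can be passed as extra arguments.
my $data  = "caf\xC3\xA9";    # UTF-8 octets
my $guess = guess_encoding($data);
my $name  = ref($guess) ? $guess->name : "unknown ($guess)";

print "guessed: $name\n";
```

Adding more suspects makes ambiguous results more likely, so it is
worth passing only the encodings that can plausibly occur in the data.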
349
350=head1 Unsupported encodings
351
352The following encodings are not supported as yet; some because they
353are rarely used, some because of technical difficulties.  They may
354be supported by external modules via CPAN in the future, however.
355
356=over 2
357
358=item   ISO-2022-JP-2 [RFC1554]
359
Not very popular yet.  Needs the Unicode Database or an equivalent to
implement encode(), because it includes JIS X 0208/0212, KSC5601, and
GB2312 simultaneously, whose code points in Unicode overlap.  So you
need to look up the database to determine which character set a given
Unicode character should belong to.
365
366=item ISO-2022-CN [RFC1922]
367
368Not very popular.  Needs CNS 11643-1 and -2 which are not available in
369this module.  CNS 11643 is supported (via euc-tw) in Encode::HanExtra.
Audrey Tang may add support for this encoding in her module in the future.
371
372=item Various HP-UX encodings
373
374The following are unsupported due to the lack of mapping data.
375
376  '8'  - arabic8, greek8, hebrew8, kana8, thai8, and turkish8
377  '15' - japanese15, korean15, and roi15
378
379=item Cyrillic encoding ISO-IR-111
380
381Anton Tagunov doubts its usefulness.
382
=item ISO-8859-8-I [Hebrew]
384
None of the Encode team knows Hebrew well enough (ISO-8859-8, cp1255 and
MacHebrew are supported only because mappings were
available at L<http://www.unicode.org/>).  Contributions welcome.
388
389=item ISIRI 3342, Iran System, ISIRI 2900 [Farsi]
390
391Ditto.
392
=item Vietnamese encoding TCVN
394
395Ditto.
396
397=item Vietnamese encodings VPS
398
399Though Jungshik Shin has reported that Mozilla supports this encoding,
400it was too late before 5.8.0 for us to add it.  In the future, it
401may be available via a separate module.  See
402L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf>
403and
404L<http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut>
405if you are interested in helping us.
406
407=item Various Mac encodings
408
409The following are unsupported due to the lack of mapping data.
410
411  MacArmenian,  MacBengali,   MacBurmese,   MacEthiopic
412  MacExtArabic, MacGeorgian,  MacKannada,   MacKhmer
413  MacLaotian,   MacMalayalam, MacMongolian, MacOriya
414  MacSinhalese, MacTamil,     MacTelugu,    MacTibetan
415  MacVietnamese
416
417The rest which are already available are based upon the vendor mappings
418at L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/> .
419
420=item (Mac) Indic encodings
421
422The maps for the following are available at L<http://www.unicode.org/>
but remain unsupported because those encodings need an algorithmic
approach, currently unsupported by F<enc2xs>:
425
426  MacDevanagari
427  MacGurmukhi
428  MacGujarati
429
430For details, please see C<Unicode mapping issues and notes:> at
431L<http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT> .
432
I believe this issue is prevalent not only for Mac Indics but also in
other Indic encodings, but the above were the only Indic encoding
maps that I could find at L<http://www.unicode.org/> .
436
437=back
438
439=head1 Encoding vs. Charset -- terminology
440
We are used to using the terms (character) I<encoding> and I<character
set> interchangeably.  But just as confusing the terms byte and
443character is dangerous and the terms should be differentiated when
444needed, we need to differentiate I<encoding> and I<character set>.
445
446To understand that, here is a description of how we make computers
447grok our characters.
448
449=over 2
450
451=item *
452
453First we start with which characters to include.  We call this
454collection of characters I<character repertoire>.
455
456=item *
457
458Then we have to give each character a unique ID so your computer can
459tell the difference between 'a' and 'A'.  This itemized character
460repertoire is now a I<character set>.
461
462=item *
463
If your computer can grok the character set without further
465processing, you can go ahead and use it.  This is called a I<coded
466character set> (CCS) or I<raw character encoding>.  ASCII is used this
467way for most cases.
468
469=item *
470
471But in many cases, especially multi-byte CJK encodings, you have to
472tweak a little more.  Your network connection may not accept any data
473with the Most Significant Bit set, and your computer may not be able to
474tell if a given byte is a whole character or just half of it.  So you
475have to I<encode> the character set to use it.
476
477A I<character encoding scheme> (CES) determines how to encode a given
478character set, or a set of multiple character sets.  7bit ISO-2022 is
479an example of a CES.  You switch between character sets via I<escape
480sequences>.
481
482=back
483
484Technically, or mathematically, speaking, a character set encoded in
485such a CES that maps character by character may form a CCS.  EUC is such
486an example.  The CES of EUC is as follows:
487
488=over 2
489
490=item *
491
492Map ASCII unchanged.
493
494=item *
495
Map a character set that consists of 94 or 96 to the power of N
members by adding 0x80 to each byte.
498
499=item *
500
You can also use 0x8e and 0x8f to indicate that the following sequence of
characters belongs to yet another character set.  0x80 is added to each
of the following bytes.
504
505=back
506
By carefully looking at the encoded byte sequence, you can find that
each byte sequence corresponds to a unique number.  In that sense, EUC
is a CCS generated by the CES above from up to four CCSs (complicated?).
UTF-8 falls into this category.  See L<perlunicode/"UTF-8"> to find out
how UTF-8 maps Unicode to a byte sequence.
512
513You may also have found out by now why 7bit ISO-2022 cannot comprise
514a CCS.  If you look at a byte sequence \x21\x21, you can't tell if
515it is two !'s or IDEOGRAPHIC SPACE.  EUC maps the latter to \xA1\xA1
so you have no trouble differentiating between "!!" and S<"  ">.
517
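The ambiguity described above can be observed directly.  The sketch
below encodes IDEOGRAPHIC SPACE in EUC-JP and in 7-bit ISO-2022-JP and
compares both with two exclamation marks.  The escape-stripping regex
is a simplification written for this example only.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $space = "\x{3000}";    # IDEOGRAPHIC SPACE, JIS X 0208 code 0x2121

my $euc = encode("euc-jp",      $space);    # A1 A1: each byte is 0x21 + 0x80
my $jis = encode("iso-2022-jp", $space);    # 21 21 wrapped in escape sequences
my $two = encode("euc-jp",      "!!");      # 21 21: plain ASCII, unchanged

# Without its escape sequences, the 7-bit form looks exactly like "!!".
(my $bare = $jis) =~ s/\e[\$(][A-Z@]//g;

for my $octets ($euc, $jis, $two) {
    print join(" ", map { sprintf "%02X", ord($_) } split //, $octets), "\n";
}
```
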
518=head1 Encoding Classification (by Anton Tagunov and Dan Kogai)
519
520This section tries to classify the supported encodings by their
521applicability for information exchange over the Internet and to
522choose the most suitable aliases to name them in the context of
523such communication.
524
525=over 2
526
527=item *
528
529To (en|de)code encodings marked by C<(**)>, you need
530C<Encode::HanExtra>, available from CPAN.
531
532=back
533
534Encoding names
535
536  US-ASCII    UTF-8    ISO-8859-*  KOI8-R
537  Shift_JIS   EUC-JP   ISO-2022-JP ISO-2022-JP-1
538  EUC-KR      Big5     GB2312
539
540are registered with IANA as preferred MIME names and may
541be used over the Internet.
542
C<Shift_JIS> has been made official by JIS X 0208:1997.
544L<Microsoft-related naming mess> gives details.
545
546C<GB2312> is the IANA name for C<EUC-CN>.
547See L<Microsoft-related naming mess> for details.
548
549C<GB_2312-80> I<raw> encoding is available as C<gb2312-raw>
550with Encode. See L<Encode::CN> for details.
551
552  EUC-CN
553  KOI8-U        [RFC2319]
554
555have not been registered with IANA (as of March 2002) but
556seem to be supported by major web browsers.
557The IANA name for C<EUC-CN> is C<GB2312>.
558
559  KS_C_5601-1987
560
561is heavily misused.
562See L<Microsoft-related naming mess> for details.
563
C<KS_C_5601-1987> I<raw> encoding is available as C<ksc5601-raw>
565with Encode. See L<Encode::KR> for details.
566
567  UTF-16 UTF-16BE UTF-16LE
568
569are IANA-registered C<charset>s. See [RFC 2781] for details.
570Jungshik Shin reports that UTF-16 with a BOM is well accepted
571by MS IE 5/6 and NS 4/6. Beware however that
572
573=over 2
574
575=item *
576
577C<UTF-16> support in any software you're going to be
578using/interoperating with has probably been less tested
than C<UTF-8> support
580
581=item *
582
583C<UTF-8> coded data seamlessly passes traditional
584command piping (C<cat>, C<more>, etc.) while C<UTF-16> coded
585data is likely to cause confusion (with its zero bytes,
586for example)
587
588=item *
589
590it is beyond the power of words to describe the way HTML browsers
591encode non-C<ASCII> form data. To get a general impression, visit
592L<http://www.alanflavell.org.uk/charset/form-i18n.html>.
593While encoding of form data has stabilized for C<UTF-8> encoded pages
594(at least IE 5/6, NS 6, and Opera 6 behave consistently), be sure to
595expect fun (and cross-browser discrepancies) with C<UTF-16> encoded
596pages!
597
598=back
599
600The rule of thumb is to use C<UTF-8> unless you know what
601you're doing and unless you really benefit from using C<UTF-16>.
602
603  ISO-IR-165    [RFC1345]
604  VISCII
605  GB 12345
606  GB 18030 (**)  (see links below)
607  EUC-TW   (**)
608
609are totally valid encodings but not registered at IANA.
610The names under which they are listed here are probably the
611most widely-known names for these encodings and are recommended
612names.
613
614  BIG5PLUS (**)
615
616is a proprietary name.
617
618=head2 Microsoft-related naming mess
619
620Microsoft products misuse the following names:
621
622=over 2
623
624=item KS_C_5601-1987
625
626Microsoft extension to C<EUC-KR>.
627
628Proper names: C<CP949>, C<UHC>, C<x-windows-949> (as used by Mozilla).
629
630See L<http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html>
631for details.
632
Encode aliases C<KS_C_5601-1987> to C<cp949> to reflect this common
misuse. I<Raw> C<KS_C_5601-1987> encoding is available as
C<ksc5601-raw>.
636
637See L<Encode::KR> for details.
638
639=item GB2312
640
641Microsoft extension to C<EUC-CN>.
642
643Proper names: C<CP936>, C<GBK>.
644
645C<GB2312> has been registered in the C<EUC-CN> meaning at
646IANA. This has partially repaired the situation: Microsoft's
647C<GB2312> has become a superset of the official C<GB2312>.
648
649Encode aliases C<GB2312> to C<euc-cn> in full agreement with
650IANA registration. C<cp936> is supported separately.
651I<Raw> C<GB_2312-80> encoding is available as C<gb2312-raw>.
652
653See L<Encode::CN> for details.
654
655=item Big5
656
657Microsoft extension to C<Big5>.
658
659Proper name: C<CP950>.
660
661Encode separately supports C<Big5> and C<cp950>.
662
663=item Shift_JIS
664
665Microsoft's understanding of C<Shift_JIS>.
666
JIS has not endorsed the full Microsoft standard, however.
668The official C<Shift_JIS> includes only JIS X 0201 and JIS X 0208
669character sets, while Microsoft has always used C<Shift_JIS>
670to encode a wider character repertoire. See C<IANA> registration for
671C<Windows-31J>.
672
673As a historical predecessor, Microsoft's variant
674probably has more rights for the name, though it may be objected
675that Microsoft shouldn't have used JIS as part of the name
676in the first place.
677
678Unambiguous name: C<CP932>. C<IANA> name (also used by Mozilla, and
679provided as an alias by Encode): C<Windows-31J>.
680
681Encode separately supports C<Shift_JIS> and C<cp932>.
682
683=back
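
The aliasing decisions described in this section can be checked with
C<Encode::resolve_alias>.  The sketch below lists the names discussed
above next to the canonical names they are expected to resolve to,
assuming the CJK modules bundled with Encode are installed.

```perl
use strict;
use warnings;
use Encode ();

# Each IANA or vendor name on the left is expected to resolve to the
# Encode canonical name on the right.
my %expected = (
    "GB2312"         => "euc-cn",
    "KS_C_5601-1987" => "cp949",
    "Big5"           => "big5-eten",
    "Windows-31J"    => "cp932",
    "Shift_JIS"      => "shiftjis",
);

# scalar() keeps the name/value pairs aligned if a lookup fails.
my %resolved = map { $_ => scalar Encode::resolve_alias($_) } keys %expected;

printf "%-15s => %s\n", $_, $resolved{$_} // "(not available)"
    for sort keys %resolved;
```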
684
685=head1 Glossary
686
687=over 2
688
689=item character repertoire
690
691A collection of unique characters.  A I<character> set in the strictest
692sense. At this stage, characters are not numbered.
693
694=item coded character set (CCS)
695
696A character set that is mapped in a way computers can use directly.
697Many character encodings, including EUC, fall in this category.
698
699=item character encoding scheme (CES)
700
An algorithm to map a character set to a byte sequence.  You don't
have to be able to tell which character set a given byte sequence
belongs to.  7-bit ISO-2022 is a CES but it cannot be a CCS.  EUC is an
example of being both a CCS and a CES.
705
706=item charset (in MIME context)
707
708has long been used in the meaning of C<encoding>, CES.
709
710While the word combination C<character set> has lost this meaning
711in MIME context since [RFC 2130], the C<charset> abbreviation has
712retained it. This is how [RFC 2277] and [RFC 2278] bless C<charset>:
713
714 This document uses the term "charset" to mean a set of rules for
715 mapping from a sequence of octets to a sequence of characters, such
716 as the combination of a coded character set and a character encoding
717 scheme; this is also what is used as an identifier in MIME "charset="
718 parameters, and registered in the IANA charset registry ...  (Note
719 that this is NOT a term used by other standards bodies, such as ISO).
720 [RFC 2277]
721
722=item EUC
723
724Extended Unix Character.  See ISO-2022.
725
726=item ISO-2022
727
A CES that was carefully designed to coexist with ASCII.  There are a
7-bit version and an 8-bit version.

The 7-bit version switches character sets via escape sequences, so it
cannot form a CCS.  Since this is more difficult to handle in programs
than the 8-bit version, the 7-bit version is not very popular except for
iso-2022-jp, the I<de facto> standard CES for e-mails.

The 8-bit version can form a CCS.  EUC and ISO-8859 are two examples
thereof.  Pre-5.6 perl could use them as string literals.
738
739=item UCS
740
741Short for I<Universal Character Set>.  When you say just UCS, it means
742I<Unicode>.
743
744=item UCS-2
745
746ISO/IEC 10646 encoding form: Universal Character Set coded in two
747octets.
748
749=item Unicode
750
751A character set that aims to include all character repertoires of the
752world.  Many character sets in various national as well as industrial
753standards have become, in a way, just subsets of Unicode.
754
755=item UTF
756
757Short for I<Unicode Transformation Format>.  Determines how to map a
758Unicode character into a byte sequence.
759
760=item UTF-16
761
762A UTF in 16-bit encoding.  Can either be in big endian or little
763endian.  The big endian version is called UTF-16BE (equal to UCS-2 +
764surrogate support) and the little endian version is called UTF-16LE.
765
766=back
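
To illustrate the UTF-16 entry in the glossary above, the following
sketch compares UCS-2BE and UTF-16BE for a character inside the Basic
Multilingual Plane and shows the surrogate pair that UTF-16BE produces
for a character outside it.  The two sample characters are arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $bmp    = "\x{3042}";     # HIRAGANA LETTER A, inside the BMP
my $astral = "\x{10000}";    # LINEAR B SYLLABLE B008 A, outside the BMP

my $bmp_utf16 = encode("UTF-16BE", $bmp);       # 30 42
my $bmp_ucs2  = encode("UCS-2BE",  $bmp);       # 30 42, identical to UTF-16BE
my $pair      = encode("UTF-16BE", $astral);    # D8 00 DC 00, a surrogate pair

for my $octets ($bmp_utf16, $bmp_ucs2, $pair) {
    print join(" ", map { sprintf "%02X", ord($_) } split //, $octets), "\n";
}
```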
767
768=head1 See Also
769
770L<Encode>,
771L<Encode::Byte>,
772L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>,
L<Encode::EBCDIC>, L<Encode::Symbol>,
774L<Encode::MIME::Header>, L<Encode::Guess>
775
776=head1 References
777
778=over 2
779
780=item ECMA
781
782European Computer Manufacturers Association
783L<http://www.ecma.ch>
784
785=over 2
786
787=item ECMA-035 (eq C<ISO-2022>)
788
789L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM>
790
791The specification of ISO-2022 is available from the link above.
792
793=back
794
795=item IANA
796
797Internet Assigned Numbers Authority
798L<http://www.iana.org/>
799
800=over 2
801
802=item Assigned Charset Names by IANA
803
804L<http://www.iana.org/assignments/character-sets>
805
806Most of the C<canonical names> in Encode derive from this list
807so you can directly apply the string you have extracted from MIME
808header of mails and web pages.
809
810=back
811
812=item ISO
813
814International Organization for Standardization
815L<http://www.iso.ch/>
816
817=item RFC
818
819Request For Comments -- need I say more?
820L<http://www.rfc-editor.org/>, L<http://www.ietf.org/rfc.html>,
821L<http://www.faqs.org/rfcs/>
822
823=item UC
824
825Unicode Consortium
826L<http://www.unicode.org/>
827
828=over 2
829
830=item Unicode Glossary
831
832L<http://www.unicode.org/glossary/>
833
834The glossary of this document is based upon this site.
835
836=back
837
838=back
839
840=head2 Other Notable Sites
841
842=over 2
843
844=item czyborra.com
845
846L<http://czyborra.com/>
847
848Contains a lot of useful information, especially gory details of ISO
849vs. vendor mappings.
850
851=item CJK.inf
852
853L<http://examples.oreilly.com/cjkvinfo/doc/cjk.inf>
854
855Somewhat obsolete (last update in 1996), but still useful.  Also try
856
857L<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>
858
859You will find brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030>.
860
861=item Jungshik Shin's Hangul FAQ
862
863L<http://jshin.net/faq>
864
865And especially its subject 8.
866
867L<http://jshin.net/faq/qa8.html>
868
869A comprehensive overview of the Korean (C<KS *>) standards.
870
871=item debian.org: "Introduction to i18n"
872
873A brief description for most of the mentioned CJK encodings is
874contained in
875L<http://www.debian.org/doc/manuals/intro-i18n/ch-codes.en.html>
876
877=back
878
879=head2 Offline sources
880
881=over 2
882
883=item C<CJKV Information Processing> by Ken Lunde
884
885CJKV Information Processing
8861999 O'Reilly & Associates, ISBN : 1-56592-224-7
887
888The modern successor of C<CJK.inf>.
889
890Features a comprehensive coverage of CJKV character sets and
891encodings along with many other issues faced by anyone trying
892to better support CJKV languages/scripts in all the areas of
893information processing.
894
895To purchase this book, visit
896L<http://oreilly.com/catalog/9780596514471/>
897or your favourite bookstore.
898
899=back
900
901=cut
902