/*    utfebcdic.h
 *
 *    Copyright (C) 2001, 2002, 2003, 2005, 2006, 2007, 2009,
 *    2010, 2011 by Larry Wall, Nick Ing-Simmons, and others
 *
 *    You may distribute under the terms of either the GNU General Public
 *    License or the Artistic License, as specified in the README file.
 *
 * Macros to implement UTF-EBCDIC as perl's internal encoding
 * Adapted from version 7.1 of Unicode Technical Report #16:
 *  http://www.unicode.org/reports/tr16
 *
 * To summarize, the way it works is:
 * To convert an EBCDIC code point to UTF-EBCDIC:
 *  1)  convert to Unicode.  No conversion is necessary for code points above
 *      255, as Unicode and EBCDIC are identical in this range.  For smaller
 *      code points, the conversion is done by lookup in the PL_e2a table (with
 *      inverse PL_a2e) in the generated file 'ebcdic_tables.h'.  The 'a'
 *      stands for ASCII platform, meaning 0-255 Unicode.  Use
 *      NATIVE_TO_LATIN1() and LATIN1_TO_NATIVE(), respectively, to perform
 *      this lookup.  NATIVE_TO_UNI() and UNI_TO_NATIVE() are similarly used
 *      for any input, and know to avoid the lookup for inputs above 255.
 *  2)  convert that to a utf8-like string called I8 ('I' stands for
 *      intermediate) with variant characters occupying multiple bytes.  This
 *      step is similar to the utf8-creating step from Unicode, but the details
 *      are different.  This transformation is called UTF8-Mod.  There is a
 *      chart about the bit patterns in a comment later in this file.  But
 *      essentially here are the differences:
 *                          UTF8                I8
 *      invariant byte      starts with 0       starts with 0 or 100
 *      continuation byte   starts with 10      starts with 101
 *      start byte          same in both: if the code point requires N bytes,
 *                          then the leading N bits are 1, followed by a 0.  If
 *                          all 8 bits in the first byte are 1, the code point
 *                          will occupy 14 bytes (compared to 13 in Perl's
 *                          extended UTF-8).
 *                          This is incompatible with what tr16 implies should
 *                          be the representation of code points 2**30 and
 *                          above, but allows Perl to be able to represent all
 *                          code points that fit in a 64-bit word in either our
 *                          extended UTF-EBCDIC or UTF-8.
 *  3)  Use the algorithm in tr16 to convert each byte from step 2 into
 *      final UTF-EBCDIC.  This is done by table lookup from a table
 *      constructed from the algorithm, reproduced in ebcdic_tables.h as
 *      PL_utf2e, with its inverse being PL_e2utf.  They are constructed so that
 *      all EBCDIC invariants remain invariant, but no others do, and the first
 *      byte of a variant will always have its upper bit set.  But note that
 *      the upper bit of some invariants is also 1.  The table also is designed
 *      so that lexically comparing two UTF-EBCDIC-variant characters yields
 *      the Unicode code point order.  (To get native code point order, one has
 *      to convert the latin1-range characters to their native code point
 *      value.)  The macros NATIVE_UTF8_TO_I8() and I8_TO_NATIVE_UTF8() do the
 *      table lookups.
 *
 * For example, the ordinal value of 'A' is 193 in EBCDIC, and also is 193 in
 * UTF-EBCDIC.  Step 1) converts it to 65, Step 2 leaves it at 65, and Step 3
 * converts it back to 193.  As an example of how a variant character works,
 * take LATIN SMALL LETTER Y WITH DIAERESIS, which is typically 0xDF in
 * EBCDIC.  Step 1 converts it to the Unicode value, 0xFF.  Step 2 converts
 * that to two bytes = 11000111 10111111 = C7 BF, and Step 3 converts those to
 * 0x8B 0x73.
 *
 * If you're starting from Unicode, skip step 1.  For UTF-EBCDIC to straight
 * EBCDIC, reverse the steps.
 *
 * The EBCDIC invariants have been chosen to be those characters whose Unicode
 * equivalents have ordinal numbers less than 160, that is the same characters
 * that are expressible in ASCII, plus the C1 controls.  So there are 160
 * invariants instead of the 128 in UTF-8.
 *
 * The purpose of Step 3 is to make the encoding be invariant for the chosen
 * characters.  This messes up the convenient patterns found in step 2, so
 * generally, one has to undo step 3 into a temporary to use them, using the
 * macro NATIVE_UTF8_TO_I8().  However, one "shadow", or parallel table,
 * PL_utf8skip, has been constructed that doesn't require undoing things.  It
 * is such that for each byte, it says how long the sequence is if that
 * (UTF-EBCDIC) byte were to begin it.
 *
 * There are actually 3 slightly different UTF-EBCDIC encodings in
 * ebcdic_tables.h, one for each of the code pages recognized by Perl.  That
 * means that there are actually three different sets of tables, one for each
 * code page.  (If Perl is compiled on platforms using another EBCDIC code
 * page, it may not compile, or Perl may silently mistake it for one of the
 * three.)
 *
 * Note that tr16 actually only specifies one version of UTF-EBCDIC, based on
 * the 1047 encoding, and which is supposed to be used for all code pages.
 * But this doesn't work.  To illustrate the problem, consider the '^'
 * character.  On a 037 code page it is the single byte 176, whereas under
 * 1047 UTF-EBCDIC it is the single byte 95.  If Perl implemented tr16
 * exactly, it would mean that changing a string containing '^' to UTF-EBCDIC
 * would change that '^' from 176 to 95 (and vice-versa), violating the rule
 * that ASCII-range characters are the same in UTF-8 or not.  Much code in
 * Perl assumes this rule.  See for example
 * http://grokbase.com/t/perl/mvs/025xf0yhmn/utf-ebcdic-for-posix-bc-malformed-utf-8-character
 * What Perl does is create a version of UTF-EBCDIC suited to each code page;
 * the one for the 1047 code page is identical to what's specified in tr16.
 * This complicates interchanging files between computers using different code
 * pages.
 * Best is to convert to I8 before sending them, as the I8 representation is
 * the same no matter what the underlying code page is.
 *
 * Because of the way UTF-EBCDIC is constructed, the lowest 32 code points that
 * are equivalent to neither ASCII characters nor C1 controls form the set of
 * continuation bytes; the remaining 64 non-ASCII, non-control code points form
 * the potential start bytes, in order.  (However, the first 5 of these lead to
 * malformed overlongs, so there really are only 59 start bytes, and the first
 * three of the 59 are the start bytes for the Latin1 range.)  Hence the
 * UTF-EBCDIC for the smallest variant code point, 0xA0, will likely have 0x41
 * as its continuation byte, provided 0x41 isn't an ASCII or C1 equivalent.
 * And its start byte will be the code point that is 37 (32+5) non-ASCII,
 * non-control code points past it.  (0 - 3F are controls, and 40 is SPACE,
 * leaving 41 as the first potentially available one.)  In contrast, on ASCII
 * platforms, the first 64 (not 32) non-ASCII code points are the continuation
 * bytes.  And the first 2 (not 5) potential start bytes form overlong
 * malformed sequences.
 *
 * EBCDIC characters above 0xFF are the same as Unicode in Perl's
 * implementation of all 3 encodings, so for those Step 1 is trivial.
 *
 * (Note that the entries for invariant characters are necessarily the same in
 * PL_e2a and PL_e2utf; likewise for their inverses.)
 *
 * UTF-EBCDIC strings are the same length or longer than UTF-8 representations
 * of the same string.  The maximum code point representable as 2 bytes in
 * UTF-EBCDIC is 0x3FF, instead of 0x7FF in UTF-8.
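 *
 * The I8 byte classes described in step 2 above can be sketched in C as
 * follows.  This is only an illustration under stated assumptions: the
 * helper name i8_seq_len is hypothetical and is not part of Perl, it works
 * on I8 bytes (not on native UTF-EBCDIC bytes), and it does not reject the
 * start bytes that can only begin overlong sequences.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical helper, not part of Perl: length of the I8 sequence that
// would begin with byte b, or 0 if b is a continuation byte.
static unsigned i8_seq_len(uint8_t b)
{
    unsigned n = 0;

    if (b < 0xA0)               // 0xxxxxxx or 100xxxxx: invariant
        return 1;
    if (b < 0xC0)               // 101xxxxx: continuation, cannot start
        return 0;
    if (b == 0xFF)              // all 8 bits set: the 14-byte extension
        return 14;
    while (b & 0x80) {          // count the leading 1 bits
        n++;
        b = (uint8_t) (b << 1);
    }
    return n;
}
```

 * For instance, the I8 start byte C7 from the worked example above has two
 * leading 1 bits, so the sketch reports a 2-byte sequence for it.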
 */

START_EXTERN_C

#include "ebcdic_tables.h"

END_EXTERN_C

/* EBCDIC-happy ways of converting native code to UTF-8 */

/* Use these when ch is known to be < 256 */
#define NATIVE_TO_LATIN1(ch)   (__ASSERT_(FITS_IN_8_BITS(ch)) PL_e2a[(U8)(ch)])
#define LATIN1_TO_NATIVE(ch)   (__ASSERT_(FITS_IN_8_BITS(ch)) PL_a2e[(U8)(ch)])

/* Use these on bytes */
#define NATIVE_UTF8_TO_I8(b)   (__ASSERT_(FITS_IN_8_BITS(b)) PL_e2utf[(U8)(b)])
#define I8_TO_NATIVE_UTF8(b)   (__ASSERT_(FITS_IN_8_BITS(b)) PL_utf2e[(U8)(b)])

/* Transforms in wide UV chars */
#define NATIVE_TO_UNI(ch)                                                   \
            (FITS_IN_8_BITS(ch) ? NATIVE_TO_LATIN1(ch) : (UV) (ch))
#define UNI_TO_NATIVE(ch)                                                   \
            (FITS_IN_8_BITS(ch) ? LATIN1_TO_NATIVE(ch) : (UV) (ch))
/*
  The following table is adapted from tr16, it shows the I8 encoding of Unicode code points.

Unicode                U32 Bit pattern                   1st Byte 2nd Byte 3rd Byte 4th Byte 5th Byte 6th Byte 7th Byte
U+0000..U+007F         000000000xxxxxxx                  0xxxxxxx
U+0080..U+009F         00000000100xxxxx                  100xxxxx
U+00A0..U+03FF         000000yyyyyxxxxx                  110yyyyy 101xxxxx
U+0400..U+3FFF         00zzzzyyyyyxxxxx                  1110zzzz 101yyyyy 101xxxxx
U+4000..U+3FFFF        0wwwzzzzzyyyyyxxxxx               11110www 101zzzzz 101yyyyy 101xxxxx
U+40000..U+3FFFFF      0vvwwwwwzzzzzyyyyyxxxxx           111110vv 101wwwww 101zzzzz 101yyyyy 101xxxxx
U+400000..U+3FFFFFF    0uvvvvvwwwwwzzzzzyyyyyxxxxx       1111110u 101vvvvv 101wwwww 101zzzzz 101yyyyy 101xxxxx
U+4000000..U+3FFFFFFF  00uuuuuvvvvvwwwwwzzzzzyyyyyxxxxx  11111110 101uuuuu 101vvvvv 101wwwww 101zzzzz 101yyyyy 101xxxxx

Beyond this, Perl uses an incompatible extension, similar to the one used in
regular UTF-8.  There are now 14 bytes.
A full 32 bits of information thus looks like this:

                                                         1st Byte 2nd-7th  8th Byte 9th Byte 10th B   11th B   12th B   13th B   14th B
U+40000000..U+FFFFFFFF ttuuuuuvvvvvwwwwwzzzzzyyyyyxxxxx  11111111 10100000 101000tt 101uuuuu 101vvvvv 101wwwww 101zzzzz 101yyyyy 101xxxxx

For 32-bit words, the 2nd through 7th bytes effectively function as leading
zeros.  Above 32 bits, these fill up, with each byte yielding 5 bits of
information, so that with 13 continuation bytes, we can handle 65 bits, just
above what a 64 bit word can hold.

The following table gives the I8:

I8 Code Points            1st Byte  2nd Byte  3rd       4th       5th       6th       7th       8th       9th-14th

U+0000..U+009F            00..9F
U+00A0..U+00FF          * C5..C7    A0..BF
U+0100..U+03FF            C8..DF    A0..BF
U+0400..U+3FFF          * E1..EF    A0..BF    A0..BF
U+4000..U+7FFF            F0      * B0..BF    A0..BF    A0..BF
U+8000..U+D7FF            F1        A0..B5    A0..BF    A0..BF
U+D800..U+DFFF            F1        B6..B7    A0..BF    A0..BF    (surrogates)
U+E000..U+FFFF            F1        B8..BF    A0..BF    A0..BF
U+10000..U+3FFFF          F2..F7    A0..BF    A0..BF    A0..BF
U+40000..U+FFFFF          F8      * A8..BF    A0..BF    A0..BF    A0..BF
U+100000..U+10FFFF        F9        A0..A1    A0..BF    A0..BF    A0..BF
Below are above-Unicode code points
U+110000..U+1FFFFF        F9        A2..BF    A0..BF    A0..BF    A0..BF
U+200000..U+3FFFFF        FA..FB    A0..BF    A0..BF    A0..BF    A0..BF
U+400000..U+1FFFFFF       FC      * A4..BF    A0..BF    A0..BF    A0..BF    A0..BF
U+2000000..U+3FFFFFF      FD        A0..BF    A0..BF    A0..BF    A0..BF    A0..BF
U+4000000..U+3FFFFFFF     FE      * A2..BF    A0..BF    A0..BF    A0..BF    A0..BF    A0..BF
U+40000000..              FF        A0..BF    A0..BF    A0..BF    A0..BF    A0..BF    A0..BF  * A1..BF    A0..BF

Note the gaps before several of the byte entries above marked by '*'.  These are
caused by legal UTF-8 avoiding non-shortest encodings: it is technically
possible to UTF-8-encode a single code point in different ways, but that is
explicitly forbidden, and the shortest possible encoding should always be used
(and that is what Perl does).
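
The bit-pattern chart above can be turned into code fairly directly.  What
follows is a minimal sketch in C, not Perl's actual implementation: the name
i8_encode is hypothetical, it stops at the 7-byte forms (code points up to
0x3FFFFFFF) and leaves out the 14-byte extension, and it produces I8 only, so
each output byte would still need I8_TO_NATIVE_UTF8() to become UTF-EBCDIC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper, not part of Perl: writes the I8 form of cp into out
// (which must hold at least 7 bytes) and returns the number of bytes written.
static size_t i8_encode(uint32_t cp, uint8_t *out)
{
    // Largest code point that fits in 1..7 bytes, per the chart.
    static const uint32_t max_cp[7] = {
        0x9F, 0x3FF, 0x3FFF, 0x3FFFF, 0x3FFFFF, 0x3FFFFFF, 0x3FFFFFFF
    };
    // Start-byte markers: N leading 1 bits followed by a 0 bit.
    static const uint8_t start_mark[7] = {
        0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC, 0xFE
    };
    size_t len = 1;
    size_t i;

    assert(cp <= 0x3FFFFFFF);   // the 14-byte extension is out of scope

    while (cp > max_cp[len - 1])
        len++;

    if (len == 1) {             // invariant: 0xxxxxxx or 100xxxxx
        out[0] = (uint8_t) cp;
        return 1;
    }

    // Fill continuation bytes from the end, 5 payload bits each (101xxxxx).
    for (i = len - 1; i > 0; i--) {
        out[i] = (uint8_t) (0xA0 | (cp & 0x1F));
        cp >>= 5;
    }
    // Whatever bits remain belong in the start byte.
    out[0] = (uint8_t) (start_mark[len - 1] | cp);
    return len;
}
```

Because the length is picked from the smallest range that holds the code
point, the sketch always emits the shortest form, which is why the start bytes
C0..C4 and E0 never appear in its output.  For U+00FF it yields C7 BF, the
same I8 bytes as in the worked example near the top of this file.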
*/

#define UTF_CONTINUATION_BYTE_INFO_BITS UTF_EBCDIC_CONTINUATION_BYTE_INFO_BITS

/* ^? is defined to be APC on EBCDIC systems, as specified in Unicode Technical
 * Report #16.  See the definition of toCTRL() for more */
#define QUESTION_MARK_CTRL   LATIN1_TO_NATIVE(0x9F)

/*
 * ex: set ts=8 sts=4 sw=4 et:
 */