1*ae771770SStanislav Sedov 2*ae771770SStanislav Sedov 3*ae771770SStanislav Sedov 4*ae771770SStanislav Sedov 5*ae771770SStanislav Sedov 6*ae771770SStanislav Sedov 7*ae771770SStanislav SedovNetwork Working Group A. Costello 8*ae771770SStanislav SedovRequest for Comments: 3492 Univ. of California, Berkeley 9*ae771770SStanislav SedovCategory: Standards Track March 2003 10*ae771770SStanislav Sedov 11*ae771770SStanislav Sedov 12*ae771770SStanislav Sedov Punycode: A Bootstring encoding of Unicode 13*ae771770SStanislav Sedov for Internationalized Domain Names in Applications (IDNA) 14*ae771770SStanislav Sedov 15*ae771770SStanislav SedovStatus of this Memo 16*ae771770SStanislav Sedov 17*ae771770SStanislav Sedov This document specifies an Internet standards track protocol for the 18*ae771770SStanislav Sedov Internet community, and requests discussion and suggestions for 19*ae771770SStanislav Sedov improvements. Please refer to the current edition of the "Internet 20*ae771770SStanislav Sedov Official Protocol Standards" (STD 1) for the standardization state 21*ae771770SStanislav Sedov and status of this protocol. Distribution of this memo is unlimited. 22*ae771770SStanislav Sedov 23*ae771770SStanislav SedovCopyright Notice 24*ae771770SStanislav Sedov 25*ae771770SStanislav Sedov Copyright (C) The Internet Society (2003). All Rights Reserved. 26*ae771770SStanislav Sedov 27*ae771770SStanislav SedovAbstract 28*ae771770SStanislav Sedov 29*ae771770SStanislav Sedov Punycode is a simple and efficient transfer encoding syntax designed 30*ae771770SStanislav Sedov for use with Internationalized Domain Names in Applications (IDNA). 31*ae771770SStanislav Sedov It uniquely and reversibly transforms a Unicode string into an ASCII 32*ae771770SStanislav Sedov string. ASCII characters in the Unicode string are represented 33*ae771770SStanislav Sedov literally, and non-ASCII characters are represented by ASCII 34*ae771770SStanislav Sedov characters that are allowed in host name labels (letters, digits, and 35*ae771770SStanislav Sedov hyphens). This document defines a general algorithm called 36*ae771770SStanislav Sedov Bootstring that allows a string of basic code points to uniquely 37*ae771770SStanislav Sedov represent any string of code points drawn from a larger set. 38*ae771770SStanislav Sedov Punycode is an instance of Bootstring that uses particular parameter 39*ae771770SStanislav Sedov values specified by this document, appropriate for IDNA. 40*ae771770SStanislav Sedov 41*ae771770SStanislav SedovTable of Contents 42*ae771770SStanislav Sedov 43*ae771770SStanislav Sedov 1. Introduction...............................................2 44*ae771770SStanislav Sedov 1.1 Features..............................................2 45*ae771770SStanislav Sedov 1.2 Interaction of protocol parts.........................3 46*ae771770SStanislav Sedov 2. Terminology................................................3 47*ae771770SStanislav Sedov 3. Bootstring description.....................................4 48*ae771770SStanislav Sedov 3.1 Basic code point segregation..........................4 49*ae771770SStanislav Sedov 3.2 Insertion unsort coding...............................4 50*ae771770SStanislav Sedov 3.3 Generalized variable-length integers..................5 51*ae771770SStanislav Sedov 3.4 Bias adaptation.......................................7 52*ae771770SStanislav Sedov 4. Bootstring parameters......................................8 53*ae771770SStanislav Sedov 5. Parameter values for Punycode..............................8 54*ae771770SStanislav Sedov 6. Bootstring algorithms......................................9 55*ae771770SStanislav Sedov 56*ae771770SStanislav Sedov 57*ae771770SStanislav Sedov 58*ae771770SStanislav SedovCostello Standards Track [Page 1] 59*ae771770SStanislav Sedov 60*ae771770SStanislav SedovRFC 3492 IDNA Punycode March 2003 61*ae771770SStanislav Sedov 62*ae771770SStanislav Sedov 63*ae771770SStanislav Sedov 6.1 Bias adaptation function.............................10 64*ae771770SStanislav Sedov 6.2 Decoding procedure...................................11 65*ae771770SStanislav Sedov 6.3 Encoding procedure...................................12 66*ae771770SStanislav Sedov 6.4 Overflow handling....................................13 67*ae771770SStanislav Sedov 7. Punycode examples.........................................14 68*ae771770SStanislav Sedov 7.1 Sample strings.......................................14 69*ae771770SStanislav Sedov 7.2 Decoding traces......................................17 70*ae771770SStanislav Sedov 7.3 Encoding traces......................................19 71*ae771770SStanislav Sedov 8. Security Considerations...................................20 72*ae771770SStanislav Sedov 9. References................................................21 73*ae771770SStanislav Sedov 9.1 Normative References.................................21 74*ae771770SStanislav Sedov 9.2 Informative References...............................21 75*ae771770SStanislav Sedov A. Mixed-case annotation.....................................22 76*ae771770SStanislav Sedov B. Disclaimer and license....................................22 77*ae771770SStanislav Sedov C. Punycode sample implementation............................23 78*ae771770SStanislav Sedov Author's Address.............................................34 79*ae771770SStanislav Sedov Full Copyright Statement.....................................35 80*ae771770SStanislav Sedov 81*ae771770SStanislav Sedov1. Introduction 82*ae771770SStanislav Sedov 83*ae771770SStanislav Sedov [IDNA] describes an architecture for supporting internationalized 84*ae771770SStanislav Sedov domain names. Labels containing non-ASCII characters can be 85*ae771770SStanislav Sedov represented by ACE labels, which begin with a special ACE prefix and 86*ae771770SStanislav Sedov contain only ASCII characters. The remainder of the label after the 87*ae771770SStanislav Sedov prefix is a Punycode encoding of a Unicode string satisfying certain 88*ae771770SStanislav Sedov constraints. For the details of the prefix and constraints, see 89*ae771770SStanislav Sedov [IDNA] and [NAMEPREP]. 90*ae771770SStanislav Sedov 91*ae771770SStanislav Sedov Punycode is an instance of a more general algorithm called 92*ae771770SStanislav Sedov Bootstring, which allows strings composed from a small set of "basic" 93*ae771770SStanislav Sedov code points to uniquely represent any string of code points drawn 94*ae771770SStanislav Sedov from a larger set. Punycode is Bootstring with particular parameter 95*ae771770SStanislav Sedov values appropriate for IDNA. 96*ae771770SStanislav Sedov 97*ae771770SStanislav Sedov1.1 Features 98*ae771770SStanislav Sedov 99*ae771770SStanislav Sedov Bootstring has been designed to have the following features: 100*ae771770SStanislav Sedov 101*ae771770SStanislav Sedov * Completeness: Every extended string (sequence of arbitrary code 102*ae771770SStanislav Sedov points) can be represented by a basic string (sequence of basic 103*ae771770SStanislav Sedov code points). Restrictions on what strings are allowed, and on 104*ae771770SStanislav Sedov length, can be imposed by higher layers. 105*ae771770SStanislav Sedov 106*ae771770SStanislav Sedov * Uniqueness: There is at most one basic string that represents a 107*ae771770SStanislav Sedov given extended string. 108*ae771770SStanislav Sedov 109*ae771770SStanislav Sedov * Reversibility: Any extended string mapped to a basic string can 110*ae771770SStanislav Sedov be recovered from that basic string. 111*ae771770SStanislav Sedov 112*ae771770SStanislav Sedov 113*ae771770SStanislav Sedov 114*ae771770SStanislav SedovCostello Standards Track [Page 2] 115*ae771770SStanislav Sedov 116*ae771770SStanislav SedovRFC 3492 IDNA Punycode March 2003 117*ae771770SStanislav Sedov 118*ae771770SStanislav Sedov 119*ae771770SStanislav Sedov * Efficient encoding: The ratio of basic string length to extended 120*ae771770SStanislav Sedov string length is small. This is important in the context of 121*ae771770SStanislav Sedov domain names because RFC 1034 [RFC1034] restricts the length of a 122*ae771770SStanislav Sedov domain label to 63 characters. 123*ae771770SStanislav Sedov 124*ae771770SStanislav Sedov * Simplicity: The encoding and decoding algorithms are reasonably 125*ae771770SStanislav Sedov simple to implement. The goals of efficiency and simplicity are 126*ae771770SStanislav Sedov at odds; Bootstring aims at a good balance between them. 127*ae771770SStanislav Sedov 128*ae771770SStanislav Sedov * Readability: Basic code points appearing in the extended string 129*ae771770SStanislav Sedov are represented as themselves in the basic string (although the 130*ae771770SStanislav Sedov main purpose is to improve efficiency, not readability). 131*ae771770SStanislav Sedov 132*ae771770SStanislav Sedov Punycode can also support an additional feature that is not used by 133*ae771770SStanislav Sedov the ToASCII and ToUnicode operations of [IDNA]. When extended 134*ae771770SStanislav Sedov strings are case-folded prior to encoding, the basic string can use 135*ae771770SStanislav Sedov mixed case to tell how to convert the folded string into a mixed-case 136*ae771770SStanislav Sedov string. See appendix A "Mixed-case annotation". 137*ae771770SStanislav Sedov 138*ae771770SStanislav Sedov1.2 Interaction of protocol parts 139*ae771770SStanislav Sedov 140*ae771770SStanislav Sedov Punycode is used by the IDNA protocol [IDNA] for converting domain 141*ae771770SStanislav Sedov labels into ASCII; it is not designed for any other purpose. It is 142*ae771770SStanislav Sedov explicitly not designed for processing arbitrary free text. 143*ae771770SStanislav Sedov 144*ae771770SStanislav Sedov2. Terminology 145*ae771770SStanislav Sedov 146*ae771770SStanislav Sedov The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 147*ae771770SStanislav Sedov "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 148*ae771770SStanislav Sedov document are to be interpreted as described in BCP 14, RFC 2119 149*ae771770SStanislav Sedov [RFC2119]. 150*ae771770SStanislav Sedov 151*ae771770SStanislav Sedov A code point is an integral value associated with a character in a 152*ae771770SStanislav Sedov coded character set. 153*ae771770SStanislav Sedov 154*ae771770SStanislav Sedov As in the Unicode Standard [UNICODE], Unicode code points are denoted 155*ae771770SStanislav Sedov by "U+" followed by four to six hexadecimal digits, while a range of 156*ae771770SStanislav Sedov code points is denoted by two hexadecimal numbers separated by "..", 157*ae771770SStanislav Sedov with no prefixes. 158*ae771770SStanislav Sedov 159*ae771770SStanislav Sedov The operators div and mod perform integer division; (x div y) is the 160*ae771770SStanislav Sedov quotient of x divided by y, discarding the remainder, and (x mod y) 161*ae771770SStanislav Sedov is the remainder, so (x div y) * y + (x mod y) == x. Bootstring uses 162*ae771770SStanislav Sedov these operators only with nonnegative operands, so the quotient and 163*ae771770SStanislav Sedov remainder are always nonnegative. 164*ae771770SStanislav Sedov 165*ae771770SStanislav Sedov The break statement jumps out of the innermost loop (as in C). 166*ae771770SStanislav Sedov 167*ae771770SStanislav Sedov 168*ae771770SStanislav Sedov 169*ae771770SStanislav Sedov 170*ae771770SStanislav SedovCostello Standards Track [Page 3] 171*ae771770SStanislav Sedov 172*ae771770SStanislav SedovRFC 3492 IDNA Punycode March 2003 173*ae771770SStanislav Sedov 174*ae771770SStanislav Sedov 175*ae771770SStanislav Sedov An overflow is an attempt to compute a value that exceeds the maximum 176*ae771770SStanislav Sedov value of an integer variable. 177*ae771770SStanislav Sedov 178*ae771770SStanislav Sedov3. Bootstring description 179*ae771770SStanislav Sedov 180*ae771770SStanislav Sedov Bootstring represents an arbitrary sequence of code points (the 181*ae771770SStanislav Sedov "extended string") as a sequence of basic code points (the "basic 182*ae771770SStanislav Sedov string"). This section describes the representation. Section 6 183*ae771770SStanislav Sedov "Bootstring algorithms" presents the algorithms as pseudocode. 184*ae771770SStanislav Sedov Sections 7.1 "Decoding traces" and 7.2 "Encoding traces" trace the 185*ae771770SStanislav Sedov algorithms for sample inputs. 186*ae771770SStanislav Sedov 187*ae771770SStanislav Sedov The following sections describe the four techniques used in 188*ae771770SStanislav Sedov Bootstring. "Basic code point segregation" is a very simple and 189*ae771770SStanislav Sedov efficient encoding for basic code points occurring in the extended 190*ae771770SStanislav Sedov string: they are simply copied all at once. "Insertion unsort 191*ae771770SStanislav Sedov coding" encodes the non-basic code points as deltas, and processes 192*ae771770SStanislav Sedov the code points in numerical order rather than in order of 193*ae771770SStanislav Sedov appearance, which typically results in smaller deltas. The deltas 194*ae771770SStanislav Sedov are represented as "generalized variable-length integers", which use 195*ae771770SStanislav Sedov basic code points to represent nonnegative integers. The parameters 196*ae771770SStanislav Sedov of this integer representation are dynamically adjusted using "bias 197*ae771770SStanislav Sedov adaptation", to improve efficiency when consecutive deltas have 198*ae771770SStanislav Sedov similar magnitudes. 199*ae771770SStanislav Sedov 200*ae771770SStanislav Sedov3.1 Basic code point segregation 201*ae771770SStanislav Sedov 202*ae771770SStanislav Sedov All basic code points appearing in the extended string are 203*ae771770SStanislav Sedov represented literally at the beginning of the basic string, in their 204*ae771770SStanislav Sedov original order, followed by a delimiter if (and only if) the number 205*ae771770SStanislav Sedov of basic code points is nonzero. The delimiter is a particular basic 206*ae771770SStanislav Sedov code point, which never appears in the remainder of the basic string. 207*ae771770SStanislav Sedov The decoder can therefore find the end of the literal portion (if 208*ae771770SStanislav Sedov there is one) by scanning for the last delimiter. 209*ae771770SStanislav Sedov 210*ae771770SStanislav Sedov3.2 Insertion unsort coding 211*ae771770SStanislav Sedov 212*ae771770SStanislav Sedov The remainder of the basic string (after the last delimiter if there 213*ae771770SStanislav Sedov is one) represents a sequence of nonnegative integral deltas as 214*ae771770SStanislav Sedov generalized variable-length integers, described in section 3.3. The 215*ae771770SStanislav Sedov meaning of the deltas is best understood in terms of the decoder. 216*ae771770SStanislav Sedov 217*ae771770SStanislav Sedov The decoder builds the extended string incrementally. Initially, the 218*ae771770SStanislav Sedov extended string is a copy of the literal portion of the basic string 219*ae771770SStanislav Sedov (excluding the last delimiter). The decoder inserts non-basic code 220*ae771770SStanislav Sedov points, one for each delta, into the extended string, ultimately 221*ae771770SStanislav Sedov arriving at the final decoded string. 222*ae771770SStanislav Sedov 223*ae771770SStanislav Sedov 224*ae771770SStanislav Sedov 225*ae771770SStanislav Sedov 226*ae771770SStanislav SedovCostello Standards Track [Page 4] 227*ae771770SStanislav Sedov 228*ae771770SStanislav SedovRFC 3492 IDNA Punycode March 2003 229*ae771770SStanislav Sedov 230*ae771770SStanislav Sedov 231*ae771770SStanislav Sedov At the heart of this process is a state machine with two state 232*ae771770SStanislav Sedov variables: an index i and a counter n. The index i refers to a 233*ae771770SStanislav Sedov position in the extended string; it ranges from 0 (the first 234*ae771770SStanislav Sedov position) to the current length of the extended string (which refers 235*ae771770SStanislav Sedov to a potential position beyond the current end). If the current 236*ae771770SStanislav Sedov state is <n,i>, the next state is <n,i+1> if i is less than the 237*ae771770SStanislav Sedov length of the extended string, or <n+1,0> if i equals the length of 238*ae771770SStanislav Sedov the extended string. In other words, each state change causes i to 239*ae771770SStanislav Sedov increment, wrapping around to zero if necessary, and n counts the 240*ae771770SStanislav Sedov number of wrap-arounds. 241*ae771770SStanislav Sedov 242*ae771770SStanislav Sedov Notice that the state always advances monotonically (there is no way 243*ae771770SStanislav Sedov for the decoder to return to an earlier state). At each state, an 244*ae771770SStanislav Sedov insertion is either performed or not performed. At most one 245*ae771770SStanislav Sedov insertion is performed in a given state. An insertion inserts the 246*ae771770SStanislav Sedov value of n at position i in the extended string. The deltas are a 247*ae771770SStanislav Sedov run-length encoding of this sequence of events: they are the lengths 248*ae771770SStanislav Sedov of the runs of non-insertion states preceeding the insertion states. 249*ae771770SStanislav Sedov Hence, for each delta, the decoder performs delta state changes, then 250*ae771770SStanislav Sedov an insertion, and then one more state change. (An implementation 251*ae771770SStanislav Sedov need not perform each state change individually, but can instead use 252*ae771770SStanislav Sedov division and remainder calculations to compute the next insertion 253*ae771770SStanislav Sedov state directly.) It is an error if the inserted code point is a 254*ae771770SStanislav Sedov basic code point (because basic code points were supposed to be 255*ae771770SStanislav Sedov segregated as described in section 3.1). 256*ae771770SStanislav Sedov 257*ae771770SStanislav Sedov The encoder's main task is to derive the sequence of deltas that will 258*ae771770SStanislav Sedov cause the decoder to construct the desired string. It can do this by 259*ae771770SStanislav Sedov repeatedly scanning the extended string for the next code point that 260*ae771770SStanislav Sedov the decoder would need to insert, and counting the number of state 261*ae771770SStanislav Sedov changes the decoder would need to perform, mindful of the fact that 262*ae771770SStanislav Sedov the decoder's extended string will include only those code points 263*ae771770SStanislav Sedov that have already been inserted. Section 6.3 "Encoding procedure" 264*ae771770SStanislav Sedov gives a precise algorithm. 265*ae771770SStanislav Sedov 266*ae771770SStanislav Sedov3.3 Generalized variable-length integers 267*ae771770SStanislav Sedov 268*ae771770SStanislav Sedov In a conventional integer representation the base is the number of 269*ae771770SStanislav Sedov distinct symbols for digits, whose values are 0 through base-1. Let 270*ae771770SStanislav Sedov digit_0 denote the least significant digit, digit_1 the next least 271*ae771770SStanislav Sedov significant, and so on. The value represented is the sum over j of 272*ae771770SStanislav Sedov digit_j * w(j), where w(j) = base^j is the weight (scale factor) for 273*ae771770SStanislav Sedov position j. For example, in the base 8 integer 437, the digits are 274*ae771770SStanislav Sedov 7, 3, and 4, and the weights are 1, 8, and 64, so the value is 7 + 275*ae771770SStanislav Sedov 3*8 + 4*64 = 287. This representation has two disadvantages: First, 276*ae771770SStanislav Sedov there are multiple encodings of each value (because there can be 277*ae771770SStanislav Sedov extra zeros in the most significant positions), which is inconvenient 278*ae771770SStanislav Sedov 279*ae771770SStanislav Sedov 280*ae771770SStanislav Sedov 281*ae771770SStanislav Sedov 282*ae771770SStanislav SedovCostello Standards Track [Page 5] 283*ae771770SStanislav Sedov 284*ae771770SStanislav SedovRFC 3492 IDNA Punycode March 2003 285*ae771770SStanislav Sedov 286*ae771770SStanislav Sedov 287*ae771770SStanislav Sedov when unique encodings are needed. Second, the integer is not self- 288*ae771770SStanislav Sedov delimiting, so if multiple integers are concatenated the boundaries 289*ae771770SStanislav Sedov between them are lost. 290*ae771770SStanislav Sedov 291*ae771770SStanislav Sedov The generalized variable-length representation solves these two 292*ae771770SStanislav Sedov problems. The digit values are still 0 through base-1, but now the 293*ae771770SStanislav Sedov integer is self-delimiting by means of thresholds t(j), each of which 294*ae771770SStanislav Sedov is in the range 0 through base-1. Exactly one digit, the most 295*ae771770SStanislav Sedov significant, satisfies digit_j < t(j). Therefore, if several 296*ae771770SStanislav Sedov integers are concatenated, it is easy to separate them, starting with 297*ae771770SStanislav Sedov the first if they are little-endian (least significant digit first), 298*ae771770SStanislav Sedov or starting with the last if they are big-endian (most significant 299*ae771770SStanislav Sedov digit first). As before, the value is the sum over j of digit_j * 300*ae771770SStanislav Sedov w(j), but the weights are different: 301*ae771770SStanislav Sedov 302*ae771770SStanislav Sedov w(0) = 1 303*ae771770SStanislav Sedov w(j) = w(j-1) * (base - t(j-1)) for j > 0 304*ae771770SStanislav Sedov 305*ae771770SStanislav Sedov For example, consider the little-endian sequence of base 8 digits 306*ae771770SStanislav Sedov 734251... Suppose the thresholds are 2, 3, 5, 5, 5, 5... This 307*ae771770SStanislav Sedov implies that the weights are 1, 1*(8-2) = 6, 6*(8-3) = 30, 30*(8-5) = 308*ae771770SStanislav Sedov 90, 90*(8-5) = 270, and so on. 7 is not less than 2, and 3 is not 309*ae771770SStanislav Sedov less than 3, but 4 is less than 5, so 4 is the last digit. The value 310*ae771770SStanislav Sedov of 734 is 7*1 + 3*6 + 4*30 = 145. The next integer is 251, with 311*ae771770SStanislav Sedov value 2*1 + 5*6 + 1*30 = 62. Decoding this representation is very 312*ae771770SStanislav Sedov similar to decoding a conventional integer: Start with a current 313*ae771770SStanislav Sedov value of N = 0 and a weight w = 1. Fetch the next digit d and 314*ae771770SStanislav Sedov increase N by d * w. If d is less than the current threshold (t) 315*ae771770SStanislav Sedov then stop, otherwise increase w by a factor of (base - t), update t 316*ae771770SStanislav Sedov for the next position, and repeat. 317*ae771770SStanislav Sedov 318*ae771770SStanislav Sedov Encoding this representation is similar to encoding a conventional 319*ae771770SStanislav Sedov integer: If N < t then output one digit for N and stop, otherwise 320*ae771770SStanislav Sedov output the digit for t + ((N - t) mod (base - t)), then replace N 321*ae771770SStanislav Sedov with (N - t) div (base - t), update t for the next position, and 322*ae771770SStanislav Sedov repeat. 323*ae771770SStanislav Sedov 324*ae771770SStanislav Sedov For any particular set of values of t(j), there is exactly one 325*ae771770SStanislav Sedov generalized variable-length representation of each nonnegative 326*ae771770SStanislav Sedov integral value. 327*ae771770SStanislav Sedov 328*ae771770SStanislav Sedov Bootstring uses little-endian ordering so that the deltas can be 329*ae771770SStanislav Sedov separated starting with the first. The t(j) values are defined in 330*ae771770SStanislav Sedov terms of the constants base, tmin, and tmax, and a state variable 331*ae771770SStanislav Sedov called bias: 332*ae771770SStanislav Sedov 333*ae771770SStanislav Sedov t(j) = base * (j + 1) - bias, 334*ae771770SStanislav Sedov clamped to the range tmin through tmax 335*ae771770SStanislav Sedov 336*ae771770SStanislav Sedov 337*ae771770SStanislav Sedov 338*ae771770SStanislav SedovCostello Standards Track [Page 6] 339*ae771770SStanislav Sedov 340*ae771770SStanislav SedovRFC 3492 IDNA Punycode March 2003 341*ae771770SStanislav Sedov 342*ae771770SStanislav Sedov 343*ae771770SStanislav Sedov The clamping means that if the formula yields a value less than tmin 344*ae771770SStanislav Sedov or greater than tmax, then t(j) = tmin or tmax, respectively. (In 345*ae771770SStanislav Sedov the pseudocode in section 6 "Bootstring algorithms", the expression 346*ae771770SStanislav Sedov base * (j + 1) is denoted by k for performance reasons.) These t(j) 347*ae771770SStanislav Sedov values cause the representation to favor integers within a particular 348*ae771770SStanislav Sedov range determined by the bias. 349*ae771770SStanislav Sedov 350*ae771770SStanislav Sedov3.4 Bias adaptation 351*ae771770SStanislav Sedov 352*ae771770SStanislav Sedov After each delta is encoded or decoded, bias is set for the next 353*ae771770SStanislav Sedov delta as follows: 354*ae771770SStanislav Sedov 355*ae771770SStanislav Sedov 1. Delta is scaled in order to avoid overflow in the next step: 356*ae771770SStanislav Sedov 357*ae771770SStanislav Sedov let delta = delta div 2 358*ae771770SStanislav Sedov 359*ae771770SStanislav Sedov But when this is the very first delta, the divisor is not 2, but 360*ae771770SStanislav Sedov instead a constant called damp. This compensates for the fact 361*ae771770SStanislav Sedov that the second delta is usually much smaller than the first. 362*ae771770SStanislav Sedov 363*ae771770SStanislav Sedov 2. Delta is increased to compensate for the fact that the next delta 364*ae771770SStanislav Sedov will be inserting into a longer string: 365*ae771770SStanislav Sedov 366*ae771770SStanislav Sedov let delta = delta + (delta div numpoints) 367*ae771770SStanislav Sedov 368*ae771770SStanislav Sedov numpoints is the total number of code points encoded/decoded so 369*ae771770SStanislav Sedov far (including the one corresponding to this delta itself, and 370*ae771770SStanislav Sedov including the basic code points). 371*ae771770SStanislav Sedov 372*ae771770SStanislav Sedov 3. Delta is repeatedly divided until it falls within a threshold, to 373*ae771770SStanislav Sedov predict the minimum number of digits needed to represent the next 374*ae771770SStanislav Sedov delta: 375*ae771770SStanislav Sedov 376*ae771770SStanislav Sedov while delta > ((base - tmin) * tmax) div 2 377*ae771770SStanislav Sedov do let delta = delta div (base - tmin) 378*ae771770SStanislav Sedov 379*ae771770SStanislav Sedov 4. The bias is set: 380*ae771770SStanislav Sedov 381*ae771770SStanislav Sedov let bias = 382*ae771770SStanislav Sedov (base * the number of divisions performed in step 3) + 383*ae771770SStanislav Sedov (((base - tmin + 1) * delta) div (delta + skew)) 384*ae771770SStanislav Sedov 385*ae771770SStanislav Sedov The motivation for this procedure is that the current delta 386*ae771770SStanislav Sedov provides a hint about the likely size of the next delta, and so 387*ae771770SStanislav Sedov t(j) is set to tmax for the more significant digits starting with 388*ae771770SStanislav Sedov the one expected to be last, tmin for the less significant digits 389*ae771770SStanislav Sedov up through the one expected to be third-last, and somewhere 390*ae771770SStanislav Sedov between tmin and tmax for the digit expected to be second-last 391*ae771770SStanislav Sedov 392*ae771770SStanislav Sedov 393*ae771770SStanislav Sedov 394*ae771770SStanislav SedovCostello Standards Track [Page 7] 395*ae771770SStanislav Sedov 396*ae771770SStanislav SedovRFC 3492 IDNA Punycode March 2003 397*ae771770SStanislav Sedov 398*ae771770SStanislav Sedov 399*ae771770SStanislav Sedov (balancing the hope of the expected-last digit being unnecessary 400*ae771770SStanislav Sedov against the danger of it being insufficient). 401*ae771770SStanislav Sedov 402*ae771770SStanislav Sedov4. Bootstring parameters 403*ae771770SStanislav Sedov 404*ae771770SStanislav Sedov Given a set of basic code points, one needs to be designated as the 405*ae771770SStanislav Sedov delimiter. The base cannot be greater than the number of 406*ae771770SStanislav Sedov distinguishable basic code points remaining. The digit-values in the 407*ae771770SStanislav Sedov range 0 through base-1 need to be associated with distinct non- 408*ae771770SStanislav Sedov delimiter basic code points. In some cases multiple code points need 409*ae771770SStanislav Sedov to have the same digit-value; for example, uppercase and lowercase 410*ae771770SStanislav Sedov versions of the same letter need to be equivalent if basic strings 411*ae771770SStanislav Sedov are case-insensitive. 412*ae771770SStanislav Sedov 413*ae771770SStanislav Sedov The initial value of n cannot be greater than the minimum non-basic 414*ae771770SStanislav Sedov code point that could appear in extended strings. 415*ae771770SStanislav Sedov 416*ae771770SStanislav Sedov The remaining five parameters (tmin, tmax, skew, damp, and the 417*ae771770SStanislav Sedov initial value of bias) need to satisfy the following constraints: 418*ae771770SStanislav Sedov 419*ae771770SStanislav Sedov 0 <= tmin <= tmax <= base-1 420*ae771770SStanislav Sedov skew >= 1 421*ae771770SStanislav Sedov damp >= 2 422*ae771770SStanislav Sedov initial_bias mod base <= base - tmin 423*ae771770SStanislav Sedov 424*ae771770SStanislav Sedov Provided the constraints are satisfied, these five parameters affect 425*ae771770SStanislav Sedov efficiency but not correctness. They are best chosen empirically. 426*ae771770SStanislav Sedov 427*ae771770SStanislav Sedov If support for mixed-case annotation is desired (see appendix A), 428*ae771770SStanislav Sedov make sure that the code points corresponding to 0 through tmax-1 all 429*ae771770SStanislav Sedov have both uppercase and lowercase forms. 430*ae771770SStanislav Sedov 431*ae771770SStanislav Sedov5. Parameter values for Punycode 432*ae771770SStanislav Sedov 433*ae771770SStanislav Sedov Punycode uses the following Bootstring parameter values: 434*ae771770SStanislav Sedov 435*ae771770SStanislav Sedov base = 36 436*ae771770SStanislav Sedov tmin = 1 437*ae771770SStanislav Sedov tmax = 26 438*ae771770SStanislav Sedov skew = 38 439*ae771770SStanislav Sedov damp = 700 440*ae771770SStanislav Sedov initial_bias = 72 441*ae771770SStanislav Sedov initial_n = 128 = 0x80 442*ae771770SStanislav Sedov 443*ae771770SStanislav Sedov Although the only restriction Punycode imposes on the input integers 444*ae771770SStanislav Sedov is that they be nonnegative, these parameters are especially designed 445*ae771770SStanislav Sedov to work well with Unicode [UNICODE] code points, which are integers 446*ae771770SStanislav Sedov in the range 0..10FFFF (but not D800..DFFF, which are reserved for 447*ae771770SStanislav Sedov 448*ae771770SStanislav Sedov 449*ae771770SStanislav Sedov 450*ae771770SStanislav SedovCostello Standards Track [Page 8] 451*ae771770SStanislav Sedov 452*ae771770SStanislav SedovRFC 3492 IDNA Punycode March 2003 453*ae771770SStanislav Sedov 454*ae771770SStanislav Sedov 455*ae771770SStanislav Sedov use by the UTF-16 encoding of Unicode). The basic code points are 456*ae771770SStanislav Sedov the ASCII [ASCII] code points (0..7F), of which U+002D (-) is the 457*ae771770SStanislav Sedov delimiter, and some of the others have digit-values as follows: 458*ae771770SStanislav Sedov 459*ae771770SStanislav Sedov code points digit-values 460*ae771770SStanislav Sedov ------------ ---------------------- 461*ae771770SStanislav Sedov 41..5A (A-Z) = 0 to 25, respectively 462*ae771770SStanislav Sedov 61..7A (a-z) = 0 to 25, respectively 463*ae771770SStanislav Sedov 30..39 (0-9) = 26 to 35, respectively 464*ae771770SStanislav Sedov 465*ae771770SStanislav Sedov Using hyphen-minus as the delimiter implies that the encoded string 466*ae771770SStanislav Sedov can end with a hyphen-minus only if the Unicode string consists 467*ae771770SStanislav Sedov entirely of basic code points, but IDNA forbids such strings from 468*ae771770SStanislav Sedov being encoded. The encoded string can begin with a hyphen-minus, but 469*ae771770SStanislav Sedov IDNA prepends a prefix. Therefore IDNA using Punycode conforms to 470*ae771770SStanislav Sedov the RFC 952 rule that host name labels neither begin nor end with a 471*ae771770SStanislav Sedov hyphen-minus [RFC952]. 472*ae771770SStanislav Sedov 473*ae771770SStanislav Sedov A decoder MUST recognize the letters in both uppercase and lowercase 474*ae771770SStanislav Sedov forms (including mixtures of both forms). An encoder SHOULD output 475*ae771770SStanislav Sedov only uppercase forms or only lowercase forms, unless it uses mixed- 476*ae771770SStanislav Sedov case annotation (see appendix A). 477*ae771770SStanislav Sedov 478*ae771770SStanislav Sedov Presumably most users will not manually write or type encoded strings 479*ae771770SStanislav Sedov (as opposed to cutting and pasting them), but those who do will need 480*ae771770SStanislav Sedov to be alert to the potential visual ambiguity between the following 481*ae771770SStanislav Sedov sets of characters: 482*ae771770SStanislav Sedov 483*ae771770SStanislav Sedov G 6 484*ae771770SStanislav Sedov I l 1 485*ae771770SStanislav Sedov O 0 486*ae771770SStanislav Sedov S 5 487*ae771770SStanislav Sedov U V 488*ae771770SStanislav Sedov Z 2 489*ae771770SStanislav Sedov 490*ae771770SStanislav Sedov Such ambiguities are usually resolved by context, but in a Punycode 491*ae771770SStanislav Sedov encoded string there is no context apparent to humans. 492*ae771770SStanislav Sedov 493*ae771770SStanislav Sedov6. Bootstring algorithms 494*ae771770SStanislav Sedov 495*ae771770SStanislav Sedov Some parts of the pseudocode can be omitted if the parameters satisfy 496*ae771770SStanislav Sedov certain conditions (for which Punycode qualifies). These parts are 497*ae771770SStanislav Sedov enclosed in {braces}, and notes immediately following the pseudocode 498*ae771770SStanislav Sedov explain the conditions under which they can be omitted. 499*ae771770SStanislav Sedov 500*ae771770SStanislav Sedov 501*ae771770SStanislav Sedov 502*ae771770SStanislav Sedov 503*ae771770SStanislav Sedov 504*ae771770SStanislav Sedov 505*ae771770SStanislav Sedov 506*ae771770SStanislav SedovCostello Standards Track [Page 9] 507*ae771770SStanislav Sedov 508*ae771770SStanislav SedovRFC 3492 IDNA Punycode March 2003 509*ae771770SStanislav Sedov 510*ae771770SStanislav Sedov 511*ae771770SStanislav Sedov Formally, code points are integers, and hence the pseudocode assumes 512*ae771770SStanislav Sedov that arithmetic operations can be performed directly on code points. 513*ae771770SStanislav Sedov In some programming languages, explicit conversion between code 514*ae771770SStanislav Sedov points and integers might be necessary. 515*ae771770SStanislav Sedov 516*ae771770SStanislav Sedov6.1 Bias adaptation function 517*ae771770SStanislav Sedov 518*ae771770SStanislav Sedov function adapt(delta,numpoints,firsttime): 519*ae771770SStanislav Sedov if firsttime then let delta = delta div damp 520*ae771770SStanislav Sedov else let delta = delta div 2 521*ae771770SStanislav Sedov let delta = delta + (delta div numpoints) 522*ae771770SStanislav Sedov let k = 0 523*ae771770SStanislav Sedov while delta > ((base - tmin) * tmax) div 2 do begin 524*ae771770SStanislav Sedov let delta = delta div (base - tmin) 525*ae771770SStanislav Sedov let k = k + base 526*ae771770SStanislav Sedov end 527*ae771770SStanislav Sedov return k + (((base - tmin + 1) * delta) div (delta + skew)) 528*ae771770SStanislav Sedov 529*ae771770SStanislav Sedov It does not matter whether the modifications to delta and k inside 530*ae771770SStanislav Sedov adapt() affect variables of the same name inside the 531*ae771770SStanislav Sedov encoding/decoding procedures, because after calling adapt() the 532*ae771770SStanislav Sedov caller does not read those variables before overwriting them. 533*ae771770SStanislav Sedov 534*ae771770SStanislav Sedov 535*ae771770SStanislav Sedov 536*ae771770SStanislav Sedov 537*ae771770SStanislav Sedov 538*ae771770SStanislav Sedov 539*ae771770SStanislav Sedov 540*ae771770SStanislav Sedov 541*ae771770SStanislav Sedov 542*ae771770SStanislav Sedov 543*ae771770SStanislav Sedov 544*ae771770SStanislav Sedov 545*ae771770SStanislav Sedov 546*ae771770SStanislav Sedov 547*ae771770SStanislav Sedov 548*ae771770SStanislav Sedov 549*ae771770SStanislav Sedov 550*ae771770SStanislav Sedov 551*ae771770SStanislav Sedov 552*ae771770SStanislav Sedov 553*ae771770SStanislav Sedov 554*ae771770SStanislav Sedov 555*ae771770SStanislav Sedov 556*ae771770SStanislav Sedov 557*ae771770SStanislav Sedov 558*ae771770SStanislav Sedov 559*ae771770SStanislav Sedov 560*ae771770SStanislav Sedov 561*ae771770SStanislav Sedov 562*ae771770SStanislav SedovCostello Standards Track [Page 10] 563*ae771770SStanislav Sedov 564*ae771770SStanislav SedovRFC 3492 IDNA Punycode March 2003 565*ae771770SStanislav Sedov 566*ae771770SStanislav Sedov 567*ae771770SStanislav Sedov6.2 Decoding procedure 568*ae771770SStanislav Sedov 569*ae771770SStanislav Sedov let n = initial_n 570*ae771770SStanislav Sedov let i = 0 571*ae771770SStanislav Sedov let bias = initial_bias 572*ae771770SStanislav Sedov let output = an empty string indexed from 0 573*ae771770SStanislav Sedov consume all code points before the last delimiter (if there is one) 574*ae771770SStanislav Sedov and copy them to output, fail on any non-basic code point 575*ae771770SStanislav Sedov if more than zero code points were consumed then consume one more 576*ae771770SStanislav Sedov (which will be the last delimiter) 577*ae771770SStanislav Sedov while the input is not exhausted do begin 578*ae771770SStanislav Sedov let oldi = i 579*ae771770SStanislav Sedov let w = 1 580*ae771770SStanislav Sedov for k = base to infinity in steps of base do begin 581*ae771770SStanislav Sedov consume a code point, or fail if there was none to consume 582*ae771770SStanislav Sedov let digit = the code point's digit-value, fail if it has none 583*ae771770SStanislav Sedov let i = i + digit * w, fail on overflow 584*ae771770SStanislav Sedov let t = tmin if k <= bias {+ tmin}, or 585*ae771770SStanislav Sedov tmax if k >= bias + tmax, or k - bias otherwise 586*ae771770SStanislav Sedov if digit < t then break 587*ae771770SStanislav Sedov let w = w * (base - t), fail on overflow 588*ae771770SStanislav Sedov end 589*ae771770SStanislav Sedov let bias = adapt(i - oldi, length(output) + 1, test oldi is 0?) 590*ae771770SStanislav Sedov let n = n + i div (length(output) + 1), fail on overflow 591*ae771770SStanislav Sedov let i = i mod (length(output) + 1) 592*ae771770SStanislav Sedov {if n is a basic code point then fail} 593*ae771770SStanislav Sedov insert n into output at position i 594*ae771770SStanislav Sedov increment i 595*ae771770SStanislav Sedov end 596*ae771770SStanislav Sedov 597*ae771770SStanislav Sedov The full statement enclosed in braces (checking whether n is a basic 598*ae771770SStanislav Sedov code point) can be omitted if initial_n exceeds all basic code points 599*ae771770SStanislav Sedov (which is true for Punycode), because n is never less than initial_n. 600*ae771770SStanislav Sedov 601*ae771770SStanislav Sedov In the assignment of t, where t is clamped to the range tmin through 602*ae771770SStanislav Sedov tmax, "+ tmin" can always be omitted. This makes the clamping 603*ae771770SStanislav Sedov calculation incorrect when bias < k < bias + tmin, but that cannot 604*ae771770SStanislav Sedov happen because of the way bias is computed and because of the 605*ae771770SStanislav Sedov constraints on the parameters. 606*ae771770SStanislav Sedov 607*ae771770SStanislav Sedov Because the decoder state can only advance monotonically, and there 608*ae771770SStanislav Sedov is only one representation of any delta, there is therefore only one 609*ae771770SStanislav Sedov encoded string that can represent a given sequence of integers. The 610*ae771770SStanislav Sedov only error conditions are invalid code points, unexpected end-of- 611*ae771770SStanislav Sedov input, overflow, and basic code points encoded using deltas instead 612*ae771770SStanislav Sedov of appearing literally. If the decoder fails on these errors as 613*ae771770SStanislav Sedov shown above, then it cannot produce the same output for two distinct 614*ae771770SStanislav Sedov inputs. Without this property it would have been necessary to re- 615*ae771770SStanislav Sedov 616*ae771770SStanislav Sedov 617*ae771770SStanislav Sedov 618*ae771770SStanislav SedovCostello Standards Track [Page 11] 619*ae771770SStanislav Sedov 620*ae771770SStanislav SedovRFC 3492 IDNA Punycode March 2003 621*ae771770SStanislav Sedov 622*ae771770SStanislav Sedov 623*ae771770SStanislav Sedov encode the output and verify that it matches the input in order to 624*ae771770SStanislav Sedov guarantee the uniqueness of the encoding. 625*ae771770SStanislav Sedov 626*ae771770SStanislav Sedov6.3 Encoding procedure 627*ae771770SStanislav Sedov 628*ae771770SStanislav Sedov let n = initial_n 629*ae771770SStanislav Sedov let delta = 0 630*ae771770SStanislav Sedov let bias = initial_bias 631*ae771770SStanislav Sedov let h = b = the number of basic code points in the input 632*ae771770SStanislav Sedov copy them to the output in order, followed by a delimiter if b > 0 633*ae771770SStanislav Sedov {if the input contains a non-basic code point < n then fail} 634*ae771770SStanislav Sedov while h < length(input) do begin 635*ae771770SStanislav Sedov let m = the minimum {non-basic} code point >= n in the input 636*ae771770SStanislav Sedov let delta = delta + (m - n) * (h + 1), fail on overflow 637*ae771770SStanislav Sedov let n = m 638*ae771770SStanislav Sedov for each code point c in the input (in order) do begin 639*ae771770SStanislav Sedov if c < n {or c is basic} then increment delta, fail on overflow 640*ae771770SStanislav Sedov if c == n then begin 641*ae771770SStanislav Sedov let q = delta 642*ae771770SStanislav Sedov for k = base to infinity in steps of base do begin 643*ae771770SStanislav Sedov let t = tmin if k <= bias {+ tmin}, or 644*ae771770SStanislav Sedov tmax if k >= bias + tmax, or k - bias otherwise 645*ae771770SStanislav Sedov if q < t then break 646*ae771770SStanislav Sedov output the code point for digit t + ((q - t) mod (base - t)) 647*ae771770SStanislav Sedov let q = (q - t) div (base - t) 648*ae771770SStanislav Sedov end 649*ae771770SStanislav Sedov output the code point for digit q 650*ae771770SStanislav Sedov let bias = adapt(delta, h + 1, test h equals b?) 651*ae771770SStanislav Sedov let delta = 0 652*ae771770SStanislav Sedov increment h 653*ae771770SStanislav Sedov end 654*ae771770SStanislav Sedov end 655*ae771770SStanislav Sedov increment delta and n 656*ae771770SStanislav Sedov end 657*ae771770SStanislav Sedov 658*ae771770SStanislav Sedov The full statement enclosed in braces (checking whether the input 659*ae771770SStanislav Sedov contains a non-basic code point less than n) can be omitted if all 660*ae771770SStanislav Sedov code points less than initial_n are basic code points (which is true 661*ae771770SStanislav Sedov for Punycode if code points are unsigned). 662*ae771770SStanislav Sedov 663*ae771770SStanislav Sedov The brace-enclosed conditions "non-basic" and "or c is basic" can be 664*ae771770SStanislav Sedov omitted if initial_n exceeds all basic code points (which is true for 665*ae771770SStanislav Sedov Punycode), because the code point being tested is never less than 666*ae771770SStanislav Sedov initial_n. 667*ae771770SStanislav Sedov 668*ae771770SStanislav Sedov In the assignment of t, where t is clamped to the range tmin through 669*ae771770SStanislav Sedov tmax, "+ tmin" can always be omitted. This makes the clamping 670*ae771770SStanislav Sedov calculation incorrect when bias < k < bias + tmin, but that cannot 671*ae771770SStanislav Sedov 672*ae771770SStanislav Sedov 673*ae771770SStanislav Sedov 674*ae771770SStanislav SedovCostello Standards Track [Page 12] 675*ae771770SStanislav Sedov 676*ae771770SStanislav SedovRFC 3492 IDNA Punycode March 2003 677*ae771770SStanislav Sedov 678*ae771770SStanislav Sedov 679*ae771770SStanislav Sedov happen because of the way bias is computed and because of the 680*ae771770SStanislav Sedov constraints on the parameters. 681*ae771770SStanislav Sedov 682*ae771770SStanislav Sedov The checks for overflow are necessary to avoid producing invalid 683*ae771770SStanislav Sedov output when the input contains very large values or is very long. 684*ae771770SStanislav Sedov 685*ae771770SStanislav Sedov The increment of delta at the bottom of the outer loop cannot 686*ae771770SStanislav Sedov overflow because delta < length(input) before the increment, and 687*ae771770SStanislav Sedov length(input) is already assumed to be representable. The increment 688*ae771770SStanislav Sedov of n could overflow, but only if h == length(input), in which case 689*ae771770SStanislav Sedov the procedure is finished anyway. 690*ae771770SStanislav Sedov 691*ae771770SStanislav Sedov6.4 Overflow handling 692*ae771770SStanislav Sedov 693*ae771770SStanislav Sedov For IDNA, 26-bit unsigned integers are sufficient to handle all valid 694*ae771770SStanislav Sedov IDNA labels without overflow, because any string that needed a 27-bit 695*ae771770SStanislav Sedov delta would have to exceed either the code point limit (0..10FFFF) or 696*ae771770SStanislav Sedov the label length limit (63 characters). However, overflow handling 697*ae771770SStanislav Sedov is necessary because the inputs are not necessarily valid IDNA 698*ae771770SStanislav Sedov labels. 699*ae771770SStanislav Sedov 700*ae771770SStanislav Sedov If the programming language does not provide overflow detection, the 701*ae771770SStanislav Sedov following technique can be used. Suppose A, B, and C are 702*ae771770SStanislav Sedov representable nonnegative integers and C is nonzero. Then A + B 703*ae771770SStanislav Sedov overflows if and only if B > maxint - A, and A + (B * C) overflows if 704*ae771770SStanislav Sedov and only if B > (maxint - A) div C, where maxint is the greatest 705*ae771770SStanislav Sedov integer for which maxint + 1 cannot be represented. Refer to 706*ae771770SStanislav Sedov appendix C "Punycode sample implementation" for demonstrations of 707*ae771770SStanislav Sedov this technique in the C language. 708*ae771770SStanislav Sedov 709*ae771770SStanislav Sedov The decoding and encoding algorithms shown in sections 6.2 and 6.3 710*ae771770SStanislav Sedov handle overflow by detecting it whenever it happens. Another 711*ae771770SStanislav Sedov approach is to enforce limits on the inputs that prevent overflow 712*ae771770SStanislav Sedov from happening. For example, if the encoder were to verify that no 713*ae771770SStanislav Sedov input code points exceed M and that the input length does not exceed 714*ae771770SStanislav Sedov L, then no delta could ever exceed (M - initial_n) * (L + 1), and 715*ae771770SStanislav Sedov hence no overflow could occur if integer variables were capable of 716*ae771770SStanislav Sedov representing values that large. This prevention approach would 717*ae771770SStanislav Sedov impose more restrictions on the input than the detection approach 718*ae771770SStanislav Sedov does, but might be considered simpler in some programming languages. 719*ae771770SStanislav Sedov 720*ae771770SStanislav Sedov In theory, the decoder could use an analogous approach, limiting the 721*ae771770SStanislav Sedov number of digits in a variable-length integer (that is, limiting the 722*ae771770SStanislav Sedov number of iterations in the innermost loop). However, the number of 723*ae771770SStanislav Sedov digits that suffice to represent a given delta can sometimes 724*ae771770SStanislav Sedov represent much larger deltas (because of the adaptation), and hence 725*ae771770SStanislav Sedov this approach would probably need integers wider than 32 bits. 726*ae771770SStanislav Sedov 727*ae771770SStanislav Sedov 728*ae771770SStanislav Sedov 729*ae771770SStanislav Sedov 730*ae771770SStanislav SedovCostello Standards Track [Page 13] 731*ae771770SStanislav Sedov 732*ae771770SStanislav SedovRFC 3492 IDNA Punycode March 2003 733*ae771770SStanislav Sedov 734*ae771770SStanislav Sedov 735*ae771770SStanislav Sedov Yet another approach for the decoder is to allow overflow to occur, 736*ae771770SStanislav Sedov but to check the final output string by re-encoding it and comparing 737*ae771770SStanislav Sedov to the decoder input. If and only if they do not match (using a 738*ae771770SStanislav Sedov case-insensitive ASCII comparison) overflow has occurred. This 739*ae771770SStanislav Sedov delayed-detection approach would not impose any more restrictions on 740*ae771770SStanislav Sedov the input than the immediate-detection approach does, and might be 741*ae771770SStanislav Sedov considered simpler in some programming languages. 742*ae771770SStanislav Sedov 743*ae771770SStanislav Sedov In fact, if the decoder is used only inside the IDNA ToUnicode 744*ae771770SStanislav Sedov operation [IDNA], then it need not check for overflow at all, because 745*ae771770SStanislav Sedov ToUnicode performs a higher level re-encoding and comparison, and a 746*ae771770SStanislav Sedov mismatch has the same consequence as if the Punycode decoder had 747*ae771770SStanislav Sedov failed. 748*ae771770SStanislav Sedov 749*ae771770SStanislav Sedov7. Punycode examples 750*ae771770SStanislav Sedov 751*ae771770SStanislav Sedov7.1 Sample strings 752*ae771770SStanislav Sedov 753*ae771770SStanislav Sedov In the Punycode encodings below, the ACE prefix is not shown. 754*ae771770SStanislav Sedov Backslashes show where line breaks have been inserted in strings too 755*ae771770SStanislav Sedov long for one line. 756*ae771770SStanislav Sedov 757*ae771770SStanislav Sedov The first several examples are all translations of the sentence "Why 758*ae771770SStanislav Sedov can't they just speak in <language>?" (courtesy of Michael Kaplan's 759*ae771770SStanislav Sedov "provincial" page [PROVINCIAL]). Word breaks and punctuation have 760*ae771770SStanislav Sedov been removed, as is often done in domain names. 761*ae771770SStanislav Sedov 762*ae771770SStanislav Sedov (A) Arabic (Egyptian): 763*ae771770SStanislav Sedov u+0644 u+064A u+0647 u+0645 u+0627 u+0628 u+062A u+0643 u+0644 764*ae771770SStanislav Sedov u+0645 u+0648 u+0634 u+0639 u+0631 u+0628 u+064A u+061F 765*ae771770SStanislav Sedov Punycode: egbpdaj6bu4bxfgehfvwxn 766*ae771770SStanislav Sedov 767*ae771770SStanislav Sedov (B) Chinese (simplified): 768*ae771770SStanislav Sedov u+4ED6 u+4EEC u+4E3A u+4EC0 u+4E48 u+4E0D u+8BF4 u+4E2D u+6587 769*ae771770SStanislav Sedov Punycode: ihqwcrb4cv8a8dqg056pqjye 770*ae771770SStanislav Sedov 771*ae771770SStanislav Sedov (C) Chinese (traditional): 772*ae771770SStanislav Sedov u+4ED6 u+5011 u+7232 u+4EC0 u+9EBD u+4E0D u+8AAA u+4E2D u+6587 773*ae771770SStanislav Sedov Punycode: ihqwctvzc91f659drss3x8bo0yb 774*ae771770SStanislav Sedov 775*ae771770SStanislav Sedov (D) Czech: Pro<ccaron>prost<ecaron>nemluv<iacute><ccaron>esky 776*ae771770SStanislav Sedov U+0050 u+0072 u+006F u+010D u+0070 u+0072 u+006F u+0073 u+0074 777*ae771770SStanislav Sedov u+011B u+006E u+0065 u+006D u+006C u+0075 u+0076 u+00ED u+010D 778*ae771770SStanislav Sedov u+0065 u+0073 u+006B u+0079 779*ae771770SStanislav Sedov Punycode: Proprostnemluvesky-uyb24dma41a 780*ae771770SStanislav Sedov 781*ae771770SStanislav Sedov 782*ae771770SStanislav Sedov 783*ae771770SStanislav Sedov 784*ae771770SStanislav Sedov 785*ae771770SStanislav Sedov 786*ae771770SStanislav SedovCostello Standards Track [Page 14] 787*ae771770SStanislav Sedov 788*ae771770SStanislav SedovRFC 3492 IDNA Punycode March 2003 789*ae771770SStanislav Sedov 790*ae771770SStanislav Sedov 791*ae771770SStanislav Sedov (E) Hebrew: 792*ae771770SStanislav Sedov u+05DC u+05DE u+05D4 u+05D4 u+05DD u+05E4 u+05E9 u+05D5 u+05D8 793*ae771770SStanislav Sedov u+05DC u+05D0 u+05DE u+05D3 u+05D1 u+05E8 u+05D9 u+05DD u+05E2 794*ae771770SStanislav Sedov u+05D1 u+05E8 u+05D9 u+05EA 795*ae771770SStanislav Sedov Punycode: 4dbcagdahymbxekheh6e0a7fei0b 796*ae771770SStanislav Sedov 797*ae771770SStanislav Sedov (F) Hindi (Devanagari): 798*ae771770SStanislav Sedov u+092F u+0939 u+0932 u+094B u+0917 u+0939 u+093F u+0928 u+094D 799*ae771770SStanislav Sedov u+0926 u+0940 u+0915 u+094D u+092F u+094B u+0902 u+0928 u+0939 800*ae771770SStanislav Sedov u+0940 u+0902 u+092C u+094B u+0932 u+0938 u+0915 u+0924 u+0947 801*ae771770SStanislav Sedov u+0939 u+0948 u+0902 802*ae771770SStanislav Sedov Punycode: i1baa7eci9glrd9b2ae1bj0hfcgg6iyaf8o0a1dig0cd 803*ae771770SStanislav Sedov 804*ae771770SStanislav Sedov (G) Japanese (kanji and hiragana): 805*ae771770SStanislav Sedov u+306A u+305C u+307F u+3093 u+306A u+65E5 u+672C u+8A9E u+3092 806*ae771770SStanislav Sedov u+8A71 u+3057 u+3066 u+304F u+308C u+306A u+3044 u+306E u+304B 807*ae771770SStanislav Sedov Punycode: n8jok5ay5dzabd5bym9f0cm5685rrjetr6pdxa 808*ae771770SStanislav Sedov 809*ae771770SStanislav Sedov (H) Korean (Hangul syllables): 810*ae771770SStanislav Sedov u+C138 u+ACC4 u+C758 u+BAA8 u+B4E0 u+C0AC u+B78C u+B4E4 u+C774 811*ae771770SStanislav Sedov u+D55C u+AD6D u+C5B4 u+B97C u+C774 u+D574 u+D55C u+B2E4 u+BA74 812*ae771770SStanislav Sedov u+C5BC u+B9C8 u+B098 u+C88B u+C744 u+AE4C 813*ae771770SStanislav Sedov Punycode: 989aomsvi5e83db1d2a355cv1e0vak1dwrv93d5xbh15a0dt30a5j\ 814*ae771770SStanislav Sedov psd879ccm6fea98c 815*ae771770SStanislav Sedov 816*ae771770SStanislav Sedov (I) Russian (Cyrillic): 817*ae771770SStanislav Sedov U+043F u+043E u+0447 u+0435 u+043C u+0443 u+0436 u+0435 u+043E 818*ae771770SStanislav Sedov u+043D u+0438 u+043D u+0435 u+0433 u+043E u+0432 u+043E u+0440 819*ae771770SStanislav Sedov u+044F u+0442 u+043F u+043E u+0440 u+0443 u+0441 u+0441 u+043A 820*ae771770SStanislav Sedov u+0438 821*ae771770SStanislav Sedov Punycode: b1abfaaepdrnnbgefbaDotcwatmq2g4l 822*ae771770SStanislav Sedov 823*ae771770SStanislav Sedov (J) Spanish: Porqu<eacute>nopuedensimplementehablarenEspa<ntilde>ol 824*ae771770SStanislav Sedov U+0050 u+006F u+0072 u+0071 u+0075 u+00E9 u+006E u+006F u+0070 825*ae771770SStanislav Sedov u+0075 u+0065 u+0064 u+0065 u+006E u+0073 u+0069 u+006D u+0070 826*ae771770SStanislav Sedov u+006C u+0065 u+006D u+0065 u+006E u+0074 u+0065 u+0068 u+0061 827*ae771770SStanislav Sedov u+0062 u+006C u+0061 u+0072 u+0065 u+006E U+0045 u+0073 u+0070 828*ae771770SStanislav Sedov u+0061 u+00F1 u+006F u+006C 829*ae771770SStanislav Sedov Punycode: PorqunopuedensimplementehablarenEspaol-fmd56a 830*ae771770SStanislav Sedov 831*ae771770SStanislav Sedov (K) Vietnamese: 832*ae771770SStanislav Sedov T<adotbelow>isaoh<odotbelow>kh<ocirc>ngth<ecirchookabove>ch\ 833*ae771770SStanislav Sedov <ihookabove>n<oacute>iti<ecircacute>ngVi<ecircdotbelow>t 834*ae771770SStanislav Sedov U+0054 u+1EA1 u+0069 u+0073 u+0061 u+006F u+0068 u+1ECD u+006B 835*ae771770SStanislav Sedov u+0068 u+00F4 u+006E u+0067 u+0074 u+0068 u+1EC3 u+0063 u+0068 836*ae771770SStanislav Sedov u+1EC9 u+006E u+00F3 u+0069 u+0074 u+0069 u+1EBF u+006E u+0067 837*ae771770SStanislav Sedov U+0056 u+0069 u+1EC7 u+0074 838*ae771770SStanislav Sedov Punycode: TisaohkhngthchnitingVit-kjcr8268qyxafd2f1b9g 839*ae771770SStanislav Sedov 840*ae771770SStanislav Sedov 841*ae771770SStanislav Sedov 842*ae771770SStanislav SedovCostello Standards Track [Page 15] 843*ae771770SStanislav Sedov 844*ae771770SStanislav SedovRFC 3492 IDNA Punycode March 2003 845*ae771770SStanislav Sedov 846*ae771770SStanislav Sedov 847*ae771770SStanislav Sedov The next several examples are all names of Japanese music artists, 848*ae771770SStanislav Sedov song titles, and TV programs, just because the author happens to have 849*ae771770SStanislav Sedov them handy (but Japanese is useful for providing examples of single- 850*ae771770SStanislav Sedov row text, two-row text, ideographic text, and various mixtures 851*ae771770SStanislav Sedov thereof). 852*ae771770SStanislav Sedov 853*ae771770SStanislav Sedov (L) 3<nen>B<gumi><kinpachi><sensei> 854*ae771770SStanislav Sedov u+0033 u+5E74 U+0042 u+7D44 u+91D1 u+516B u+5148 u+751F 855*ae771770SStanislav Sedov Punycode: 3B-ww4c5e180e575a65lsy2b 856*ae771770SStanislav Sedov 857*ae771770SStanislav Sedov (M) <amuro><namie>-with-SUPER-MONKEYS 858*ae771770SStanislav Sedov u+5B89 u+5BA4 u+5948 u+7F8E u+6075 u+002D u+0077 u+0069 u+0074 859*ae771770SStanislav Sedov u+0068 u+002D U+0053 U+0055 U+0050 U+0045 U+0052 u+002D U+004D 860*ae771770SStanislav Sedov U+004F U+004E U+004B U+0045 U+0059 U+0053 861*ae771770SStanislav Sedov Punycode: -with-SUPER-MONKEYS-pc58ag80a8qai00g7n9n 862*ae771770SStanislav Sedov 863*ae771770SStanislav Sedov (N) Hello-Another-Way-<sorezore><no><basho> 864*ae771770SStanislav Sedov U+0048 u+0065 u+006C u+006C u+006F u+002D U+0041 u+006E u+006F 865*ae771770SStanislav Sedov u+0074 u+0068 u+0065 u+0072 u+002D U+0057 u+0061 u+0079 u+002D 866*ae771770SStanislav Sedov u+305D u+308C u+305E u+308C u+306E u+5834 u+6240 867*ae771770SStanislav Sedov Punycode: Hello-Another-Way--fc4qua05auwb3674vfr0b 868*ae771770SStanislav Sedov 869*ae771770SStanislav Sedov (O) <hitotsu><yane><no><shita>2 870*ae771770SStanislav Sedov u+3072 u+3068 u+3064 u+5C4B u+6839 u+306E u+4E0B u+0032 871*ae771770SStanislav Sedov Punycode: 2-u9tlzr9756bt3uc0v 872*ae771770SStanislav Sedov 873*ae771770SStanislav Sedov (P) Maji<de>Koi<suru>5<byou><mae> 874*ae771770SStanislav Sedov U+004D u+0061 u+006A u+0069 u+3067 U+004B u+006F u+0069 u+3059 875*ae771770SStanislav Sedov u+308B u+0035 u+79D2 u+524D 876*ae771770SStanislav Sedov Punycode: MajiKoi5-783gue6qz075azm5e 877*ae771770SStanislav Sedov 878*ae771770SStanislav Sedov (Q) <pafii>de<runba> 879*ae771770SStanislav Sedov u+30D1 u+30D5 u+30A3 u+30FC u+0064 u+0065 u+30EB u+30F3 u+30D0 880*ae771770SStanislav Sedov Punycode: de-jg4avhby1noc0d 881*ae771770SStanislav Sedov 882*ae771770SStanislav Sedov (R) <sono><supiido><de> 883*ae771770SStanislav Sedov u+305D u+306E u+30B9 u+30D4 u+30FC u+30C9 u+3067 884*ae771770SStanislav Sedov Punycode: d9juau41awczczp 885*ae771770SStanislav Sedov 886*ae771770SStanislav Sedov The last example is an ASCII string that breaks the existing rules 887*ae771770SStanislav Sedov for host name labels. (It is not a realistic example for IDNA, 888*ae771770SStanislav Sedov because IDNA never encodes pure ASCII labels.) 889*ae771770SStanislav Sedov 890*ae771770SStanislav Sedov (S) -> $1.00 <- 891*ae771770SStanislav Sedov u+002D u+003E u+0020 u+0024 u+0031 u+002E u+0030 u+0030 u+0020 892*ae771770SStanislav Sedov u+003C u+002D 893*ae771770SStanislav Sedov Punycode: -> $1.00 <-- 894*ae771770SStanislav Sedov 895*ae771770SStanislav Sedov 896*ae771770SStanislav Sedov 897*ae771770SStanislav Sedov 898*ae771770SStanislav SedovCostello Standards Track [Page 16] 899*ae771770SStanislav Sedov 900*ae771770SStanislav SedovRFC 3492 IDNA Punycode March 2003 901*ae771770SStanislav Sedov 902*ae771770SStanislav Sedov 903*ae771770SStanislav Sedov7.2 Decoding traces 904*ae771770SStanislav Sedov 905*ae771770SStanislav Sedov In the following traces, the evolving state of the decoder is shown 906*ae771770SStanislav Sedov as a sequence of hexadecimal values, representing the code points in 907*ae771770SStanislav Sedov the extended string. An asterisk appears just after the most 908*ae771770SStanislav Sedov recently inserted code point, indicating both n (the value preceeding 909*ae771770SStanislav Sedov the asterisk) and i (the position of the value just after the 910*ae771770SStanislav Sedov asterisk). Other numerical values are decimal. 911*ae771770SStanislav Sedov 912*ae771770SStanislav Sedov Decoding trace of example B from section 7.1: 913*ae771770SStanislav Sedov 914*ae771770SStanislav Sedov n is 128, i is 0, bias is 72 915*ae771770SStanislav Sedov input is "ihqwcrb4cv8a8dqg056pqjye" 916*ae771770SStanislav Sedov there is no delimiter, so extended string starts empty 917*ae771770SStanislav Sedov delta "ihq" decodes to 19853 918*ae771770SStanislav Sedov bias becomes 21 919*ae771770SStanislav Sedov 4E0D * 920*ae771770SStanislav Sedov delta "wc" decodes to 64 921*ae771770SStanislav Sedov bias becomes 20 922*ae771770SStanislav Sedov 4E0D 4E2D * 923*ae771770SStanislav Sedov delta "rb" decodes to 37 924*ae771770SStanislav Sedov bias becomes 13 925*ae771770SStanislav Sedov 4E3A * 4E0D 4E2D 926*ae771770SStanislav Sedov delta "4c" decodes to 56 927*ae771770SStanislav Sedov bias becomes 17 928*ae771770SStanislav Sedov 4E3A 4E48 * 4E0D 4E2D 929*ae771770SStanislav Sedov delta "v8a" decodes to 599 930*ae771770SStanislav Sedov bias becomes 32 931*ae771770SStanislav Sedov 4E3A 4EC0 * 4E48 4E0D 4E2D 932*ae771770SStanislav Sedov delta "8d" decodes to 130 933*ae771770SStanislav Sedov bias becomes 23 934*ae771770SStanislav Sedov 4ED6 * 4E3A 4EC0 4E48 4E0D 4E2D 935*ae771770SStanislav Sedov delta "qg" decodes to 154 936*ae771770SStanislav Sedov bias becomes 25 937*ae771770SStanislav Sedov 4ED6 4EEC * 4E3A 4EC0 4E48 4E0D 4E2D 938*ae771770SStanislav Sedov delta "056p" decodes to 46301 939*ae771770SStanislav Sedov bias becomes 84 940*ae771770SStanislav Sedov 4ED6 4EEC 4E3A 4EC0 4E48 4E0D 4E2D 6587 * 941*ae771770SStanislav Sedov delta "qjye" decodes to 88531 942*ae771770SStanislav Sedov bias becomes 90 943*ae771770SStanislav Sedov 4ED6 4EEC 4E3A 4EC0 4E48 4E0D 8BF4 * 4E2D 6587 944*ae771770SStanislav Sedov 945*ae771770SStanislav Sedov 946*ae771770SStanislav Sedov 947*ae771770SStanislav Sedov 948*ae771770SStanislav Sedov 949*ae771770SStanislav Sedov 950*ae771770SStanislav Sedov 951*ae771770SStanislav Sedov 952*ae771770SStanislav Sedov 953*ae771770SStanislav Sedov 954*ae771770SStanislav SedovCostello Standards Track [Page 17] 955*ae771770SStanislav Sedov 956*ae771770SStanislav SedovRFC 3492 IDNA Punycode March 2003 957*ae771770SStanislav Sedov 958*ae771770SStanislav Sedov 959*ae771770SStanislav Sedov Decoding trace of example L from section 7.1: 960*ae771770SStanislav Sedov 961*ae771770SStanislav Sedov n is 128, i is 0, bias is 72 962*ae771770SStanislav Sedov input is "3B-ww4c5e180e575a65lsy2b" 963*ae771770SStanislav Sedov literal portion is "3B-", so extended string starts as: 964*ae771770SStanislav Sedov 0033 0042 965*ae771770SStanislav Sedov delta "ww4c" decodes to 62042 966*ae771770SStanislav Sedov bias becomes 27 967*ae771770SStanislav Sedov 0033 0042 5148 * 968*ae771770SStanislav Sedov delta "5e" decodes to 139 969*ae771770SStanislav Sedov bias becomes 24 970*ae771770SStanislav Sedov 0033 0042 516B * 5148 971*ae771770SStanislav Sedov delta "180e" decodes to 16683 972*ae771770SStanislav Sedov bias becomes 67 973*ae771770SStanislav Sedov 0033 5E74 * 0042 516B 5148 974*ae771770SStanislav Sedov delta "575a" decodes to 34821 975*ae771770SStanislav Sedov bias becomes 82 976*ae771770SStanislav Sedov 0033 5E74 0042 516B 5148 751F * 977*ae771770SStanislav Sedov delta "65l" decodes to 14592 978*ae771770SStanislav Sedov bias becomes 67 979*ae771770SStanislav Sedov 0033 5E74 0042 7D44 * 516B 5148 751F 980*ae771770SStanislav Sedov delta "sy2b" decodes to 42088 981*ae771770SStanislav Sedov bias becomes 84 982*ae771770SStanislav Sedov 0033 5E74 0042 7D44 91D1 * 516B 5148 751F 983*ae771770SStanislav Sedov 984*ae771770SStanislav Sedov 985*ae771770SStanislav Sedov 986*ae771770SStanislav Sedov 987*ae771770SStanislav Sedov 988*ae771770SStanislav Sedov 989*ae771770SStanislav Sedov 990*ae771770SStanislav Sedov 991*ae771770SStanislav Sedov 992*ae771770SStanislav Sedov 993*ae771770SStanislav Sedov 994*ae771770SStanislav Sedov 995*ae771770SStanislav Sedov 996*ae771770SStanislav Sedov 997*ae771770SStanislav Sedov 998*ae771770SStanislav Sedov 999*ae771770SStanislav Sedov 1000*ae771770SStanislav Sedov 1001*ae771770SStanislav Sedov 1002*ae771770SStanislav Sedov 1003*ae771770SStanislav Sedov 1004*ae771770SStanislav Sedov 1005*ae771770SStanislav Sedov 1006*ae771770SStanislav Sedov 1007*ae771770SStanislav Sedov 1008*ae771770SStanislav Sedov 1009*ae771770SStanislav Sedov 1010*ae771770SStanislav SedovCostello Standards Track [Page 18] 1011*ae771770SStanislav Sedov 1012*ae771770SStanislav SedovRFC 3492 IDNA Punycode March 2003 1013*ae771770SStanislav Sedov 1014*ae771770SStanislav Sedov 1015*ae771770SStanislav Sedov7.3 Encoding traces 1016*ae771770SStanislav Sedov 1017*ae771770SStanislav Sedov In the following traces, code point values are hexadecimal, while 1018*ae771770SStanislav Sedov other numerical values are decimal. 1019*ae771770SStanislav Sedov 1020*ae771770SStanislav Sedov Encoding trace of example B from section 7.1: 1021*ae771770SStanislav Sedov 1022*ae771770SStanislav Sedov bias is 72 1023*ae771770SStanislav Sedov input is: 1024*ae771770SStanislav Sedov 4ED6 4EEC 4E3A 4EC0 4E48 4E0D 8BF4 4E2D 6587 1025*ae771770SStanislav Sedov there are no basic code points, so no literal portion 1026*ae771770SStanislav Sedov next code point to insert is 4E0D 1027*ae771770SStanislav Sedov needed delta is 19853, encodes as "ihq" 1028*ae771770SStanislav Sedov bias becomes 21 1029*ae771770SStanislav Sedov next code point to insert is 4E2D 1030*ae771770SStanislav Sedov needed delta is 64, encodes as "wc" 1031*ae771770SStanislav Sedov bias becomes 20 1032*ae771770SStanislav Sedov next code point to insert is 4E3A 1033*ae771770SStanislav Sedov needed delta is 37, encodes as "rb" 1034*ae771770SStanislav Sedov bias becomes 13 1035*ae771770SStanislav Sedov next code point to insert is 4E48 1036*ae771770SStanislav Sedov needed delta is 56, encodes as "4c" 1037*ae771770SStanislav Sedov bias becomes 17 1038*ae771770SStanislav Sedov next code point to insert is 4EC0 1039*ae771770SStanislav Sedov needed delta is 599, encodes as "v8a" 1040*ae771770SStanislav Sedov bias becomes 32 1041*ae771770SStanislav Sedov next code point to insert is 4ED6 1042*ae771770SStanislav Sedov needed delta is 130, encodes as "8d" 1043*ae771770SStanislav Sedov bias becomes 23 1044*ae771770SStanislav Sedov next code point to insert is 4EEC 1045*ae771770SStanislav Sedov needed delta is 154, encodes as "qg" 1046*ae771770SStanislav Sedov bias becomes 25 1047*ae771770SStanislav Sedov next code point to insert is 6587 1048*ae771770SStanislav Sedov needed delta is 46301, encodes as "056p" 1049*ae771770SStanislav Sedov bias becomes 84 1050*ae771770SStanislav Sedov next code point to insert is 8BF4 1051*ae771770SStanislav Sedov needed delta is 88531, encodes as "qjye" 1052*ae771770SStanislav Sedov bias becomes 90 1053*ae771770SStanislav Sedov output is "ihqwcrb4cv8a8dqg056pqjye" 1054*ae771770SStanislav Sedov 1055*ae771770SStanislav Sedov 1056*ae771770SStanislav Sedov 1057*ae771770SStanislav Sedov 1058*ae771770SStanislav Sedov 1059*ae771770SStanislav Sedov 1060*ae771770SStanislav Sedov 1061*ae771770SStanislav Sedov 1062*ae771770SStanislav Sedov 1063*ae771770SStanislav Sedov 1064*ae771770SStanislav Sedov 1065*ae771770SStanislav Sedov 1066*ae771770SStanislav SedovCostello Standards Track [Page 19] 1067*ae771770SStanislav Sedov 1068*ae771770SStanislav SedovRFC 3492 IDNA Punycode March 2003 1069*ae771770SStanislav Sedov 1070*ae771770SStanislav Sedov 1071*ae771770SStanislav Sedov Encoding trace of example L from section 7.1: 1072*ae771770SStanislav Sedov 1073*ae771770SStanislav Sedov bias is 72 1074*ae771770SStanislav Sedov input is: 1075*ae771770SStanislav Sedov 0033 5E74 0042 7D44 91D1 516B 5148 751F 1076*ae771770SStanislav Sedov basic code points (0033, 0042) are copied to literal portion: "3B-" 1077*ae771770SStanislav Sedov next code point to insert is 5148 1078*ae771770SStanislav Sedov needed delta is 62042, encodes as "ww4c" 1079*ae771770SStanislav Sedov bias becomes 27 1080*ae771770SStanislav Sedov next code point to insert is 516B 1081*ae771770SStanislav Sedov needed delta is 139, encodes as "5e" 1082*ae771770SStanislav Sedov bias becomes 24 1083*ae771770SStanislav Sedov next code point to insert is 5E74 1084*ae771770SStanislav Sedov needed delta is 16683, encodes as "180e" 1085*ae771770SStanislav Sedov bias becomes 67 1086*ae771770SStanislav Sedov next code point to insert is 751F 1087*ae771770SStanislav Sedov needed delta is 34821, encodes as "575a" 1088*ae771770SStanislav Sedov bias becomes 82 1089*ae771770SStanislav Sedov next code point to insert is 7D44 1090*ae771770SStanislav Sedov needed delta is 14592, encodes as "65l" 1091*ae771770SStanislav Sedov bias becomes 67 1092*ae771770SStanislav Sedov next code point to insert is 91D1 1093*ae771770SStanislav Sedov needed delta is 42088, encodes as "sy2b" 1094*ae771770SStanislav Sedov bias becomes 84 1095*ae771770SStanislav Sedov output is "3B-ww4c5e180e575a65lsy2b" 1096*ae771770SStanislav Sedov 1097*ae771770SStanislav Sedov8. Security Considerations 1098*ae771770SStanislav Sedov 1099*ae771770SStanislav Sedov Users expect each domain name in DNS to be controlled by a single 1100*ae771770SStanislav Sedov authority. If a Unicode string intended for use as a domain label 1101*ae771770SStanislav Sedov could map to multiple ACE labels, then an internationalized domain 1102*ae771770SStanislav Sedov name could map to multiple ASCII domain names, each controlled by a 1103*ae771770SStanislav Sedov different authority, some of which could be spoofs that hijack 1104*ae771770SStanislav Sedov service requests intended for another. Therefore Punycode is 1105*ae771770SStanislav Sedov designed so that each Unicode string has a unique encoding. 1106*ae771770SStanislav Sedov 1107*ae771770SStanislav Sedov However, there can still be multiple Unicode representations of the 1108*ae771770SStanislav Sedov "same" text, for various definitions of "same". This problem is 1109*ae771770SStanislav Sedov addressed to some extent by the Unicode standard under the topic of 1110*ae771770SStanislav Sedov canonicalization, and this work is leveraged for domain names by 1111*ae771770SStanislav Sedov Nameprep [NAMEPREP]. 1112*ae771770SStanislav Sedov 1113*ae771770SStanislav Sedov 1114*ae771770SStanislav Sedov 1115*ae771770SStanislav Sedov 1116*ae771770SStanislav Sedov 1117*ae771770SStanislav Sedov 1118*ae771770SStanislav Sedov 1119*ae771770SStanislav Sedov 1120*ae771770SStanislav Sedov 1121*ae771770SStanislav Sedov 1122*ae771770SStanislav SedovCostello Standards Track [Page 20] 1123*ae771770SStanislav Sedov 1124*ae771770SStanislav SedovRFC 3492 IDNA Punycode March 2003 1125*ae771770SStanislav Sedov 1126*ae771770SStanislav Sedov 1127*ae771770SStanislav Sedov9. References 1128*ae771770SStanislav Sedov 1129*ae771770SStanislav Sedov9.1 Normative References 1130*ae771770SStanislav Sedov 1131*ae771770SStanislav Sedov [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 1132*ae771770SStanislav Sedov Requirement Levels", BCP 14, RFC 2119, March 1997. 1133*ae771770SStanislav Sedov 1134*ae771770SStanislav Sedov9.2 Informative References 1135*ae771770SStanislav Sedov 1136*ae771770SStanislav Sedov [RFC952] Harrenstien, K., Stahl, M. and E. Feinler, "DOD Internet 1137*ae771770SStanislav Sedov Host Table Specification", RFC 952, October 1985. 1138*ae771770SStanislav Sedov 1139*ae771770SStanislav Sedov [RFC1034] Mockapetris, P., "Domain Names - Concepts and 1140*ae771770SStanislav Sedov Facilities", STD 13, RFC 1034, November 1987. 1141*ae771770SStanislav Sedov 1142*ae771770SStanislav Sedov [IDNA] Faltstrom, P., Hoffman, P. and A. Costello, 1143*ae771770SStanislav Sedov "Internationalizing Domain Names in Applications 1144*ae771770SStanislav Sedov (IDNA)", RFC 3490, March 2003. 1145*ae771770SStanislav Sedov 1146*ae771770SStanislav Sedov [NAMEPREP] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep 1147*ae771770SStanislav Sedov Profile for Internationalized Domain Names (IDN)", RFC 1148*ae771770SStanislav Sedov 3491, March 2003. 1149*ae771770SStanislav Sedov 1150*ae771770SStanislav Sedov [ASCII] Cerf, V., "ASCII format for Network Interchange", RFC 1151*ae771770SStanislav Sedov 20, October 1969. 1152*ae771770SStanislav Sedov 1153*ae771770SStanislav Sedov [PROVINCIAL] Kaplan, M., "The 'anyone can be provincial!' page", 1154*ae771770SStanislav Sedov http://www.trigeminal.com/samples/provincial.html. 1155*ae771770SStanislav Sedov 1156*ae771770SStanislav Sedov [UNICODE] The Unicode Consortium, "The Unicode Standard", 1157*ae771770SStanislav Sedov http://www.unicode.org/unicode/standard/standard.html. 1158*ae771770SStanislav Sedov 1159*ae771770SStanislav Sedov 1160*ae771770SStanislav Sedov 1161*ae771770SStanislav Sedov 1162*ae771770SStanislav Sedov 1163*ae771770SStanislav Sedov 1164*ae771770SStanislav Sedov 1165*ae771770SStanislav Sedov 1166*ae771770SStanislav Sedov 1167*ae771770SStanislav Sedov 1168*ae771770SStanislav Sedov 1169*ae771770SStanislav Sedov 1170*ae771770SStanislav Sedov 1171*ae771770SStanislav Sedov 1172*ae771770SStanislav Sedov 1173*ae771770SStanislav Sedov 1174*ae771770SStanislav Sedov 1175*ae771770SStanislav Sedov 1176*ae771770SStanislav Sedov 1177*ae771770SStanislav Sedov 1178*ae771770SStanislav SedovCostello Standards Track [Page 21] 1179*ae771770SStanislav Sedov 1180*ae771770SStanislav SedovRFC 3492 IDNA Punycode March 2003 1181*ae771770SStanislav Sedov 1182*ae771770SStanislav Sedov 1183*ae771770SStanislav SedovA. Mixed-case annotation 1184*ae771770SStanislav Sedov 1185*ae771770SStanislav Sedov In order to use Punycode to represent case-insensitive strings, 1186*ae771770SStanislav Sedov higher layers need to case-fold the strings prior to Punycode 1187*ae771770SStanislav Sedov encoding. The encoded string can use mixed case as an annotation 1188*ae771770SStanislav Sedov telling how to convert the folded string into a mixed-case string for 1189*ae771770SStanislav Sedov display purposes. Note, however, that mixed-case annotation is not 1190*ae771770SStanislav Sedov used by the ToASCII and ToUnicode operations specified in [IDNA], and 1191*ae771770SStanislav Sedov therefore implementors of IDNA can disregard this appendix. 1192*ae771770SStanislav Sedov 1193*ae771770SStanislav Sedov Basic code points can use mixed case directly, because the decoder 1194*ae771770SStanislav Sedov copies them verbatim, leaving lowercase code points lowercase, and 1195*ae771770SStanislav Sedov leaving uppercase code points uppercase. Each non-basic code point 1196*ae771770SStanislav Sedov is represented by a delta, which is represented by a sequence of 1197*ae771770SStanislav Sedov basic code points, the last of which provides the annotation. If it 1198*ae771770SStanislav Sedov is uppercase, it is a suggestion to map the non-basic code point to 1199*ae771770SStanislav Sedov uppercase (if possible); if it is lowercase, it is a suggestion to 1200*ae771770SStanislav Sedov map the non-basic code point to lowercase (if possible). 1201*ae771770SStanislav Sedov 1202*ae771770SStanislav Sedov These annotations do not alter the code points returned by decoders; 1203*ae771770SStanislav Sedov the annotations are returned separately, for the caller to use or 1204*ae771770SStanislav Sedov ignore. Encoders can accept annotations in addition to code points, 1205*ae771770SStanislav Sedov but the annotations do not alter the output, except to influence the 1206*ae771770SStanislav Sedov uppercase/lowercase form of ASCII letters. 1207*ae771770SStanislav Sedov 1208*ae771770SStanislav Sedov Punycode encoders and decoders need not support these annotations, 1209*ae771770SStanislav Sedov and higher layers need not use them. 1210*ae771770SStanislav Sedov 1211*ae771770SStanislav SedovB. Disclaimer and license 1212*ae771770SStanislav Sedov 1213*ae771770SStanislav Sedov Regarding this entire document or any portion of it (including the 1214*ae771770SStanislav Sedov pseudocode and C code), the author makes no guarantees and is not 1215*ae771770SStanislav Sedov responsible for any damage resulting from its use. The author grants 1216*ae771770SStanislav Sedov irrevocable permission to anyone to use, modify, and distribute it in 1217*ae771770SStanislav Sedov any way that does not diminish the rights of anyone else to use, 1218*ae771770SStanislav Sedov modify, and distribute it, provided that redistributed derivative 1219*ae771770SStanislav Sedov works do not contain misleading author or version information. 1220*ae771770SStanislav Sedov Derivative works need not be licensed under similar terms. 1221*ae771770SStanislav Sedov 1222*ae771770SStanislav Sedov 1223*ae771770SStanislav Sedov 1224*ae771770SStanislav Sedov 1225*ae771770SStanislav Sedov 1226*ae771770SStanislav Sedov 1227*ae771770SStanislav Sedov 1228*ae771770SStanislav Sedov 1229*ae771770SStanislav Sedov 1230*ae771770SStanislav Sedov 1231*ae771770SStanislav Sedov 1232*ae771770SStanislav Sedov 1233*ae771770SStanislav Sedov 1234*ae771770SStanislav SedovCostello Standards Track [Page 22] 1235*ae771770SStanislav Sedov 1236*ae771770SStanislav SedovRFC 3492 IDNA Punycode March 2003 1237*ae771770SStanislav Sedov 1238*ae771770SStanislav Sedov 1239*ae771770SStanislav SedovC. Punycode sample implementation 1240*ae771770SStanislav Sedov 1241*ae771770SStanislav Sedov/* 1242*ae771770SStanislav Sedovpunycode.c from RFC 3492 1243*ae771770SStanislav Sedovhttp://www.nicemice.net/idn/ 1244*ae771770SStanislav SedovAdam M. Costello 1245*ae771770SStanislav Sedovhttp://www.nicemice.net/amc/ 1246*ae771770SStanislav Sedov 1247*ae771770SStanislav SedovThis is ANSI C code (C89) implementing Punycode (RFC 3492). 1248*ae771770SStanislav Sedov 1249*ae771770SStanislav Sedov*/ 1250*ae771770SStanislav Sedov 1251*ae771770SStanislav Sedov 1252*ae771770SStanislav Sedov/************************************************************/ 1253*ae771770SStanislav Sedov/* Public interface (would normally go in its own .h file): */ 1254*ae771770SStanislav Sedov 1255*ae771770SStanislav Sedov#include <limits.h> 1256*ae771770SStanislav Sedov 1257*ae771770SStanislav Sedovenum punycode_status { 1258*ae771770SStanislav Sedov punycode_success, 1259*ae771770SStanislav Sedov punycode_bad_input, /* Input is invalid. */ 1260*ae771770SStanislav Sedov punycode_big_output, /* Output would exceed the space provided. */ 1261*ae771770SStanislav Sedov punycode_overflow /* Input needs wider integers to process. */ 1262*ae771770SStanislav Sedov}; 1263*ae771770SStanislav Sedov 1264*ae771770SStanislav Sedov#if UINT_MAX >= (1 << 26) - 1 1265*ae771770SStanislav Sedovtypedef unsigned int punycode_uint; 1266*ae771770SStanislav Sedov#else 1267*ae771770SStanislav Sedovtypedef unsigned long punycode_uint; 1268*ae771770SStanislav Sedov#endif 1269*ae771770SStanislav Sedov 1270*ae771770SStanislav Sedovenum punycode_status punycode_encode( 1271*ae771770SStanislav Sedov punycode_uint input_length, 1272*ae771770SStanislav Sedov const punycode_uint input[], 1273*ae771770SStanislav Sedov const unsigned char case_flags[], 1274*ae771770SStanislav Sedov punycode_uint *output_length, 1275*ae771770SStanislav Sedov char output[] ); 1276*ae771770SStanislav Sedov 1277*ae771770SStanislav Sedov /* punycode_encode() converts Unicode to Punycode. The input */ 1278*ae771770SStanislav Sedov /* is represented as an array of Unicode code points (not code */ 1279*ae771770SStanislav Sedov /* units; surrogate pairs are not allowed), and the output */ 1280*ae771770SStanislav Sedov /* will be represented as an array of ASCII code points. The */ 1281*ae771770SStanislav Sedov /* output string is *not* null-terminated; it will contain */ 1282*ae771770SStanislav Sedov /* zeros if and only if the input contains zeros. (Of course */ 1283*ae771770SStanislav Sedov /* the caller can leave room for a terminator and add one if */ 1284*ae771770SStanislav Sedov /* needed.) The input_length is the number of code points in */ 1285*ae771770SStanislav Sedov /* the input. The output_length is an in/out argument: the */ 1286*ae771770SStanislav Sedov /* caller passes in the maximum number of code points that it */ 1287*ae771770SStanislav Sedov 1288*ae771770SStanislav Sedov 1289*ae771770SStanislav Sedov 1290*ae771770SStanislav SedovCostello Standards Track [Page 23] 1291*ae771770SStanislav Sedov 1292*ae771770SStanislav SedovRFC 3492 IDNA Punycode March 2003 1293*ae771770SStanislav Sedov 1294*ae771770SStanislav Sedov 1295*ae771770SStanislav Sedov /* can receive, and on successful return it will contain the */ 1296*ae771770SStanislav Sedov /* number of code points actually output. The case_flags array */ 1297*ae771770SStanislav Sedov /* holds input_length boolean values, where nonzero suggests that */ 1298*ae771770SStanislav Sedov /* the corresponding Unicode character be forced to uppercase */ 1299*ae771770SStanislav Sedov /* after being decoded (if possible), and zero suggests that */ 1300*ae771770SStanislav Sedov /* it be forced to lowercase (if possible). ASCII code points */ 1301*ae771770SStanislav Sedov /* are encoded literally, except that ASCII letters are forced */ 1302*ae771770SStanislav Sedov /* to uppercase or lowercase according to the corresponding */ 1303*ae771770SStanislav Sedov /* uppercase flags. If case_flags is a null pointer then ASCII */ 1304*ae771770SStanislav Sedov /* letters are left as they are, and other code points are */ 1305*ae771770SStanislav Sedov /* treated as if their uppercase flags were zero. The return */ 1306*ae771770SStanislav Sedov /* value can be any of the punycode_status values defined above */ 1307*ae771770SStanislav Sedov /* except punycode_bad_input; if not punycode_success, then */ 1308*ae771770SStanislav Sedov /* output_size and output might contain garbage. */ 1309*ae771770SStanislav Sedov 1310*ae771770SStanislav Sedovenum punycode_status punycode_decode( 1311*ae771770SStanislav Sedov punycode_uint input_length, 1312*ae771770SStanislav Sedov const char input[], 1313*ae771770SStanislav Sedov punycode_uint *output_length, 1314*ae771770SStanislav Sedov punycode_uint output[], 1315*ae771770SStanislav Sedov unsigned char case_flags[] ); 1316*ae771770SStanislav Sedov 1317*ae771770SStanislav Sedov /* punycode_decode() converts Punycode to Unicode. The input is */ 1318*ae771770SStanislav Sedov /* represented as an array of ASCII code points, and the output */ 1319*ae771770SStanislav Sedov /* will be represented as an array of Unicode code points. The */ 1320*ae771770SStanislav Sedov /* input_length is the number of code points in the input. The */ 1321*ae771770SStanislav Sedov /* output_length is an in/out argument: the caller passes in */ 1322*ae771770SStanislav Sedov /* the maximum number of code points that it can receive, and */ 1323*ae771770SStanislav Sedov /* on successful return it will contain the actual number of */ 1324*ae771770SStanislav Sedov /* code points output. The case_flags array needs room for at */ 1325*ae771770SStanislav Sedov /* least output_length values, or it can be a null pointer if the */ 1326*ae771770SStanislav Sedov /* case information is not needed. A nonzero flag suggests that */ 1327*ae771770SStanislav Sedov /* the corresponding Unicode character be forced to uppercase */ 1328*ae771770SStanislav Sedov /* by the caller (if possible), while zero suggests that it be */ 1329*ae771770SStanislav Sedov /* forced to lowercase (if possible). ASCII code points are */ 1330*ae771770SStanislav Sedov /* output already in the proper case, but their flags will be set */ 1331*ae771770SStanislav Sedov /* appropriately so that applying the flags would be harmless. */ 1332*ae771770SStanislav Sedov /* The return value can be any of the punycode_status values */ 1333*ae771770SStanislav Sedov /* defined above; if not punycode_success, then output_length, */ 1334*ae771770SStanislav Sedov /* output, and case_flags might contain garbage. On success, the */ 1335*ae771770SStanislav Sedov /* decoder will never need to write an output_length greater than */ 1336*ae771770SStanislav Sedov /* input_length, because of how the encoding is defined. */ 1337*ae771770SStanislav Sedov 1338*ae771770SStanislav Sedov/**********************************************************/ 1339*ae771770SStanislav Sedov/* Implementation (would normally go in its own .c file): */ 1340*ae771770SStanislav Sedov 1341*ae771770SStanislav Sedov#include <string.h> 1342*ae771770SStanislav Sedov 1343*ae771770SStanislav Sedov 1344*ae771770SStanislav Sedov 1345*ae771770SStanislav Sedov 1346*ae771770SStanislav SedovCostello Standards Track [Page 24] 1347*ae771770SStanislav Sedov 1348*ae771770SStanislav SedovRFC 3492 IDNA Punycode March 2003 1349*ae771770SStanislav Sedov 1350*ae771770SStanislav Sedov 1351*ae771770SStanislav Sedov/*** Bootstring parameters for Punycode ***/ 1352*ae771770SStanislav Sedov 1353*ae771770SStanislav Sedovenum { base = 36, tmin = 1, tmax = 26, skew = 38, damp = 700, 1354*ae771770SStanislav Sedov initial_bias = 72, initial_n = 0x80, delimiter = 0x2D }; 1355*ae771770SStanislav Sedov 1356*ae771770SStanislav Sedov/* basic(cp) tests whether cp is a basic code point: */ 1357*ae771770SStanislav Sedov#define basic(cp) ((punycode_uint)(cp) < 0x80) 1358*ae771770SStanislav Sedov 1359*ae771770SStanislav Sedov/* delim(cp) tests whether cp is a delimiter: */ 1360*ae771770SStanislav Sedov#define delim(cp) ((cp) == delimiter) 1361*ae771770SStanislav Sedov 1362*ae771770SStanislav Sedov/* decode_digit(cp) returns the numeric value of a basic code */ 1363*ae771770SStanislav Sedov/* point (for use in representing integers) in the range 0 to */ 1364*ae771770SStanislav Sedov/* base-1, or base if cp is does not represent a value. */ 1365*ae771770SStanislav Sedov 1366*ae771770SStanislav Sedovstatic punycode_uint decode_digit(punycode_uint cp) 1367*ae771770SStanislav Sedov{ 1368*ae771770SStanislav Sedov return cp - 48 < 10 ? cp - 22 : cp - 65 < 26 ? cp - 65 : 1369*ae771770SStanislav Sedov cp - 97 < 26 ? cp - 97 : base; 1370*ae771770SStanislav Sedov} 1371*ae771770SStanislav Sedov 1372*ae771770SStanislav Sedov/* encode_digit(d,flag) returns the basic code point whose value */ 1373*ae771770SStanislav Sedov/* (when used for representing integers) is d, which needs to be in */ 1374*ae771770SStanislav Sedov/* the range 0 to base-1. The lowercase form is used unless flag is */ 1375*ae771770SStanislav Sedov/* nonzero, in which case the uppercase form is used. The behavior */ 1376*ae771770SStanislav Sedov/* is undefined if flag is nonzero and digit d has no uppercase form. */ 1377*ae771770SStanislav Sedov 1378*ae771770SStanislav Sedovstatic char encode_digit(punycode_uint d, int flag) 1379*ae771770SStanislav Sedov{ 1380*ae771770SStanislav Sedov return d + 22 + 75 * (d < 26) - ((flag != 0) << 5); 1381*ae771770SStanislav Sedov /* 0..25 map to ASCII a..z or A..Z */ 1382*ae771770SStanislav Sedov /* 26..35 map to ASCII 0..9 */ 1383*ae771770SStanislav Sedov} 1384*ae771770SStanislav Sedov 1385*ae771770SStanislav Sedov/* flagged(bcp) tests whether a basic code point is flagged */ 1386*ae771770SStanislav Sedov/* (uppercase). The behavior is undefined if bcp is not a */ 1387*ae771770SStanislav Sedov/* basic code point. */ 1388*ae771770SStanislav Sedov 1389*ae771770SStanislav Sedov#define flagged(bcp) ((punycode_uint)(bcp) - 65 < 26) 1390*ae771770SStanislav Sedov 1391*ae771770SStanislav Sedov/* encode_basic(bcp,flag) forces a basic code point to lowercase */ 1392*ae771770SStanislav Sedov/* if flag is zero, uppercase if flag is nonzero, and returns */ 1393*ae771770SStanislav Sedov/* the resulting code point. The code point is unchanged if it */ 1394*ae771770SStanislav Sedov/* is caseless. The behavior is undefined if bcp is not a basic */ 1395*ae771770SStanislav Sedov/* code point. */ 1396*ae771770SStanislav Sedov 1397*ae771770SStanislav Sedovstatic char encode_basic(punycode_uint bcp, int flag) 1398*ae771770SStanislav Sedov{ 1399*ae771770SStanislav Sedov 1400*ae771770SStanislav Sedov 1401*ae771770SStanislav Sedov 1402*ae771770SStanislav SedovCostello Standards Track [Page 25] 1403*ae771770SStanislav Sedov 1404*ae771770SStanislav SedovRFC 3492 IDNA Punycode March 2003 1405*ae771770SStanislav Sedov 1406*ae771770SStanislav Sedov 1407*ae771770SStanislav Sedov bcp -= (bcp - 97 < 26) << 5; 1408*ae771770SStanislav Sedov return bcp + ((!flag && (bcp - 65 < 26)) << 5); 1409*ae771770SStanislav Sedov} 1410*ae771770SStanislav Sedov 1411*ae771770SStanislav Sedov/*** Platform-specific constants ***/ 1412*ae771770SStanislav Sedov 1413*ae771770SStanislav Sedov/* maxint is the maximum value of a punycode_uint variable: */ 1414*ae771770SStanislav Sedovstatic const punycode_uint maxint = -1; 1415*ae771770SStanislav Sedov/* Because maxint is unsigned, -1 becomes the maximum value. */ 1416*ae771770SStanislav Sedov 1417*ae771770SStanislav Sedov/*** Bias adaptation function ***/ 1418*ae771770SStanislav Sedov 1419*ae771770SStanislav Sedovstatic punycode_uint adapt( 1420*ae771770SStanislav Sedov punycode_uint delta, punycode_uint numpoints, int firsttime ) 1421*ae771770SStanislav Sedov{ 1422*ae771770SStanislav Sedov punycode_uint k; 1423*ae771770SStanislav Sedov 1424*ae771770SStanislav Sedov delta = firsttime ? delta / damp : delta >> 1; 1425*ae771770SStanislav Sedov /* delta >> 1 is a faster way of doing delta / 2 */ 1426*ae771770SStanislav Sedov delta += delta / numpoints; 1427*ae771770SStanislav Sedov 1428*ae771770SStanislav Sedov for (k = 0; delta > ((base - tmin) * tmax) / 2; k += base) { 1429*ae771770SStanislav Sedov delta /= base - tmin; 1430*ae771770SStanislav Sedov } 1431*ae771770SStanislav Sedov 1432*ae771770SStanislav Sedov return k + (base - tmin + 1) * delta / (delta + skew); 1433*ae771770SStanislav Sedov} 1434*ae771770SStanislav Sedov 1435*ae771770SStanislav Sedov/*** Main encode function ***/ 1436*ae771770SStanislav Sedov 1437*ae771770SStanislav Sedovenum punycode_status punycode_encode( 1438*ae771770SStanislav Sedov punycode_uint input_length, 1439*ae771770SStanislav Sedov const punycode_uint input[], 1440*ae771770SStanislav Sedov const unsigned char case_flags[], 1441*ae771770SStanislav Sedov punycode_uint *output_length, 1442*ae771770SStanislav Sedov char output[] ) 1443*ae771770SStanislav Sedov{ 1444*ae771770SStanislav Sedov punycode_uint n, delta, h, b, out, max_out, bias, j, m, q, k, t; 1445*ae771770SStanislav Sedov 1446*ae771770SStanislav Sedov /* Initialize the state: */ 1447*ae771770SStanislav Sedov 1448*ae771770SStanislav Sedov n = initial_n; 1449*ae771770SStanislav Sedov delta = out = 0; 1450*ae771770SStanislav Sedov max_out = *output_length; 1451*ae771770SStanislav Sedov bias = initial_bias; 1452*ae771770SStanislav Sedov 1453*ae771770SStanislav Sedov /* Handle the basic code points: */ 1454*ae771770SStanislav Sedov 1455*ae771770SStanislav Sedov 1456*ae771770SStanislav Sedov 1457*ae771770SStanislav Sedov 1458*ae771770SStanislav SedovCostello Standards Track [Page 26] 1459*ae771770SStanislav Sedov 1460*ae771770SStanislav SedovRFC 3492 IDNA Punycode March 2003 1461*ae771770SStanislav Sedov 1462*ae771770SStanislav Sedov 1463*ae771770SStanislav Sedov for (j = 0; j < input_length; ++j) { 1464*ae771770SStanislav Sedov if (basic(input[j])) { 1465*ae771770SStanislav Sedov if (max_out - out < 2) return punycode_big_output; 1466*ae771770SStanislav Sedov output[out++] = 1467*ae771770SStanislav Sedov case_flags ? encode_basic(input[j], case_flags[j]) : input[j]; 1468*ae771770SStanislav Sedov } 1469*ae771770SStanislav Sedov /* else if (input[j] < n) return punycode_bad_input; */ 1470*ae771770SStanislav Sedov /* (not needed for Punycode with unsigned code points) */ 1471*ae771770SStanislav Sedov } 1472*ae771770SStanislav Sedov 1473*ae771770SStanislav Sedov h = b = out; 1474*ae771770SStanislav Sedov 1475*ae771770SStanislav Sedov /* h is the number of code points that have been handled, b is the */ 1476*ae771770SStanislav Sedov /* number of basic code points, and out is the number of characters */ 1477*ae771770SStanislav Sedov /* that have been output. */ 1478*ae771770SStanislav Sedov 1479*ae771770SStanislav Sedov if (b > 0) output[out++] = delimiter; 1480*ae771770SStanislav Sedov 1481*ae771770SStanislav Sedov /* Main encoding loop: */ 1482*ae771770SStanislav Sedov 1483*ae771770SStanislav Sedov while (h < input_length) { 1484*ae771770SStanislav Sedov /* All non-basic code points < n have been */ 1485*ae771770SStanislav Sedov /* handled already. Find the next larger one: */ 1486*ae771770SStanislav Sedov 1487*ae771770SStanislav Sedov for (m = maxint, j = 0; j < input_length; ++j) { 1488*ae771770SStanislav Sedov /* if (basic(input[j])) continue; */ 1489*ae771770SStanislav Sedov /* (not needed for Punycode) */ 1490*ae771770SStanislav Sedov if (input[j] >= n && input[j] < m) m = input[j]; 1491*ae771770SStanislav Sedov } 1492*ae771770SStanislav Sedov 1493*ae771770SStanislav Sedov /* Increase delta enough to advance the decoder's */ 1494*ae771770SStanislav Sedov /* <n,i> state to <m,0>, but guard against overflow: */ 1495*ae771770SStanislav Sedov 1496*ae771770SStanislav Sedov if (m - n > (maxint - delta) / (h + 1)) return punycode_overflow; 1497*ae771770SStanislav Sedov delta += (m - n) * (h + 1); 1498*ae771770SStanislav Sedov n = m; 1499*ae771770SStanislav Sedov 1500*ae771770SStanislav Sedov for (j = 0; j < input_length; ++j) { 1501*ae771770SStanislav Sedov /* Punycode does not need to check whether input[j] is basic: */ 1502*ae771770SStanislav Sedov if (input[j] < n /* || basic(input[j]) */ ) { 1503*ae771770SStanislav Sedov if (++delta == 0) return punycode_overflow; 1504*ae771770SStanislav Sedov } 1505*ae771770SStanislav Sedov 1506*ae771770SStanislav Sedov if (input[j] == n) { 1507*ae771770SStanislav Sedov /* Represent delta as a generalized variable-length integer: */ 1508*ae771770SStanislav Sedov 1509*ae771770SStanislav Sedov for (q = delta, k = base; ; k += base) { 1510*ae771770SStanislav Sedov if (out >= max_out) return punycode_big_output; 1511*ae771770SStanislav Sedov 1512*ae771770SStanislav Sedov 1513*ae771770SStanislav Sedov 1514*ae771770SStanislav SedovCostello Standards Track [Page 27] 1515*ae771770SStanislav Sedov 1516*ae771770SStanislav SedovRFC 3492 IDNA Punycode March 2003 1517*ae771770SStanislav Sedov 1518*ae771770SStanislav Sedov 1519*ae771770SStanislav Sedov t = k <= bias /* + tmin */ ? tmin : /* +tmin not needed */ 1520*ae771770SStanislav Sedov k >= bias + tmax ? tmax : k - bias; 1521*ae771770SStanislav Sedov if (q < t) break; 1522*ae771770SStanislav Sedov output[out++] = encode_digit(t + (q - t) % (base - t), 0); 1523*ae771770SStanislav Sedov q = (q - t) / (base - t); 1524*ae771770SStanislav Sedov } 1525*ae771770SStanislav Sedov 1526*ae771770SStanislav Sedov output[out++] = encode_digit(q, case_flags && case_flags[j]); 1527*ae771770SStanislav Sedov bias = adapt(delta, h + 1, h == b); 1528*ae771770SStanislav Sedov delta = 0; 1529*ae771770SStanislav Sedov ++h; 1530*ae771770SStanislav Sedov } 1531*ae771770SStanislav Sedov } 1532*ae771770SStanislav Sedov 1533*ae771770SStanislav Sedov ++delta, ++n; 1534*ae771770SStanislav Sedov } 1535*ae771770SStanislav Sedov 1536*ae771770SStanislav Sedov *output_length = out; 1537*ae771770SStanislav Sedov return punycode_success; 1538*ae771770SStanislav Sedov} 1539*ae771770SStanislav Sedov 1540*ae771770SStanislav Sedov/*** Main decode function ***/ 1541*ae771770SStanislav Sedov 1542*ae771770SStanislav Sedovenum punycode_status punycode_decode( 1543*ae771770SStanislav Sedov punycode_uint input_length, 1544*ae771770SStanislav Sedov const char input[], 1545*ae771770SStanislav Sedov punycode_uint *output_length, 1546*ae771770SStanislav Sedov punycode_uint output[], 1547*ae771770SStanislav Sedov unsigned char case_flags[] ) 1548*ae771770SStanislav Sedov{ 1549*ae771770SStanislav Sedov punycode_uint n, out, i, max_out, bias, 1550*ae771770SStanislav Sedov b, j, in, oldi, w, k, digit, t; 1551*ae771770SStanislav Sedov 1552*ae771770SStanislav Sedov /* Initialize the state: */ 1553*ae771770SStanislav Sedov 1554*ae771770SStanislav Sedov n = initial_n; 1555*ae771770SStanislav Sedov out = i = 0; 1556*ae771770SStanislav Sedov max_out = *output_length; 1557*ae771770SStanislav Sedov bias = initial_bias; 1558*ae771770SStanislav Sedov 1559*ae771770SStanislav Sedov /* Handle the basic code points: Let b be the number of input code */ 1560*ae771770SStanislav Sedov /* points before the last delimiter, or 0 if there is none, then */ 1561*ae771770SStanislav Sedov /* copy the first b code points to the output. */ 1562*ae771770SStanislav Sedov 1563*ae771770SStanislav Sedov for (b = j = 0; j < input_length; ++j) if (delim(input[j])) b = j; 1564*ae771770SStanislav Sedov if (b > max_out) return punycode_big_output; 1565*ae771770SStanislav Sedov 1566*ae771770SStanislav Sedov for (j = 0; j < b; ++j) { 1567*ae771770SStanislav Sedov 1568*ae771770SStanislav Sedov 1569*ae771770SStanislav Sedov 1570*ae771770SStanislav SedovCostello Standards Track [Page 28] 1571*ae771770SStanislav Sedov 1572*ae771770SStanislav SedovRFC 3492 IDNA Punycode March 2003 1573*ae771770SStanislav Sedov 1574*ae771770SStanislav Sedov 1575*ae771770SStanislav Sedov if (case_flags) case_flags[out] = flagged(input[j]); 1576*ae771770SStanislav Sedov if (!basic(input[j])) return punycode_bad_input; 1577*ae771770SStanislav Sedov output[out++] = input[j]; 1578*ae771770SStanislav Sedov } 1579*ae771770SStanislav Sedov 1580*ae771770SStanislav Sedov /* Main decoding loop: Start just after the last delimiter if any */ 1581*ae771770SStanislav Sedov /* basic code points were copied; start at the beginning otherwise. */ 1582*ae771770SStanislav Sedov 1583*ae771770SStanislav Sedov for (in = b > 0 ? b + 1 : 0; in < input_length; ++out) { 1584*ae771770SStanislav Sedov 1585*ae771770SStanislav Sedov /* in is the index of the next character to be consumed, and */ 1586*ae771770SStanislav Sedov /* out is the number of code points in the output array. */ 1587*ae771770SStanislav Sedov 1588*ae771770SStanislav Sedov /* Decode a generalized variable-length integer into delta, */ 1589*ae771770SStanislav Sedov /* which gets added to i. The overflow checking is easier */ 1590*ae771770SStanislav Sedov /* if we increase i as we go, then subtract off its starting */ 1591*ae771770SStanislav Sedov /* value at the end to obtain delta. */ 1592*ae771770SStanislav Sedov 1593*ae771770SStanislav Sedov for (oldi = i, w = 1, k = base; ; k += base) { 1594*ae771770SStanislav Sedov if (in >= input_length) return punycode_bad_input; 1595*ae771770SStanislav Sedov digit = decode_digit(input[in++]); 1596*ae771770SStanislav Sedov if (digit >= base) return punycode_bad_input; 1597*ae771770SStanislav Sedov if (digit > (maxint - i) / w) return punycode_overflow; 1598*ae771770SStanislav Sedov i += digit * w; 1599*ae771770SStanislav Sedov t = k <= bias /* + tmin */ ? tmin : /* +tmin not needed */ 1600*ae771770SStanislav Sedov k >= bias + tmax ? tmax : k - bias; 1601*ae771770SStanislav Sedov if (digit < t) break; 1602*ae771770SStanislav Sedov if (w > maxint / (base - t)) return punycode_overflow; 1603*ae771770SStanislav Sedov w *= (base - t); 1604*ae771770SStanislav Sedov } 1605*ae771770SStanislav Sedov 1606*ae771770SStanislav Sedov bias = adapt(i - oldi, out + 1, oldi == 0); 1607*ae771770SStanislav Sedov 1608*ae771770SStanislav Sedov /* i was supposed to wrap around from out+1 to 0, */ 1609*ae771770SStanislav Sedov /* incrementing n each time, so we'll fix that now: */ 1610*ae771770SStanislav Sedov 1611*ae771770SStanislav Sedov if (i / (out + 1) > maxint - n) return punycode_overflow; 1612*ae771770SStanislav Sedov n += i / (out + 1); 1613*ae771770SStanislav Sedov i %= (out + 1); 1614*ae771770SStanislav Sedov 1615*ae771770SStanislav Sedov /* Insert n at position i of the output: */ 1616*ae771770SStanislav Sedov 1617*ae771770SStanislav Sedov /* not needed for Punycode: */ 1618*ae771770SStanislav Sedov /* if (decode_digit(n) <= base) return punycode_invalid_input; */ 1619*ae771770SStanislav Sedov if (out >= max_out) return punycode_big_output; 1620*ae771770SStanislav Sedov 1621*ae771770SStanislav Sedov if (case_flags) { 1622*ae771770SStanislav Sedov memmove(case_flags + i + 1, case_flags + i, out - i); 1623*ae771770SStanislav Sedov 1624*ae771770SStanislav Sedov 1625*ae771770SStanislav Sedov 1626*ae771770SStanislav SedovCostello Standards Track [Page 29] 1627*ae771770SStanislav Sedov 1628*ae771770SStanislav SedovRFC 3492 IDNA Punycode March 2003 1629*ae771770SStanislav Sedov 1630*ae771770SStanislav Sedov 1631*ae771770SStanislav Sedov /* Case of last character determines uppercase flag: */ 1632*ae771770SStanislav Sedov case_flags[i] = flagged(input[in - 1]); 1633*ae771770SStanislav Sedov } 1634*ae771770SStanislav Sedov 1635*ae771770SStanislav Sedov memmove(output + i + 1, output + i, (out - i) * sizeof *output); 1636*ae771770SStanislav Sedov output[i++] = n; 1637*ae771770SStanislav Sedov } 1638*ae771770SStanislav Sedov 1639*ae771770SStanislav Sedov *output_length = out; 1640*ae771770SStanislav Sedov return punycode_success; 1641*ae771770SStanislav Sedov} 1642*ae771770SStanislav Sedov 1643*ae771770SStanislav Sedov/******************************************************************/ 1644*ae771770SStanislav Sedov/* Wrapper for testing (would normally go in a separate .c file): */ 1645*ae771770SStanislav Sedov 1646*ae771770SStanislav Sedov#include <assert.h> 1647*ae771770SStanislav Sedov#include <stdio.h> 1648*ae771770SStanislav Sedov#include <stdlib.h> 1649*ae771770SStanislav Sedov#include <string.h> 1650*ae771770SStanislav Sedov 1651*ae771770SStanislav Sedov/* For testing, we'll just set some compile-time limits rather than */ 1652*ae771770SStanislav Sedov/* use malloc(), and set a compile-time option rather than using a */ 1653*ae771770SStanislav Sedov/* command-line option. */ 1654*ae771770SStanislav Sedov 1655*ae771770SStanislav Sedovenum { 1656*ae771770SStanislav Sedov unicode_max_length = 256, 1657*ae771770SStanislav Sedov ace_max_length = 256 1658*ae771770SStanislav Sedov}; 1659*ae771770SStanislav Sedov 1660*ae771770SStanislav Sedovstatic void usage(char **argv) 1661*ae771770SStanislav Sedov{ 1662*ae771770SStanislav Sedov fprintf(stderr, 1663*ae771770SStanislav Sedov "\n" 1664*ae771770SStanislav Sedov "%s -e reads code points and writes a Punycode string.\n" 1665*ae771770SStanislav Sedov "%s -d reads a Punycode string and writes code points.\n" 1666*ae771770SStanislav Sedov "\n" 1667*ae771770SStanislav Sedov "Input and output are plain text in the native character set.\n" 1668*ae771770SStanislav Sedov "Code points are in the form u+hex separated by whitespace.\n" 1669*ae771770SStanislav Sedov "Although the specification allows Punycode strings to contain\n" 1670*ae771770SStanislav Sedov "any characters from the ASCII repertoire, this test code\n" 1671*ae771770SStanislav Sedov "supports only the printable characters, and needs the Punycode\n" 1672*ae771770SStanislav Sedov "string to be followed by a newline.\n" 1673*ae771770SStanislav Sedov "The case of the u in u+hex is the force-to-uppercase flag.\n" 1674*ae771770SStanislav Sedov , argv[0], argv[0]); 1675*ae771770SStanislav Sedov exit(EXIT_FAILURE); 1676*ae771770SStanislav Sedov} 1677*ae771770SStanislav Sedov 1678*ae771770SStanislav Sedovstatic void fail(const char *msg) 1679*ae771770SStanislav Sedov 1680*ae771770SStanislav Sedov 1681*ae771770SStanislav Sedov 1682*ae771770SStanislav SedovCostello Standards Track [Page 30] 1683*ae771770SStanislav Sedov 1684*ae771770SStanislav SedovRFC 3492 IDNA Punycode March 2003 1685*ae771770SStanislav Sedov 1686*ae771770SStanislav Sedov 1687*ae771770SStanislav Sedov{ 1688*ae771770SStanislav Sedov fputs(msg,stderr); 1689*ae771770SStanislav Sedov exit(EXIT_FAILURE); 1690*ae771770SStanislav Sedov} 1691*ae771770SStanislav Sedov 1692*ae771770SStanislav Sedovstatic const char too_big[] = 1693*ae771770SStanislav Sedov "input or output is too large, recompile with larger limits\n"; 1694*ae771770SStanislav Sedovstatic const char invalid_input[] = "invalid input\n"; 1695*ae771770SStanislav Sedovstatic const char overflow[] = "arithmetic overflow\n"; 1696*ae771770SStanislav Sedovstatic const char io_error[] = "I/O error\n"; 1697*ae771770SStanislav Sedov 1698*ae771770SStanislav Sedov/* The following string is used to convert printable */ 1699*ae771770SStanislav Sedov/* characters between ASCII and the native charset: */ 1700*ae771770SStanislav Sedov 1701*ae771770SStanislav Sedovstatic const char print_ascii[] = 1702*ae771770SStanislav Sedov "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n" 1703*ae771770SStanislav Sedov "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n" 1704*ae771770SStanislav Sedov " !\"#$%&'()*+,-./" 1705*ae771770SStanislav Sedov "0123456789:;<=>?" 1706*ae771770SStanislav Sedov "@ABCDEFGHIJKLMNO" 1707*ae771770SStanislav Sedov "PQRSTUVWXYZ[\\]^_" 1708*ae771770SStanislav Sedov "`abcdefghijklmno" 1709*ae771770SStanislav Sedov "pqrstuvwxyz{|}~\n"; 1710*ae771770SStanislav Sedov 1711*ae771770SStanislav Sedovint main(int argc, char **argv) 1712*ae771770SStanislav Sedov{ 1713*ae771770SStanislav Sedov enum punycode_status status; 1714*ae771770SStanislav Sedov int r; 1715*ae771770SStanislav Sedov unsigned int input_length, output_length, j; 1716*ae771770SStanislav Sedov unsigned char case_flags[unicode_max_length]; 1717*ae771770SStanislav Sedov 1718*ae771770SStanislav Sedov if (argc != 2) usage(argv); 1719*ae771770SStanislav Sedov if (argv[1][0] != '-') usage(argv); 1720*ae771770SStanislav Sedov if (argv[1][2] != 0) usage(argv); 1721*ae771770SStanislav Sedov 1722*ae771770SStanislav Sedov if (argv[1][1] == 'e') { 1723*ae771770SStanislav Sedov punycode_uint input[unicode_max_length]; 1724*ae771770SStanislav Sedov unsigned long codept; 1725*ae771770SStanislav Sedov char output[ace_max_length+1], uplus[3]; 1726*ae771770SStanislav Sedov int c; 1727*ae771770SStanislav Sedov 1728*ae771770SStanislav Sedov /* Read the input code points: */ 1729*ae771770SStanislav Sedov 1730*ae771770SStanislav Sedov input_length = 0; 1731*ae771770SStanislav Sedov 1732*ae771770SStanislav Sedov for (;;) { 1733*ae771770SStanislav Sedov r = scanf("%2s%lx", uplus, &codept); 1734*ae771770SStanislav Sedov if (ferror(stdin)) fail(io_error); 1735*ae771770SStanislav Sedov 1736*ae771770SStanislav Sedov 1737*ae771770SStanislav Sedov 1738*ae771770SStanislav SedovCostello Standards Track [Page 31] 1739*ae771770SStanislav Sedov 1740*ae771770SStanislav SedovRFC 3492 IDNA Punycode March 2003 1741*ae771770SStanislav Sedov 1742*ae771770SStanislav Sedov 1743*ae771770SStanislav Sedov if (r == EOF || r == 0) break; 1744*ae771770SStanislav Sedov 1745*ae771770SStanislav Sedov if (r != 2 || uplus[1] != '+' || codept > (punycode_uint)-1) { 1746*ae771770SStanislav Sedov fail(invalid_input); 1747*ae771770SStanislav Sedov } 1748*ae771770SStanislav Sedov 1749*ae771770SStanislav Sedov if (input_length == unicode_max_length) fail(too_big); 1750*ae771770SStanislav Sedov 1751*ae771770SStanislav Sedov if (uplus[0] == 'u') case_flags[input_length] = 0; 1752*ae771770SStanislav Sedov else if (uplus[0] == 'U') case_flags[input_length] = 1; 1753*ae771770SStanislav Sedov else fail(invalid_input); 1754*ae771770SStanislav Sedov 1755*ae771770SStanislav Sedov input[input_length++] = codept; 1756*ae771770SStanislav Sedov } 1757*ae771770SStanislav Sedov 1758*ae771770SStanislav Sedov /* Encode: */ 1759*ae771770SStanislav Sedov 1760*ae771770SStanislav Sedov output_length = ace_max_length; 1761*ae771770SStanislav Sedov status = punycode_encode(input_length, input, case_flags, 1762*ae771770SStanislav Sedov &output_length, output); 1763*ae771770SStanislav Sedov if (status == punycode_bad_input) fail(invalid_input); 1764*ae771770SStanislav Sedov if (status == punycode_big_output) fail(too_big); 1765*ae771770SStanislav Sedov if (status == punycode_overflow) fail(overflow); 1766*ae771770SStanislav Sedov assert(status == punycode_success); 1767*ae771770SStanislav Sedov 1768*ae771770SStanislav Sedov /* Convert to native charset and output: */ 1769*ae771770SStanislav Sedov 1770*ae771770SStanislav Sedov for (j = 0; j < output_length; ++j) { 1771*ae771770SStanislav Sedov c = output[j]; 1772*ae771770SStanislav Sedov assert(c >= 0 && c <= 127); 1773*ae771770SStanislav Sedov if (print_ascii[c] == 0) fail(invalid_input); 1774*ae771770SStanislav Sedov output[j] = print_ascii[c]; 1775*ae771770SStanislav Sedov } 1776*ae771770SStanislav Sedov 1777*ae771770SStanislav Sedov output[j] = 0; 1778*ae771770SStanislav Sedov r = puts(output); 1779*ae771770SStanislav Sedov if (r == EOF) fail(io_error); 1780*ae771770SStanislav Sedov return EXIT_SUCCESS; 1781*ae771770SStanislav Sedov } 1782*ae771770SStanislav Sedov 1783*ae771770SStanislav Sedov if (argv[1][1] == 'd') { 1784*ae771770SStanislav Sedov char input[ace_max_length+2], *p, *pp; 1785*ae771770SStanislav Sedov punycode_uint output[unicode_max_length]; 1786*ae771770SStanislav Sedov 1787*ae771770SStanislav Sedov /* Read the Punycode input string and convert to ASCII: */ 1788*ae771770SStanislav Sedov 1789*ae771770SStanislav Sedov fgets(input, ace_max_length+2, stdin); 1790*ae771770SStanislav Sedov if (ferror(stdin)) fail(io_error); 1791*ae771770SStanislav Sedov 1792*ae771770SStanislav Sedov 1793*ae771770SStanislav Sedov 1794*ae771770SStanislav SedovCostello Standards Track [Page 32] 1795*ae771770SStanislav Sedov 1796*ae771770SStanislav SedovRFC 3492 IDNA Punycode March 2003 1797*ae771770SStanislav Sedov 1798*ae771770SStanislav Sedov 1799*ae771770SStanislav Sedov if (feof(stdin)) fail(invalid_input); 1800*ae771770SStanislav Sedov input_length = strlen(input) - 1; 1801*ae771770SStanislav Sedov if (input[input_length] != '\n') fail(too_big); 1802*ae771770SStanislav Sedov input[input_length] = 0; 1803*ae771770SStanislav Sedov 1804*ae771770SStanislav Sedov for (p = input; *p != 0; ++p) { 1805*ae771770SStanislav Sedov pp = strchr(print_ascii, *p); 1806*ae771770SStanislav Sedov if (pp == 0) fail(invalid_input); 1807*ae771770SStanislav Sedov *p = pp - print_ascii; 1808*ae771770SStanislav Sedov } 1809*ae771770SStanislav Sedov 1810*ae771770SStanislav Sedov /* Decode: */ 1811*ae771770SStanislav Sedov 1812*ae771770SStanislav Sedov output_length = unicode_max_length; 1813*ae771770SStanislav Sedov status = punycode_decode(input_length, input, &output_length, 1814*ae771770SStanislav Sedov output, case_flags); 1815*ae771770SStanislav Sedov if (status == punycode_bad_input) fail(invalid_input); 1816*ae771770SStanislav Sedov if (status == punycode_big_output) fail(too_big); 1817*ae771770SStanislav Sedov if (status == punycode_overflow) fail(overflow); 1818*ae771770SStanislav Sedov assert(status == punycode_success); 1819*ae771770SStanislav Sedov 1820*ae771770SStanislav Sedov /* Output the result: */ 1821*ae771770SStanislav Sedov 1822*ae771770SStanislav Sedov for (j = 0; j < output_length; ++j) { 1823*ae771770SStanislav Sedov r = printf("%s+%04lX\n", 1824*ae771770SStanislav Sedov case_flags[j] ? "U" : "u", 1825*ae771770SStanislav Sedov (unsigned long) output[j] ); 1826*ae771770SStanislav Sedov if (r < 0) fail(io_error); 1827*ae771770SStanislav Sedov } 1828*ae771770SStanislav Sedov 1829*ae771770SStanislav Sedov return EXIT_SUCCESS; 1830*ae771770SStanislav Sedov } 1831*ae771770SStanislav Sedov 1832*ae771770SStanislav Sedov usage(argv); 1833*ae771770SStanislav Sedov return EXIT_SUCCESS; /* not reached, but quiets compiler warning */ 1834*ae771770SStanislav Sedov} 1835*ae771770SStanislav Sedov 1836*ae771770SStanislav Sedov 1837*ae771770SStanislav Sedov 1838*ae771770SStanislav Sedov 1839*ae771770SStanislav Sedov 1840*ae771770SStanislav Sedov 1841*ae771770SStanislav Sedov 1842*ae771770SStanislav Sedov 1843*ae771770SStanislav Sedov 1844*ae771770SStanislav Sedov 1845*ae771770SStanislav Sedov 1846*ae771770SStanislav Sedov 1847*ae771770SStanislav Sedov 1848*ae771770SStanislav Sedov 1849*ae771770SStanislav Sedov 1850*ae771770SStanislav SedovCostello Standards Track [Page 33] 1851*ae771770SStanislav Sedov 1852*ae771770SStanislav SedovRFC 3492 IDNA Punycode March 2003 1853*ae771770SStanislav Sedov 1854*ae771770SStanislav Sedov 1855*ae771770SStanislav SedovAuthor's Address 1856*ae771770SStanislav Sedov 1857*ae771770SStanislav Sedov Adam M. Costello 1858*ae771770SStanislav Sedov University of California, Berkeley 1859*ae771770SStanislav Sedov http://www.nicemice.net/amc/ 1860*ae771770SStanislav Sedov 1861*ae771770SStanislav Sedov 1862*ae771770SStanislav Sedov 1863*ae771770SStanislav Sedov 1864*ae771770SStanislav Sedov 1865*ae771770SStanislav Sedov 1866*ae771770SStanislav Sedov 1867*ae771770SStanislav Sedov 1868*ae771770SStanislav Sedov 1869*ae771770SStanislav Sedov 1870*ae771770SStanislav Sedov 1871*ae771770SStanislav Sedov 1872*ae771770SStanislav Sedov 1873*ae771770SStanislav Sedov 1874*ae771770SStanislav Sedov 1875*ae771770SStanislav Sedov 1876*ae771770SStanislav Sedov 1877*ae771770SStanislav Sedov 1878*ae771770SStanislav Sedov 1879*ae771770SStanislav Sedov 1880*ae771770SStanislav Sedov 1881*ae771770SStanislav Sedov 1882*ae771770SStanislav Sedov 1883*ae771770SStanislav Sedov 1884*ae771770SStanislav Sedov 1885*ae771770SStanislav Sedov 1886*ae771770SStanislav Sedov 1887*ae771770SStanislav Sedov 1888*ae771770SStanislav Sedov 1889*ae771770SStanislav Sedov 1890*ae771770SStanislav Sedov 1891*ae771770SStanislav Sedov 1892*ae771770SStanislav Sedov 1893*ae771770SStanislav Sedov 1894*ae771770SStanislav Sedov 1895*ae771770SStanislav Sedov 1896*ae771770SStanislav Sedov 1897*ae771770SStanislav Sedov 1898*ae771770SStanislav Sedov 1899*ae771770SStanislav Sedov 1900*ae771770SStanislav Sedov 1901*ae771770SStanislav Sedov 1902*ae771770SStanislav Sedov 1903*ae771770SStanislav Sedov 1904*ae771770SStanislav Sedov 1905*ae771770SStanislav Sedov 1906*ae771770SStanislav SedovCostello Standards Track [Page 34] 1907*ae771770SStanislav Sedov 1908*ae771770SStanislav SedovRFC 3492 IDNA Punycode March 2003 1909*ae771770SStanislav Sedov 1910*ae771770SStanislav Sedov 1911*ae771770SStanislav SedovFull Copyright Statement 1912*ae771770SStanislav Sedov 1913*ae771770SStanislav Sedov Copyright (C) The Internet Society (2003). All Rights Reserved. 1914*ae771770SStanislav Sedov 1915*ae771770SStanislav Sedov This document and translations of it may be copied and furnished to 1916*ae771770SStanislav Sedov others, and derivative works that comment on or otherwise explain it 1917*ae771770SStanislav Sedov or assist in its implementation may be prepared, copied, published 1918*ae771770SStanislav Sedov and distributed, in whole or in part, without restriction of any 1919*ae771770SStanislav Sedov kind, provided that the above copyright notice and this paragraph are 1920*ae771770SStanislav Sedov included on all such copies and derivative works. However, this 1921*ae771770SStanislav Sedov document itself may not be modified in any way, such as by removing 1922*ae771770SStanislav Sedov the copyright notice or references to the Internet Society or other 1923*ae771770SStanislav Sedov Internet organizations, except as needed for the purpose of 1924*ae771770SStanislav Sedov developing Internet standards in which case the procedures for 1925*ae771770SStanislav Sedov copyrights defined in the Internet Standards process must be 1926*ae771770SStanislav Sedov followed, or as required to translate it into languages other than 1927*ae771770SStanislav Sedov English. 1928*ae771770SStanislav Sedov 1929*ae771770SStanislav Sedov The limited permissions granted above are perpetual and will not be 1930*ae771770SStanislav Sedov revoked by the Internet Society or its successors or assigns. 1931*ae771770SStanislav Sedov 1932*ae771770SStanislav Sedov This document and the information contained herein is provided on an 1933*ae771770SStanislav Sedov "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING 1934*ae771770SStanislav Sedov TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING 1935*ae771770SStanislav Sedov BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION 1936*ae771770SStanislav Sedov HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF 1937*ae771770SStanislav Sedov MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. 1938*ae771770SStanislav Sedov 1939*ae771770SStanislav SedovAcknowledgement 1940*ae771770SStanislav Sedov 1941*ae771770SStanislav Sedov Funding for the RFC Editor function is currently provided by the 1942*ae771770SStanislav Sedov Internet Society. 1943*ae771770SStanislav Sedov 1944*ae771770SStanislav Sedov 1945*ae771770SStanislav Sedov 1946*ae771770SStanislav Sedov 1947*ae771770SStanislav Sedov 1948*ae771770SStanislav Sedov 1949*ae771770SStanislav Sedov 1950*ae771770SStanislav Sedov 1951*ae771770SStanislav Sedov 1952*ae771770SStanislav Sedov 1953*ae771770SStanislav Sedov 1954*ae771770SStanislav Sedov 1955*ae771770SStanislav Sedov 1956*ae771770SStanislav Sedov 1957*ae771770SStanislav Sedov 1958*ae771770SStanislav Sedov 1959*ae771770SStanislav Sedov 1960*ae771770SStanislav Sedov 1961*ae771770SStanislav Sedov 1962*ae771770SStanislav SedovCostello Standards Track [Page 35] 1963*ae771770SStanislav Sedov 1964