xref: /freebsd/crypto/heimdal/lib/wind/rfc3492.txt (revision ae771770)
1*ae771770SStanislav Sedov
2*ae771770SStanislav Sedov
3*ae771770SStanislav Sedov
4*ae771770SStanislav Sedov
5*ae771770SStanislav Sedov
6*ae771770SStanislav Sedov
7*ae771770SStanislav SedovNetwork Working Group                                        A. Costello
8*ae771770SStanislav SedovRequest for Comments: 3492                 Univ. of California, Berkeley
9*ae771770SStanislav SedovCategory: Standards Track                                     March 2003
10*ae771770SStanislav Sedov
11*ae771770SStanislav Sedov
12*ae771770SStanislav Sedov              Punycode: A Bootstring encoding of Unicode
13*ae771770SStanislav Sedov       for Internationalized Domain Names in Applications (IDNA)
14*ae771770SStanislav Sedov
15*ae771770SStanislav SedovStatus of this Memo
16*ae771770SStanislav Sedov
17*ae771770SStanislav Sedov   This document specifies an Internet standards track protocol for the
18*ae771770SStanislav Sedov   Internet community, and requests discussion and suggestions for
19*ae771770SStanislav Sedov   improvements.  Please refer to the current edition of the "Internet
20*ae771770SStanislav Sedov   Official Protocol Standards" (STD 1) for the standardization state
21*ae771770SStanislav Sedov   and status of this protocol.  Distribution of this memo is unlimited.
22*ae771770SStanislav Sedov
23*ae771770SStanislav SedovCopyright Notice
24*ae771770SStanislav Sedov
25*ae771770SStanislav Sedov   Copyright (C) The Internet Society (2003).  All Rights Reserved.
26*ae771770SStanislav Sedov
27*ae771770SStanislav SedovAbstract
28*ae771770SStanislav Sedov
29*ae771770SStanislav Sedov   Punycode is a simple and efficient transfer encoding syntax designed
30*ae771770SStanislav Sedov   for use with Internationalized Domain Names in Applications (IDNA).
31*ae771770SStanislav Sedov   It uniquely and reversibly transforms a Unicode string into an ASCII
32*ae771770SStanislav Sedov   string.  ASCII characters in the Unicode string are represented
33*ae771770SStanislav Sedov   literally, and non-ASCII characters are represented by ASCII
34*ae771770SStanislav Sedov   characters that are allowed in host name labels (letters, digits, and
35*ae771770SStanislav Sedov   hyphens).  This document defines a general algorithm called
36*ae771770SStanislav Sedov   Bootstring that allows a string of basic code points to uniquely
37*ae771770SStanislav Sedov   represent any string of code points drawn from a larger set.
38*ae771770SStanislav Sedov   Punycode is an instance of Bootstring that uses particular parameter
39*ae771770SStanislav Sedov   values specified by this document, appropriate for IDNA.
40*ae771770SStanislav Sedov
41*ae771770SStanislav SedovTable of Contents
42*ae771770SStanislav Sedov
43*ae771770SStanislav Sedov   1. Introduction...............................................2
44*ae771770SStanislav Sedov       1.1 Features..............................................2
45*ae771770SStanislav Sedov       1.2 Interaction of protocol parts.........................3
46*ae771770SStanislav Sedov   2. Terminology................................................3
47*ae771770SStanislav Sedov   3. Bootstring description.....................................4
48*ae771770SStanislav Sedov       3.1 Basic code point segregation..........................4
49*ae771770SStanislav Sedov       3.2 Insertion unsort coding...............................4
50*ae771770SStanislav Sedov       3.3 Generalized variable-length integers..................5
51*ae771770SStanislav Sedov       3.4 Bias adaptation.......................................7
52*ae771770SStanislav Sedov   4. Bootstring parameters......................................8
53*ae771770SStanislav Sedov   5. Parameter values for Punycode..............................8
54*ae771770SStanislav Sedov   6. Bootstring algorithms......................................9
55*ae771770SStanislav Sedov
56*ae771770SStanislav Sedov
57*ae771770SStanislav Sedov
58*ae771770SStanislav SedovCostello                    Standards Track                     [Page 1]
59*ae771770SStanislav Sedov
60*ae771770SStanislav SedovRFC 3492                     IDNA Punycode                    March 2003
61*ae771770SStanislav Sedov
62*ae771770SStanislav Sedov
63*ae771770SStanislav Sedov       6.1 Bias adaptation function.............................10
64*ae771770SStanislav Sedov       6.2 Decoding procedure...................................11
65*ae771770SStanislav Sedov       6.3 Encoding procedure...................................12
66*ae771770SStanislav Sedov       6.4 Overflow handling....................................13
67*ae771770SStanislav Sedov   7. Punycode examples.........................................14
68*ae771770SStanislav Sedov       7.1 Sample strings.......................................14
69*ae771770SStanislav Sedov       7.2 Decoding traces......................................17
70*ae771770SStanislav Sedov       7.3 Encoding traces......................................19
71*ae771770SStanislav Sedov   8. Security Considerations...................................20
72*ae771770SStanislav Sedov   9. References................................................21
73*ae771770SStanislav Sedov       9.1 Normative References.................................21
74*ae771770SStanislav Sedov       9.2 Informative References...............................21
75*ae771770SStanislav Sedov   A. Mixed-case annotation.....................................22
76*ae771770SStanislav Sedov   B. Disclaimer and license....................................22
77*ae771770SStanislav Sedov   C. Punycode sample implementation............................23
78*ae771770SStanislav Sedov   Author's Address.............................................34
79*ae771770SStanislav Sedov   Full Copyright Statement.....................................35
80*ae771770SStanislav Sedov
81*ae771770SStanislav Sedov1. Introduction
82*ae771770SStanislav Sedov
83*ae771770SStanislav Sedov   [IDNA] describes an architecture for supporting internationalized
84*ae771770SStanislav Sedov   domain names.  Labels containing non-ASCII characters can be
85*ae771770SStanislav Sedov   represented by ACE labels, which begin with a special ACE prefix and
86*ae771770SStanislav Sedov   contain only ASCII characters.  The remainder of the label after the
87*ae771770SStanislav Sedov   prefix is a Punycode encoding of a Unicode string satisfying certain
88*ae771770SStanislav Sedov   constraints.  For the details of the prefix and constraints, see
89*ae771770SStanislav Sedov   [IDNA] and [NAMEPREP].
90*ae771770SStanislav Sedov
91*ae771770SStanislav Sedov   Punycode is an instance of a more general algorithm called
92*ae771770SStanislav Sedov   Bootstring, which allows strings composed from a small set of "basic"
93*ae771770SStanislav Sedov   code points to uniquely represent any string of code points drawn
94*ae771770SStanislav Sedov   from a larger set.  Punycode is Bootstring with particular parameter
95*ae771770SStanislav Sedov   values appropriate for IDNA.
96*ae771770SStanislav Sedov
97*ae771770SStanislav Sedov1.1 Features
98*ae771770SStanislav Sedov
99*ae771770SStanislav Sedov   Bootstring has been designed to have the following features:
100*ae771770SStanislav Sedov
101*ae771770SStanislav Sedov   *  Completeness:  Every extended string (sequence of arbitrary code
102*ae771770SStanislav Sedov      points) can be represented by a basic string (sequence of basic
103*ae771770SStanislav Sedov      code points).  Restrictions on what strings are allowed, and on
104*ae771770SStanislav Sedov      length, can be imposed by higher layers.
105*ae771770SStanislav Sedov
106*ae771770SStanislav Sedov   *  Uniqueness:  There is at most one basic string that represents a
107*ae771770SStanislav Sedov      given extended string.
108*ae771770SStanislav Sedov
109*ae771770SStanislav Sedov   *  Reversibility:  Any extended string mapped to a basic string can
110*ae771770SStanislav Sedov      be recovered from that basic string.
111*ae771770SStanislav Sedov
112*ae771770SStanislav Sedov
113*ae771770SStanislav Sedov
114*ae771770SStanislav SedovCostello                    Standards Track                     [Page 2]
115*ae771770SStanislav Sedov
116*ae771770SStanislav SedovRFC 3492                     IDNA Punycode                    March 2003
117*ae771770SStanislav Sedov
118*ae771770SStanislav Sedov
119*ae771770SStanislav Sedov   *  Efficient encoding:  The ratio of basic string length to extended
120*ae771770SStanislav Sedov      string length is small.  This is important in the context of
121*ae771770SStanislav Sedov      domain names because RFC 1034 [RFC1034] restricts the length of a
122*ae771770SStanislav Sedov      domain label to 63 characters.
123*ae771770SStanislav Sedov
124*ae771770SStanislav Sedov   *  Simplicity:  The encoding and decoding algorithms are reasonably
125*ae771770SStanislav Sedov      simple to implement.  The goals of efficiency and simplicity are
126*ae771770SStanislav Sedov      at odds; Bootstring aims at a good balance between them.
127*ae771770SStanislav Sedov
128*ae771770SStanislav Sedov   *  Readability:  Basic code points appearing in the extended string
129*ae771770SStanislav Sedov      are represented as themselves in the basic string (although the
130*ae771770SStanislav Sedov      main purpose is to improve efficiency, not readability).
131*ae771770SStanislav Sedov
132*ae771770SStanislav Sedov   Punycode can also support an additional feature that is not used by
133*ae771770SStanislav Sedov   the ToASCII and ToUnicode operations of [IDNA].  When extended
134*ae771770SStanislav Sedov   strings are case-folded prior to encoding, the basic string can use
135*ae771770SStanislav Sedov   mixed case to tell how to convert the folded string into a mixed-case
136*ae771770SStanislav Sedov   string.  See appendix A "Mixed-case annotation".
137*ae771770SStanislav Sedov
138*ae771770SStanislav Sedov1.2 Interaction of protocol parts
139*ae771770SStanislav Sedov
140*ae771770SStanislav Sedov   Punycode is used by the IDNA protocol [IDNA] for converting domain
141*ae771770SStanislav Sedov   labels into ASCII; it is not designed for any other purpose.  It is
142*ae771770SStanislav Sedov   explicitly not designed for processing arbitrary free text.
143*ae771770SStanislav Sedov
144*ae771770SStanislav Sedov2. Terminology
145*ae771770SStanislav Sedov
146*ae771770SStanislav Sedov   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
147*ae771770SStanislav Sedov   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
148*ae771770SStanislav Sedov   document are to be interpreted as described in BCP 14, RFC 2119
149*ae771770SStanislav Sedov   [RFC2119].
150*ae771770SStanislav Sedov
151*ae771770SStanislav Sedov   A code point is an integral value associated with a character in a
152*ae771770SStanislav Sedov   coded character set.
153*ae771770SStanislav Sedov
154*ae771770SStanislav Sedov   As in the Unicode Standard [UNICODE], Unicode code points are denoted
155*ae771770SStanislav Sedov   by "U+" followed by four to six hexadecimal digits, while a range of
156*ae771770SStanislav Sedov   code points is denoted by two hexadecimal numbers separated by "..",
157*ae771770SStanislav Sedov   with no prefixes.
158*ae771770SStanislav Sedov
159*ae771770SStanislav Sedov   The operators div and mod perform integer division; (x div y) is the
160*ae771770SStanislav Sedov   quotient of x divided by y, discarding the remainder, and (x mod y)
161*ae771770SStanislav Sedov   is the remainder, so (x div y) * y + (x mod y) == x.  Bootstring uses
162*ae771770SStanislav Sedov   these operators only with nonnegative operands, so the quotient and
163*ae771770SStanislav Sedov   remainder are always nonnegative.
164*ae771770SStanislav Sedov
165*ae771770SStanislav Sedov   The break statement jumps out of the innermost loop (as in C).
166*ae771770SStanislav Sedov
167*ae771770SStanislav Sedov
168*ae771770SStanislav Sedov
169*ae771770SStanislav Sedov
170*ae771770SStanislav SedovCostello                    Standards Track                     [Page 3]
171*ae771770SStanislav Sedov
172*ae771770SStanislav SedovRFC 3492                     IDNA Punycode                    March 2003
173*ae771770SStanislav Sedov
174*ae771770SStanislav Sedov
175*ae771770SStanislav Sedov   An overflow is an attempt to compute a value that exceeds the maximum
176*ae771770SStanislav Sedov   value of an integer variable.
177*ae771770SStanislav Sedov
178*ae771770SStanislav Sedov3. Bootstring description
179*ae771770SStanislav Sedov
180*ae771770SStanislav Sedov   Bootstring represents an arbitrary sequence of code points (the
181*ae771770SStanislav Sedov   "extended string") as a sequence of basic code points (the "basic
182*ae771770SStanislav Sedov   string").  This section describes the representation.  Section 6
183*ae771770SStanislav Sedov   "Bootstring algorithms" presents the algorithms as pseudocode.
184*ae771770SStanislav Sedov   Sections 7.1 "Decoding traces" and 7.2 "Encoding traces" trace the
185*ae771770SStanislav Sedov   algorithms for sample inputs.
186*ae771770SStanislav Sedov
187*ae771770SStanislav Sedov   The following sections describe the four techniques used in
188*ae771770SStanislav Sedov   Bootstring.  "Basic code point segregation" is a very simple and
189*ae771770SStanislav Sedov   efficient encoding for basic code points occurring in the extended
190*ae771770SStanislav Sedov   string: they are simply copied all at once.  "Insertion unsort
191*ae771770SStanislav Sedov   coding" encodes the non-basic code points as deltas, and processes
192*ae771770SStanislav Sedov   the code points in numerical order rather than in order of
193*ae771770SStanislav Sedov   appearance, which typically results in smaller deltas.  The deltas
194*ae771770SStanislav Sedov   are represented as "generalized variable-length integers", which use
195*ae771770SStanislav Sedov   basic code points to represent nonnegative integers.  The parameters
196*ae771770SStanislav Sedov   of this integer representation are dynamically adjusted using "bias
197*ae771770SStanislav Sedov   adaptation", to improve efficiency when consecutive deltas have
198*ae771770SStanislav Sedov   similar magnitudes.
199*ae771770SStanislav Sedov
200*ae771770SStanislav Sedov3.1 Basic code point segregation
201*ae771770SStanislav Sedov
202*ae771770SStanislav Sedov   All basic code points appearing in the extended string are
203*ae771770SStanislav Sedov   represented literally at the beginning of the basic string, in their
204*ae771770SStanislav Sedov   original order, followed by a delimiter if (and only if) the number
205*ae771770SStanislav Sedov   of basic code points is nonzero.  The delimiter is a particular basic
206*ae771770SStanislav Sedov   code point, which never appears in the remainder of the basic string.
207*ae771770SStanislav Sedov   The decoder can therefore find the end of the literal portion (if
208*ae771770SStanislav Sedov   there is one) by scanning for the last delimiter.
209*ae771770SStanislav Sedov
210*ae771770SStanislav Sedov3.2 Insertion unsort coding
211*ae771770SStanislav Sedov
212*ae771770SStanislav Sedov   The remainder of the basic string (after the last delimiter if there
213*ae771770SStanislav Sedov   is one) represents a sequence of nonnegative integral deltas as
214*ae771770SStanislav Sedov   generalized variable-length integers, described in section 3.3.  The
215*ae771770SStanislav Sedov   meaning of the deltas is best understood in terms of the decoder.
216*ae771770SStanislav Sedov
217*ae771770SStanislav Sedov   The decoder builds the extended string incrementally.  Initially, the
218*ae771770SStanislav Sedov   extended string is a copy of the literal portion of the basic string
219*ae771770SStanislav Sedov   (excluding the last delimiter).  The decoder inserts non-basic code
220*ae771770SStanislav Sedov   points, one for each delta, into the extended string, ultimately
221*ae771770SStanislav Sedov   arriving at the final decoded string.
222*ae771770SStanislav Sedov
223*ae771770SStanislav Sedov
224*ae771770SStanislav Sedov
225*ae771770SStanislav Sedov
226*ae771770SStanislav SedovCostello                    Standards Track                     [Page 4]
227*ae771770SStanislav Sedov
228*ae771770SStanislav SedovRFC 3492                     IDNA Punycode                    March 2003
229*ae771770SStanislav Sedov
230*ae771770SStanislav Sedov
231*ae771770SStanislav Sedov   At the heart of this process is a state machine with two state
232*ae771770SStanislav Sedov   variables: an index i and a counter n.  The index i refers to a
233*ae771770SStanislav Sedov   position in the extended string; it ranges from 0 (the first
234*ae771770SStanislav Sedov   position) to the current length of the extended string (which refers
235*ae771770SStanislav Sedov   to a potential position beyond the current end).  If the current
236*ae771770SStanislav Sedov   state is <n,i>, the next state is <n,i+1> if i is less than the
237*ae771770SStanislav Sedov   length of the extended string, or <n+1,0> if i equals the length of
238*ae771770SStanislav Sedov   the extended string.  In other words, each state change causes i to
239*ae771770SStanislav Sedov   increment, wrapping around to zero if necessary, and n counts the
240*ae771770SStanislav Sedov   number of wrap-arounds.
241*ae771770SStanislav Sedov
242*ae771770SStanislav Sedov   Notice that the state always advances monotonically (there is no way
243*ae771770SStanislav Sedov   for the decoder to return to an earlier state).  At each state, an
244*ae771770SStanislav Sedov   insertion is either performed or not performed.  At most one
245*ae771770SStanislav Sedov   insertion is performed in a given state.  An insertion inserts the
246*ae771770SStanislav Sedov   value of n at position i in the extended string.  The deltas are a
247*ae771770SStanislav Sedov   run-length encoding of this sequence of events: they are the lengths
248*ae771770SStanislav Sedov   of the runs of non-insertion states preceeding the insertion states.
249*ae771770SStanislav Sedov   Hence, for each delta, the decoder performs delta state changes, then
250*ae771770SStanislav Sedov   an insertion, and then one more state change.  (An implementation
251*ae771770SStanislav Sedov   need not perform each state change individually, but can instead use
252*ae771770SStanislav Sedov   division and remainder calculations to compute the next insertion
253*ae771770SStanislav Sedov   state directly.)  It is an error if the inserted code point is a
254*ae771770SStanislav Sedov   basic code point (because basic code points were supposed to be
255*ae771770SStanislav Sedov   segregated as described in section 3.1).
256*ae771770SStanislav Sedov
257*ae771770SStanislav Sedov   The encoder's main task is to derive the sequence of deltas that will
258*ae771770SStanislav Sedov   cause the decoder to construct the desired string.  It can do this by
259*ae771770SStanislav Sedov   repeatedly scanning the extended string for the next code point that
260*ae771770SStanislav Sedov   the decoder would need to insert, and counting the number of state
261*ae771770SStanislav Sedov   changes the decoder would need to perform, mindful of the fact that
262*ae771770SStanislav Sedov   the decoder's extended string will include only those code points
263*ae771770SStanislav Sedov   that have already been inserted.  Section 6.3 "Encoding procedure"
264*ae771770SStanislav Sedov   gives a precise algorithm.
265*ae771770SStanislav Sedov
266*ae771770SStanislav Sedov3.3 Generalized variable-length integers
267*ae771770SStanislav Sedov
268*ae771770SStanislav Sedov   In a conventional integer representation the base is the number of
269*ae771770SStanislav Sedov   distinct symbols for digits, whose values are 0 through base-1.  Let
270*ae771770SStanislav Sedov   digit_0 denote the least significant digit, digit_1 the next least
271*ae771770SStanislav Sedov   significant, and so on.  The value represented is the sum over j of
272*ae771770SStanislav Sedov   digit_j * w(j), where w(j) = base^j is the weight (scale factor) for
273*ae771770SStanislav Sedov   position j.  For example, in the base 8 integer 437, the digits are
274*ae771770SStanislav Sedov   7, 3, and 4, and the weights are 1, 8, and 64, so the value is 7 +
275*ae771770SStanislav Sedov   3*8 + 4*64 = 287.  This representation has two disadvantages:  First,
276*ae771770SStanislav Sedov   there are multiple encodings of each value (because there can be
277*ae771770SStanislav Sedov   extra zeros in the most significant positions), which is inconvenient
278*ae771770SStanislav Sedov
279*ae771770SStanislav Sedov
280*ae771770SStanislav Sedov
281*ae771770SStanislav Sedov
282*ae771770SStanislav SedovCostello                    Standards Track                     [Page 5]
283*ae771770SStanislav Sedov
284*ae771770SStanislav SedovRFC 3492                     IDNA Punycode                    March 2003
285*ae771770SStanislav Sedov
286*ae771770SStanislav Sedov
287*ae771770SStanislav Sedov   when unique encodings are needed.  Second, the integer is not self-
288*ae771770SStanislav Sedov   delimiting, so if multiple integers are concatenated the boundaries
289*ae771770SStanislav Sedov   between them are lost.
290*ae771770SStanislav Sedov
291*ae771770SStanislav Sedov   The generalized variable-length representation solves these two
292*ae771770SStanislav Sedov   problems.  The digit values are still 0 through base-1, but now the
293*ae771770SStanislav Sedov   integer is self-delimiting by means of thresholds t(j), each of which
294*ae771770SStanislav Sedov   is in the range 0 through base-1.  Exactly one digit, the most
295*ae771770SStanislav Sedov   significant, satisfies digit_j < t(j).  Therefore, if several
296*ae771770SStanislav Sedov   integers are concatenated, it is easy to separate them, starting with
297*ae771770SStanislav Sedov   the first if they are little-endian (least significant digit first),
298*ae771770SStanislav Sedov   or starting with the last if they are big-endian (most significant
299*ae771770SStanislav Sedov   digit first).  As before, the value is the sum over j of digit_j *
300*ae771770SStanislav Sedov   w(j), but the weights are different:
301*ae771770SStanislav Sedov
302*ae771770SStanislav Sedov      w(0) = 1
303*ae771770SStanislav Sedov      w(j) = w(j-1) * (base - t(j-1)) for j > 0
304*ae771770SStanislav Sedov
305*ae771770SStanislav Sedov   For example, consider the little-endian sequence of base 8 digits
306*ae771770SStanislav Sedov   734251...  Suppose the thresholds are 2, 3, 5, 5, 5, 5...  This
307*ae771770SStanislav Sedov   implies that the weights are 1, 1*(8-2) = 6, 6*(8-3) = 30, 30*(8-5) =
308*ae771770SStanislav Sedov   90, 90*(8-5) = 270, and so on.  7 is not less than 2, and 3 is not
309*ae771770SStanislav Sedov   less than 3, but 4 is less than 5, so 4 is the last digit.  The value
310*ae771770SStanislav Sedov   of 734 is 7*1 + 3*6 + 4*30 = 145.  The next integer is 251, with
311*ae771770SStanislav Sedov   value 2*1 + 5*6 + 1*30 = 62.  Decoding this representation is very
312*ae771770SStanislav Sedov   similar to decoding a conventional integer:  Start with a current
313*ae771770SStanislav Sedov   value of N = 0 and a weight w = 1.  Fetch the next digit d and
314*ae771770SStanislav Sedov   increase N by d * w.  If d is less than the current threshold (t)
315*ae771770SStanislav Sedov   then stop, otherwise increase w by a factor of (base - t), update t
316*ae771770SStanislav Sedov   for the next position, and repeat.
317*ae771770SStanislav Sedov
318*ae771770SStanislav Sedov   Encoding this representation is similar to encoding a conventional
319*ae771770SStanislav Sedov   integer:  If N < t then output one digit for N and stop, otherwise
320*ae771770SStanislav Sedov   output the digit for t + ((N - t) mod (base - t)), then replace N
321*ae771770SStanislav Sedov   with (N - t) div (base - t), update t for the next position, and
322*ae771770SStanislav Sedov   repeat.
323*ae771770SStanislav Sedov
324*ae771770SStanislav Sedov   For any particular set of values of t(j), there is exactly one
325*ae771770SStanislav Sedov   generalized variable-length representation of each nonnegative
326*ae771770SStanislav Sedov   integral value.
327*ae771770SStanislav Sedov
328*ae771770SStanislav Sedov   Bootstring uses little-endian ordering so that the deltas can be
329*ae771770SStanislav Sedov   separated starting with the first.  The t(j) values are defined in
330*ae771770SStanislav Sedov   terms of the constants base, tmin, and tmax, and a state variable
331*ae771770SStanislav Sedov   called bias:
332*ae771770SStanislav Sedov
333*ae771770SStanislav Sedov      t(j) = base * (j + 1) - bias,
334*ae771770SStanislav Sedov      clamped to the range tmin through tmax
335*ae771770SStanislav Sedov
336*ae771770SStanislav Sedov
337*ae771770SStanislav Sedov
338*ae771770SStanislav SedovCostello                    Standards Track                     [Page 6]
339*ae771770SStanislav Sedov
340*ae771770SStanislav SedovRFC 3492                     IDNA Punycode                    March 2003
341*ae771770SStanislav Sedov
342*ae771770SStanislav Sedov
343*ae771770SStanislav Sedov   The clamping means that if the formula yields a value less than tmin
344*ae771770SStanislav Sedov   or greater than tmax, then t(j) = tmin or tmax, respectively.  (In
345*ae771770SStanislav Sedov   the pseudocode in section 6 "Bootstring algorithms", the expression
346*ae771770SStanislav Sedov   base * (j + 1) is denoted by k for performance reasons.)  These t(j)
347*ae771770SStanislav Sedov   values cause the representation to favor integers within a particular
348*ae771770SStanislav Sedov   range determined by the bias.
349*ae771770SStanislav Sedov
350*ae771770SStanislav Sedov3.4 Bias adaptation
351*ae771770SStanislav Sedov
352*ae771770SStanislav Sedov   After each delta is encoded or decoded, bias is set for the next
353*ae771770SStanislav Sedov   delta as follows:
354*ae771770SStanislav Sedov
355*ae771770SStanislav Sedov   1. Delta is scaled in order to avoid overflow in the next step:
356*ae771770SStanislav Sedov
357*ae771770SStanislav Sedov         let delta = delta div 2
358*ae771770SStanislav Sedov
359*ae771770SStanislav Sedov      But when this is the very first delta, the divisor is not 2, but
360*ae771770SStanislav Sedov      instead a constant called damp.  This compensates for the fact
361*ae771770SStanislav Sedov      that the second delta is usually much smaller than the first.
362*ae771770SStanislav Sedov
363*ae771770SStanislav Sedov   2. Delta is increased to compensate for the fact that the next delta
364*ae771770SStanislav Sedov      will be inserting into a longer string:
365*ae771770SStanislav Sedov
366*ae771770SStanislav Sedov         let delta = delta + (delta div numpoints)
367*ae771770SStanislav Sedov
368*ae771770SStanislav Sedov      numpoints is the total number of code points encoded/decoded so
369*ae771770SStanislav Sedov      far (including the one corresponding to this delta itself, and
370*ae771770SStanislav Sedov      including the basic code points).
371*ae771770SStanislav Sedov
372*ae771770SStanislav Sedov   3. Delta is repeatedly divided until it falls within a threshold, to
373*ae771770SStanislav Sedov      predict the minimum number of digits needed to represent the next
374*ae771770SStanislav Sedov      delta:
375*ae771770SStanislav Sedov
376*ae771770SStanislav Sedov         while delta > ((base - tmin) * tmax) div 2
377*ae771770SStanislav Sedov         do let delta = delta div (base - tmin)
378*ae771770SStanislav Sedov
379*ae771770SStanislav Sedov   4. The bias is set:
380*ae771770SStanislav Sedov
381*ae771770SStanislav Sedov         let bias =
382*ae771770SStanislav Sedov           (base * the number of divisions performed in step 3) +
383*ae771770SStanislav Sedov           (((base - tmin + 1) * delta) div (delta + skew))
384*ae771770SStanislav Sedov
385*ae771770SStanislav Sedov      The motivation for this procedure is that the current delta
386*ae771770SStanislav Sedov      provides a hint about the likely size of the next delta, and so
387*ae771770SStanislav Sedov      t(j) is set to tmax for the more significant digits starting with
388*ae771770SStanislav Sedov      the one expected to be last, tmin for the less significant digits
389*ae771770SStanislav Sedov      up through the one expected to be third-last, and somewhere
390*ae771770SStanislav Sedov      between tmin and tmax for the digit expected to be second-last
391*ae771770SStanislav Sedov
392*ae771770SStanislav Sedov
393*ae771770SStanislav Sedov
394*ae771770SStanislav SedovCostello                    Standards Track                     [Page 7]
395*ae771770SStanislav Sedov
396*ae771770SStanislav SedovRFC 3492                     IDNA Punycode                    March 2003
397*ae771770SStanislav Sedov
398*ae771770SStanislav Sedov
399*ae771770SStanislav Sedov      (balancing the hope of the expected-last digit being unnecessary
400*ae771770SStanislav Sedov      against the danger of it being insufficient).
401*ae771770SStanislav Sedov
402*ae771770SStanislav Sedov4. Bootstring parameters
403*ae771770SStanislav Sedov
404*ae771770SStanislav Sedov   Given a set of basic code points, one needs to be designated as the
405*ae771770SStanislav Sedov   delimiter.  The base cannot be greater than the number of
406*ae771770SStanislav Sedov   distinguishable basic code points remaining.  The digit-values in the
407*ae771770SStanislav Sedov   range 0 through base-1 need to be associated with distinct non-
408*ae771770SStanislav Sedov   delimiter basic code points.  In some cases multiple code points need
409*ae771770SStanislav Sedov   to have the same digit-value; for example, uppercase and lowercase
410*ae771770SStanislav Sedov   versions of the same letter need to be equivalent if basic strings
411*ae771770SStanislav Sedov   are case-insensitive.
412*ae771770SStanislav Sedov
413*ae771770SStanislav Sedov   The initial value of n cannot be greater than the minimum non-basic
414*ae771770SStanislav Sedov   code point that could appear in extended strings.
415*ae771770SStanislav Sedov
416*ae771770SStanislav Sedov   The remaining five parameters (tmin, tmax, skew, damp, and the
417*ae771770SStanislav Sedov   initial value of bias) need to satisfy the following constraints:
418*ae771770SStanislav Sedov
419*ae771770SStanislav Sedov      0 <= tmin <= tmax <= base-1
420*ae771770SStanislav Sedov      skew >= 1
421*ae771770SStanislav Sedov      damp >= 2
422*ae771770SStanislav Sedov      initial_bias mod base <= base - tmin
423*ae771770SStanislav Sedov
424*ae771770SStanislav Sedov   Provided the constraints are satisfied, these five parameters affect
425*ae771770SStanislav Sedov   efficiency but not correctness.  They are best chosen empirically.
426*ae771770SStanislav Sedov
427*ae771770SStanislav Sedov   If support for mixed-case annotation is desired (see appendix A),
428*ae771770SStanislav Sedov   make sure that the code points corresponding to 0 through tmax-1 all
429*ae771770SStanislav Sedov   have both uppercase and lowercase forms.
430*ae771770SStanislav Sedov
431*ae771770SStanislav Sedov5. Parameter values for Punycode
432*ae771770SStanislav Sedov
433*ae771770SStanislav Sedov   Punycode uses the following Bootstring parameter values:
434*ae771770SStanislav Sedov
435*ae771770SStanislav Sedov      base         = 36
436*ae771770SStanislav Sedov      tmin         = 1
437*ae771770SStanislav Sedov      tmax         = 26
438*ae771770SStanislav Sedov      skew         = 38
439*ae771770SStanislav Sedov      damp         = 700
440*ae771770SStanislav Sedov      initial_bias = 72
441*ae771770SStanislav Sedov      initial_n    = 128 = 0x80
442*ae771770SStanislav Sedov
443*ae771770SStanislav Sedov   Although the only restriction Punycode imposes on the input integers
444*ae771770SStanislav Sedov   is that they be nonnegative, these parameters are especially designed
445*ae771770SStanislav Sedov   to work well with Unicode [UNICODE] code points, which are integers
446*ae771770SStanislav Sedov   in the range 0..10FFFF (but not D800..DFFF, which are reserved for
447*ae771770SStanislav Sedov
448*ae771770SStanislav Sedov
449*ae771770SStanislav Sedov
450*ae771770SStanislav SedovCostello                    Standards Track                     [Page 8]
451*ae771770SStanislav Sedov
452*ae771770SStanislav SedovRFC 3492                     IDNA Punycode                    March 2003
453*ae771770SStanislav Sedov
454*ae771770SStanislav Sedov
455*ae771770SStanislav Sedov   use by the UTF-16 encoding of Unicode).  The basic code points are
456*ae771770SStanislav Sedov   the ASCII [ASCII] code points (0..7F), of which U+002D (-) is the
457*ae771770SStanislav Sedov   delimiter, and some of the others have digit-values as follows:
458*ae771770SStanislav Sedov
459*ae771770SStanislav Sedov      code points    digit-values
460*ae771770SStanislav Sedov      ------------   ----------------------
461*ae771770SStanislav Sedov      41..5A (A-Z) =  0 to 25, respectively
462*ae771770SStanislav Sedov      61..7A (a-z) =  0 to 25, respectively
463*ae771770SStanislav Sedov      30..39 (0-9) = 26 to 35, respectively
464*ae771770SStanislav Sedov
465*ae771770SStanislav Sedov   Using hyphen-minus as the delimiter implies that the encoded string
466*ae771770SStanislav Sedov   can end with a hyphen-minus only if the Unicode string consists
467*ae771770SStanislav Sedov   entirely of basic code points, but IDNA forbids such strings from
468*ae771770SStanislav Sedov   being encoded.  The encoded string can begin with a hyphen-minus, but
469*ae771770SStanislav Sedov   IDNA prepends a prefix.  Therefore IDNA using Punycode conforms to
470*ae771770SStanislav Sedov   the RFC 952 rule that host name labels neither begin nor end with a
471*ae771770SStanislav Sedov   hyphen-minus [RFC952].
472*ae771770SStanislav Sedov
473*ae771770SStanislav Sedov   A decoder MUST recognize the letters in both uppercase and lowercase
474*ae771770SStanislav Sedov   forms (including mixtures of both forms).  An encoder SHOULD output
475*ae771770SStanislav Sedov   only uppercase forms or only lowercase forms, unless it uses mixed-
476*ae771770SStanislav Sedov   case annotation (see appendix A).
477*ae771770SStanislav Sedov
478*ae771770SStanislav Sedov   Presumably most users will not manually write or type encoded strings
479*ae771770SStanislav Sedov   (as opposed to cutting and pasting them), but those who do will need
480*ae771770SStanislav Sedov   to be alert to the potential visual ambiguity between the following
481*ae771770SStanislav Sedov   sets of characters:
482*ae771770SStanislav Sedov
483*ae771770SStanislav Sedov      G 6
484*ae771770SStanislav Sedov      I l 1
485*ae771770SStanislav Sedov      O 0
486*ae771770SStanislav Sedov      S 5
487*ae771770SStanislav Sedov      U V
488*ae771770SStanislav Sedov      Z 2
489*ae771770SStanislav Sedov
490*ae771770SStanislav Sedov   Such ambiguities are usually resolved by context, but in a Punycode
491*ae771770SStanislav Sedov   encoded string there is no context apparent to humans.
492*ae771770SStanislav Sedov
493*ae771770SStanislav Sedov6. Bootstring algorithms
494*ae771770SStanislav Sedov
495*ae771770SStanislav Sedov   Some parts of the pseudocode can be omitted if the parameters satisfy
496*ae771770SStanislav Sedov   certain conditions (for which Punycode qualifies).  These parts are
497*ae771770SStanislav Sedov   enclosed in {braces}, and notes immediately following the pseudocode
498*ae771770SStanislav Sedov   explain the conditions under which they can be omitted.
499*ae771770SStanislav Sedov
500*ae771770SStanislav Sedov
501*ae771770SStanislav Sedov
502*ae771770SStanislav Sedov
503*ae771770SStanislav Sedov
504*ae771770SStanislav Sedov
505*ae771770SStanislav Sedov
506*ae771770SStanislav SedovCostello                    Standards Track                     [Page 9]
507*ae771770SStanislav Sedov
508*ae771770SStanislav SedovRFC 3492                     IDNA Punycode                    March 2003
509*ae771770SStanislav Sedov
510*ae771770SStanislav Sedov
511*ae771770SStanislav Sedov   Formally, code points are integers, and hence the pseudocode assumes
512*ae771770SStanislav Sedov   that arithmetic operations can be performed directly on code points.
513*ae771770SStanislav Sedov   In some programming languages, explicit conversion between code
514*ae771770SStanislav Sedov   points and integers might be necessary.
515*ae771770SStanislav Sedov
516*ae771770SStanislav Sedov6.1 Bias adaptation function
517*ae771770SStanislav Sedov
518*ae771770SStanislav Sedov   function adapt(delta,numpoints,firsttime):
519*ae771770SStanislav Sedov     if firsttime then let delta = delta div damp
520*ae771770SStanislav Sedov     else let delta = delta div 2
521*ae771770SStanislav Sedov     let delta = delta + (delta div numpoints)
522*ae771770SStanislav Sedov     let k = 0
523*ae771770SStanislav Sedov     while delta > ((base - tmin) * tmax) div 2 do begin
524*ae771770SStanislav Sedov       let delta = delta div (base - tmin)
525*ae771770SStanislav Sedov       let k = k + base
526*ae771770SStanislav Sedov     end
527*ae771770SStanislav Sedov     return k + (((base - tmin + 1) * delta) div (delta + skew))
528*ae771770SStanislav Sedov
529*ae771770SStanislav Sedov   It does not matter whether the modifications to delta and k inside
530*ae771770SStanislav Sedov   adapt() affect variables of the same name inside the
531*ae771770SStanislav Sedov   encoding/decoding procedures, because after calling adapt() the
532*ae771770SStanislav Sedov   caller does not read those variables before overwriting them.
533*ae771770SStanislav Sedov
534*ae771770SStanislav Sedov
535*ae771770SStanislav Sedov
536*ae771770SStanislav Sedov
537*ae771770SStanislav Sedov
538*ae771770SStanislav Sedov
539*ae771770SStanislav Sedov
540*ae771770SStanislav Sedov
541*ae771770SStanislav Sedov
542*ae771770SStanislav Sedov
543*ae771770SStanislav Sedov
544*ae771770SStanislav Sedov
545*ae771770SStanislav Sedov
546*ae771770SStanislav Sedov
547*ae771770SStanislav Sedov
548*ae771770SStanislav Sedov
549*ae771770SStanislav Sedov
550*ae771770SStanislav Sedov
551*ae771770SStanislav Sedov
552*ae771770SStanislav Sedov
553*ae771770SStanislav Sedov
554*ae771770SStanislav Sedov
555*ae771770SStanislav Sedov
556*ae771770SStanislav Sedov
557*ae771770SStanislav Sedov
558*ae771770SStanislav Sedov
559*ae771770SStanislav Sedov
560*ae771770SStanislav Sedov
561*ae771770SStanislav Sedov
562*ae771770SStanislav SedovCostello                    Standards Track                    [Page 10]
563*ae771770SStanislav Sedov
564*ae771770SStanislav SedovRFC 3492                     IDNA Punycode                    March 2003
565*ae771770SStanislav Sedov
566*ae771770SStanislav Sedov
567*ae771770SStanislav Sedov6.2 Decoding procedure
568*ae771770SStanislav Sedov
569*ae771770SStanislav Sedov   let n = initial_n
570*ae771770SStanislav Sedov   let i = 0
571*ae771770SStanislav Sedov   let bias = initial_bias
572*ae771770SStanislav Sedov   let output = an empty string indexed from 0
573*ae771770SStanislav Sedov   consume all code points before the last delimiter (if there is one)
574*ae771770SStanislav Sedov     and copy them to output, fail on any non-basic code point
575*ae771770SStanislav Sedov   if more than zero code points were consumed then consume one more
576*ae771770SStanislav Sedov     (which will be the last delimiter)
577*ae771770SStanislav Sedov   while the input is not exhausted do begin
578*ae771770SStanislav Sedov     let oldi = i
579*ae771770SStanislav Sedov     let w = 1
580*ae771770SStanislav Sedov     for k = base to infinity in steps of base do begin
581*ae771770SStanislav Sedov       consume a code point, or fail if there was none to consume
582*ae771770SStanislav Sedov       let digit = the code point's digit-value, fail if it has none
583*ae771770SStanislav Sedov       let i = i + digit * w, fail on overflow
584*ae771770SStanislav Sedov       let t = tmin if k <= bias {+ tmin}, or
585*ae771770SStanislav Sedov               tmax if k >= bias + tmax, or k - bias otherwise
586*ae771770SStanislav Sedov       if digit < t then break
587*ae771770SStanislav Sedov       let w = w * (base - t), fail on overflow
588*ae771770SStanislav Sedov     end
589*ae771770SStanislav Sedov     let bias = adapt(i - oldi, length(output) + 1, test oldi is 0?)
590*ae771770SStanislav Sedov     let n = n + i div (length(output) + 1), fail on overflow
591*ae771770SStanislav Sedov     let i = i mod (length(output) + 1)
592*ae771770SStanislav Sedov     {if n is a basic code point then fail}
593*ae771770SStanislav Sedov     insert n into output at position i
594*ae771770SStanislav Sedov     increment i
595*ae771770SStanislav Sedov   end
596*ae771770SStanislav Sedov
597*ae771770SStanislav Sedov   The full statement enclosed in braces (checking whether n is a basic
598*ae771770SStanislav Sedov   code point) can be omitted if initial_n exceeds all basic code points
599*ae771770SStanislav Sedov   (which is true for Punycode), because n is never less than initial_n.
600*ae771770SStanislav Sedov
601*ae771770SStanislav Sedov   In the assignment of t, where t is clamped to the range tmin through
602*ae771770SStanislav Sedov   tmax, "+ tmin" can always be omitted.  This makes the clamping
603*ae771770SStanislav Sedov   calculation incorrect when bias < k < bias + tmin, but that cannot
604*ae771770SStanislav Sedov   happen because of the way bias is computed and because of the
605*ae771770SStanislav Sedov   constraints on the parameters.
606*ae771770SStanislav Sedov
607*ae771770SStanislav Sedov   Because the decoder state can only advance monotonically, and there
608*ae771770SStanislav Sedov   is only one representation of any delta, there is therefore only one
609*ae771770SStanislav Sedov   encoded string that can represent a given sequence of integers.  The
610*ae771770SStanislav Sedov   only error conditions are invalid code points, unexpected end-of-
611*ae771770SStanislav Sedov   input, overflow, and basic code points encoded using deltas instead
612*ae771770SStanislav Sedov   of appearing literally.  If the decoder fails on these errors as
613*ae771770SStanislav Sedov   shown above, then it cannot produce the same output for two distinct
614*ae771770SStanislav Sedov   inputs.  Without this property it would have been necessary to re-
615*ae771770SStanislav Sedov
616*ae771770SStanislav Sedov
617*ae771770SStanislav Sedov
618*ae771770SStanislav SedovCostello                    Standards Track                    [Page 11]
619*ae771770SStanislav Sedov
620*ae771770SStanislav SedovRFC 3492                     IDNA Punycode                    March 2003
621*ae771770SStanislav Sedov
622*ae771770SStanislav Sedov
623*ae771770SStanislav Sedov   encode the output and verify that it matches the input in order to
624*ae771770SStanislav Sedov   guarantee the uniqueness of the encoding.
625*ae771770SStanislav Sedov
626*ae771770SStanislav Sedov6.3 Encoding procedure
627*ae771770SStanislav Sedov
628*ae771770SStanislav Sedov   let n = initial_n
629*ae771770SStanislav Sedov   let delta = 0
630*ae771770SStanislav Sedov   let bias = initial_bias
631*ae771770SStanislav Sedov   let h = b = the number of basic code points in the input
632*ae771770SStanislav Sedov   copy them to the output in order, followed by a delimiter if b > 0
633*ae771770SStanislav Sedov   {if the input contains a non-basic code point < n then fail}
634*ae771770SStanislav Sedov   while h < length(input) do begin
635*ae771770SStanislav Sedov     let m = the minimum {non-basic} code point >= n in the input
636*ae771770SStanislav Sedov     let delta = delta + (m - n) * (h + 1), fail on overflow
637*ae771770SStanislav Sedov     let n = m
638*ae771770SStanislav Sedov     for each code point c in the input (in order) do begin
639*ae771770SStanislav Sedov       if c < n {or c is basic} then increment delta, fail on overflow
640*ae771770SStanislav Sedov       if c == n then begin
641*ae771770SStanislav Sedov         let q = delta
642*ae771770SStanislav Sedov         for k = base to infinity in steps of base do begin
643*ae771770SStanislav Sedov           let t = tmin if k <= bias {+ tmin}, or
644*ae771770SStanislav Sedov                   tmax if k >= bias + tmax, or k - bias otherwise
645*ae771770SStanislav Sedov           if q < t then break
646*ae771770SStanislav Sedov           output the code point for digit t + ((q - t) mod (base - t))
647*ae771770SStanislav Sedov           let q = (q - t) div (base - t)
648*ae771770SStanislav Sedov         end
649*ae771770SStanislav Sedov         output the code point for digit q
650*ae771770SStanislav Sedov         let bias = adapt(delta, h + 1, test h equals b?)
651*ae771770SStanislav Sedov         let delta = 0
652*ae771770SStanislav Sedov         increment h
653*ae771770SStanislav Sedov       end
654*ae771770SStanislav Sedov     end
655*ae771770SStanislav Sedov     increment delta and n
656*ae771770SStanislav Sedov   end
657*ae771770SStanislav Sedov
658*ae771770SStanislav Sedov   The full statement enclosed in braces (checking whether the input
659*ae771770SStanislav Sedov   contains a non-basic code point less than n) can be omitted if all
660*ae771770SStanislav Sedov   code points less than initial_n are basic code points (which is true
661*ae771770SStanislav Sedov   for Punycode if code points are unsigned).
662*ae771770SStanislav Sedov
663*ae771770SStanislav Sedov   The brace-enclosed conditions "non-basic" and "or c is basic" can be
664*ae771770SStanislav Sedov   omitted if initial_n exceeds all basic code points (which is true for
665*ae771770SStanislav Sedov   Punycode), because the code point being tested is never less than
666*ae771770SStanislav Sedov   initial_n.
667*ae771770SStanislav Sedov
668*ae771770SStanislav Sedov   In the assignment of t, where t is clamped to the range tmin through
669*ae771770SStanislav Sedov   tmax, "+ tmin" can always be omitted.  This makes the clamping
670*ae771770SStanislav Sedov   calculation incorrect when bias < k < bias + tmin, but that cannot
671*ae771770SStanislav Sedov
672*ae771770SStanislav Sedov
673*ae771770SStanislav Sedov
674*ae771770SStanislav SedovCostello                    Standards Track                    [Page 12]
675*ae771770SStanislav Sedov
676*ae771770SStanislav SedovRFC 3492                     IDNA Punycode                    March 2003
677*ae771770SStanislav Sedov
678*ae771770SStanislav Sedov
679*ae771770SStanislav Sedov   happen because of the way bias is computed and because of the
680*ae771770SStanislav Sedov   constraints on the parameters.
681*ae771770SStanislav Sedov
682*ae771770SStanislav Sedov   The checks for overflow are necessary to avoid producing invalid
683*ae771770SStanislav Sedov   output when the input contains very large values or is very long.
684*ae771770SStanislav Sedov
685*ae771770SStanislav Sedov   The increment of delta at the bottom of the outer loop cannot
686*ae771770SStanislav Sedov   overflow because delta < length(input) before the increment, and
687*ae771770SStanislav Sedov   length(input) is already assumed to be representable.  The increment
688*ae771770SStanislav Sedov   of n could overflow, but only if h == length(input), in which case
689*ae771770SStanislav Sedov   the procedure is finished anyway.
690*ae771770SStanislav Sedov
691*ae771770SStanislav Sedov6.4 Overflow handling
692*ae771770SStanislav Sedov
693*ae771770SStanislav Sedov   For IDNA, 26-bit unsigned integers are sufficient to handle all valid
694*ae771770SStanislav Sedov   IDNA labels without overflow, because any string that needed a 27-bit
695*ae771770SStanislav Sedov   delta would have to exceed either the code point limit (0..10FFFF) or
696*ae771770SStanislav Sedov   the label length limit (63 characters).  However, overflow handling
697*ae771770SStanislav Sedov   is necessary because the inputs are not necessarily valid IDNA
698*ae771770SStanislav Sedov   labels.
699*ae771770SStanislav Sedov
700*ae771770SStanislav Sedov   If the programming language does not provide overflow detection, the
701*ae771770SStanislav Sedov   following technique can be used.  Suppose A, B, and C are
702*ae771770SStanislav Sedov   representable nonnegative integers and C is nonzero.  Then A + B
703*ae771770SStanislav Sedov   overflows if and only if B > maxint - A, and A + (B * C) overflows if
704*ae771770SStanislav Sedov   and only if B > (maxint - A) div C, where maxint is the greatest
705*ae771770SStanislav Sedov   integer for which maxint + 1 cannot be represented.  Refer to
706*ae771770SStanislav Sedov   appendix C "Punycode sample implementation" for demonstrations of
707*ae771770SStanislav Sedov   this technique in the C language.
708*ae771770SStanislav Sedov
709*ae771770SStanislav Sedov   The decoding and encoding algorithms shown in sections 6.2 and 6.3
710*ae771770SStanislav Sedov   handle overflow by detecting it whenever it happens.  Another
711*ae771770SStanislav Sedov   approach is to enforce limits on the inputs that prevent overflow
712*ae771770SStanislav Sedov   from happening.  For example, if the encoder were to verify that no
713*ae771770SStanislav Sedov   input code points exceed M and that the input length does not exceed
714*ae771770SStanislav Sedov   L, then no delta could ever exceed (M - initial_n) * (L + 1), and
715*ae771770SStanislav Sedov   hence no overflow could occur if integer variables were capable of
716*ae771770SStanislav Sedov   representing values that large.  This prevention approach would
717*ae771770SStanislav Sedov   impose more restrictions on the input than the detection approach
718*ae771770SStanislav Sedov   does, but might be considered simpler in some programming languages.
719*ae771770SStanislav Sedov
720*ae771770SStanislav Sedov   In theory, the decoder could use an analogous approach, limiting the
721*ae771770SStanislav Sedov   number of digits in a variable-length integer (that is, limiting the
722*ae771770SStanislav Sedov   number of iterations in the innermost loop).  However, the number of
723*ae771770SStanislav Sedov   digits that suffice to represent a given delta can sometimes
724*ae771770SStanislav Sedov   represent much larger deltas (because of the adaptation), and hence
725*ae771770SStanislav Sedov   this approach would probably need integers wider than 32 bits.
726*ae771770SStanislav Sedov
727*ae771770SStanislav Sedov
728*ae771770SStanislav Sedov
729*ae771770SStanislav Sedov
730*ae771770SStanislav SedovCostello                    Standards Track                    [Page 13]
731*ae771770SStanislav Sedov
732*ae771770SStanislav SedovRFC 3492                     IDNA Punycode                    March 2003
733*ae771770SStanislav Sedov
734*ae771770SStanislav Sedov
735*ae771770SStanislav Sedov   Yet another approach for the decoder is to allow overflow to occur,
736*ae771770SStanislav Sedov   but to check the final output string by re-encoding it and comparing
737*ae771770SStanislav Sedov   to the decoder input.  If and only if they do not match (using a
738*ae771770SStanislav Sedov   case-insensitive ASCII comparison) overflow has occurred.  This
739*ae771770SStanislav Sedov   delayed-detection approach would not impose any more restrictions on
740*ae771770SStanislav Sedov   the input than the immediate-detection approach does, and might be
741*ae771770SStanislav Sedov   considered simpler in some programming languages.
742*ae771770SStanislav Sedov
743*ae771770SStanislav Sedov   In fact, if the decoder is used only inside the IDNA ToUnicode
744*ae771770SStanislav Sedov   operation [IDNA], then it need not check for overflow at all, because
745*ae771770SStanislav Sedov   ToUnicode performs a higher level re-encoding and comparison, and a
746*ae771770SStanislav Sedov   mismatch has the same consequence as if the Punycode decoder had
747*ae771770SStanislav Sedov   failed.
748*ae771770SStanislav Sedov
749*ae771770SStanislav Sedov7. Punycode examples
750*ae771770SStanislav Sedov
751*ae771770SStanislav Sedov7.1 Sample strings
752*ae771770SStanislav Sedov
753*ae771770SStanislav Sedov   In the Punycode encodings below, the ACE prefix is not shown.
754*ae771770SStanislav Sedov   Backslashes show where line breaks have been inserted in strings too
755*ae771770SStanislav Sedov   long for one line.
756*ae771770SStanislav Sedov
757*ae771770SStanislav Sedov   The first several examples are all translations of the sentence "Why
758*ae771770SStanislav Sedov   can't they just speak in <language>?" (courtesy of Michael Kaplan's
759*ae771770SStanislav Sedov   "provincial" page [PROVINCIAL]).  Word breaks and punctuation have
760*ae771770SStanislav Sedov   been removed, as is often done in domain names.
761*ae771770SStanislav Sedov
762*ae771770SStanislav Sedov   (A) Arabic (Egyptian):
763*ae771770SStanislav Sedov       u+0644 u+064A u+0647 u+0645 u+0627 u+0628 u+062A u+0643 u+0644
764*ae771770SStanislav Sedov       u+0645 u+0648 u+0634 u+0639 u+0631 u+0628 u+064A u+061F
765*ae771770SStanislav Sedov       Punycode: egbpdaj6bu4bxfgehfvwxn
766*ae771770SStanislav Sedov
767*ae771770SStanislav Sedov   (B) Chinese (simplified):
768*ae771770SStanislav Sedov       u+4ED6 u+4EEC u+4E3A u+4EC0 u+4E48 u+4E0D u+8BF4 u+4E2D u+6587
769*ae771770SStanislav Sedov       Punycode: ihqwcrb4cv8a8dqg056pqjye
770*ae771770SStanislav Sedov
771*ae771770SStanislav Sedov   (C) Chinese (traditional):
772*ae771770SStanislav Sedov       u+4ED6 u+5011 u+7232 u+4EC0 u+9EBD u+4E0D u+8AAA u+4E2D u+6587
773*ae771770SStanislav Sedov       Punycode: ihqwctvzc91f659drss3x8bo0yb
774*ae771770SStanislav Sedov
775*ae771770SStanislav Sedov   (D) Czech: Pro<ccaron>prost<ecaron>nemluv<iacute><ccaron>esky
776*ae771770SStanislav Sedov       U+0050 u+0072 u+006F u+010D u+0070 u+0072 u+006F u+0073 u+0074
777*ae771770SStanislav Sedov       u+011B u+006E u+0065 u+006D u+006C u+0075 u+0076 u+00ED u+010D
778*ae771770SStanislav Sedov       u+0065 u+0073 u+006B u+0079
779*ae771770SStanislav Sedov       Punycode: Proprostnemluvesky-uyb24dma41a
780*ae771770SStanislav Sedov
781*ae771770SStanislav Sedov
782*ae771770SStanislav Sedov
783*ae771770SStanislav Sedov
784*ae771770SStanislav Sedov
785*ae771770SStanislav Sedov
786*ae771770SStanislav SedovCostello                    Standards Track                    [Page 14]
787*ae771770SStanislav Sedov
788*ae771770SStanislav SedovRFC 3492                     IDNA Punycode                    March 2003
789*ae771770SStanislav Sedov
790*ae771770SStanislav Sedov
791*ae771770SStanislav Sedov   (E) Hebrew:
792*ae771770SStanislav Sedov       u+05DC u+05DE u+05D4 u+05D4 u+05DD u+05E4 u+05E9 u+05D5 u+05D8
793*ae771770SStanislav Sedov       u+05DC u+05D0 u+05DE u+05D3 u+05D1 u+05E8 u+05D9 u+05DD u+05E2
794*ae771770SStanislav Sedov       u+05D1 u+05E8 u+05D9 u+05EA
795*ae771770SStanislav Sedov       Punycode: 4dbcagdahymbxekheh6e0a7fei0b
796*ae771770SStanislav Sedov
797*ae771770SStanislav Sedov   (F) Hindi (Devanagari):
798*ae771770SStanislav Sedov       u+092F u+0939 u+0932 u+094B u+0917 u+0939 u+093F u+0928 u+094D
799*ae771770SStanislav Sedov       u+0926 u+0940 u+0915 u+094D u+092F u+094B u+0902 u+0928 u+0939
800*ae771770SStanislav Sedov       u+0940 u+0902 u+092C u+094B u+0932 u+0938 u+0915 u+0924 u+0947
801*ae771770SStanislav Sedov       u+0939 u+0948 u+0902
802*ae771770SStanislav Sedov       Punycode: i1baa7eci9glrd9b2ae1bj0hfcgg6iyaf8o0a1dig0cd
803*ae771770SStanislav Sedov
804*ae771770SStanislav Sedov   (G) Japanese (kanji and hiragana):
805*ae771770SStanislav Sedov       u+306A u+305C u+307F u+3093 u+306A u+65E5 u+672C u+8A9E u+3092
806*ae771770SStanislav Sedov       u+8A71 u+3057 u+3066 u+304F u+308C u+306A u+3044 u+306E u+304B
807*ae771770SStanislav Sedov       Punycode: n8jok5ay5dzabd5bym9f0cm5685rrjetr6pdxa
808*ae771770SStanislav Sedov
809*ae771770SStanislav Sedov   (H) Korean (Hangul syllables):
810*ae771770SStanislav Sedov       u+C138 u+ACC4 u+C758 u+BAA8 u+B4E0 u+C0AC u+B78C u+B4E4 u+C774
811*ae771770SStanislav Sedov       u+D55C u+AD6D u+C5B4 u+B97C u+C774 u+D574 u+D55C u+B2E4 u+BA74
812*ae771770SStanislav Sedov       u+C5BC u+B9C8 u+B098 u+C88B u+C744 u+AE4C
813*ae771770SStanislav Sedov       Punycode: 989aomsvi5e83db1d2a355cv1e0vak1dwrv93d5xbh15a0dt30a5j\
814*ae771770SStanislav Sedov                 psd879ccm6fea98c
815*ae771770SStanislav Sedov
816*ae771770SStanislav Sedov   (I) Russian (Cyrillic):
817*ae771770SStanislav Sedov       U+043F u+043E u+0447 u+0435 u+043C u+0443 u+0436 u+0435 u+043E
818*ae771770SStanislav Sedov       u+043D u+0438 u+043D u+0435 u+0433 u+043E u+0432 u+043E u+0440
819*ae771770SStanislav Sedov       u+044F u+0442 u+043F u+043E u+0440 u+0443 u+0441 u+0441 u+043A
820*ae771770SStanislav Sedov       u+0438
821*ae771770SStanislav Sedov       Punycode: b1abfaaepdrnnbgefbaDotcwatmq2g4l
822*ae771770SStanislav Sedov
823*ae771770SStanislav Sedov   (J) Spanish: Porqu<eacute>nopuedensimplementehablarenEspa<ntilde>ol
824*ae771770SStanislav Sedov       U+0050 u+006F u+0072 u+0071 u+0075 u+00E9 u+006E u+006F u+0070
825*ae771770SStanislav Sedov       u+0075 u+0065 u+0064 u+0065 u+006E u+0073 u+0069 u+006D u+0070
826*ae771770SStanislav Sedov       u+006C u+0065 u+006D u+0065 u+006E u+0074 u+0065 u+0068 u+0061
827*ae771770SStanislav Sedov       u+0062 u+006C u+0061 u+0072 u+0065 u+006E U+0045 u+0073 u+0070
828*ae771770SStanislav Sedov       u+0061 u+00F1 u+006F u+006C
829*ae771770SStanislav Sedov       Punycode: PorqunopuedensimplementehablarenEspaol-fmd56a
830*ae771770SStanislav Sedov
831*ae771770SStanislav Sedov   (K) Vietnamese:
832*ae771770SStanislav Sedov       T<adotbelow>isaoh<odotbelow>kh<ocirc>ngth<ecirchookabove>ch\
833*ae771770SStanislav Sedov       <ihookabove>n<oacute>iti<ecircacute>ngVi<ecircdotbelow>t
834*ae771770SStanislav Sedov       U+0054 u+1EA1 u+0069 u+0073 u+0061 u+006F u+0068 u+1ECD u+006B
835*ae771770SStanislav Sedov       u+0068 u+00F4 u+006E u+0067 u+0074 u+0068 u+1EC3 u+0063 u+0068
836*ae771770SStanislav Sedov       u+1EC9 u+006E u+00F3 u+0069 u+0074 u+0069 u+1EBF u+006E u+0067
837*ae771770SStanislav Sedov       U+0056 u+0069 u+1EC7 u+0074
838*ae771770SStanislav Sedov       Punycode: TisaohkhngthchnitingVit-kjcr8268qyxafd2f1b9g
839*ae771770SStanislav Sedov
840*ae771770SStanislav Sedov
841*ae771770SStanislav Sedov
842*ae771770SStanislav SedovCostello                    Standards Track                    [Page 15]
843*ae771770SStanislav Sedov
844*ae771770SStanislav SedovRFC 3492                     IDNA Punycode                    March 2003
845*ae771770SStanislav Sedov
846*ae771770SStanislav Sedov
847*ae771770SStanislav Sedov   The next several examples are all names of Japanese music artists,
848*ae771770SStanislav Sedov   song titles, and TV programs, just because the author happens to have
849*ae771770SStanislav Sedov   them handy (but Japanese is useful for providing examples of single-
850*ae771770SStanislav Sedov   row text, two-row text, ideographic text, and various mixtures
851*ae771770SStanislav Sedov   thereof).
852*ae771770SStanislav Sedov
853*ae771770SStanislav Sedov   (L) 3<nen>B<gumi><kinpachi><sensei>
854*ae771770SStanislav Sedov       u+0033 u+5E74 U+0042 u+7D44 u+91D1 u+516B u+5148 u+751F
855*ae771770SStanislav Sedov       Punycode: 3B-ww4c5e180e575a65lsy2b
856*ae771770SStanislav Sedov
857*ae771770SStanislav Sedov   (M) <amuro><namie>-with-SUPER-MONKEYS
858*ae771770SStanislav Sedov       u+5B89 u+5BA4 u+5948 u+7F8E u+6075 u+002D u+0077 u+0069 u+0074
859*ae771770SStanislav Sedov       u+0068 u+002D U+0053 U+0055 U+0050 U+0045 U+0052 u+002D U+004D
860*ae771770SStanislav Sedov       U+004F U+004E U+004B U+0045 U+0059 U+0053
861*ae771770SStanislav Sedov       Punycode: -with-SUPER-MONKEYS-pc58ag80a8qai00g7n9n
862*ae771770SStanislav Sedov
863*ae771770SStanislav Sedov   (N) Hello-Another-Way-<sorezore><no><basho>
864*ae771770SStanislav Sedov       U+0048 u+0065 u+006C u+006C u+006F u+002D U+0041 u+006E u+006F
865*ae771770SStanislav Sedov       u+0074 u+0068 u+0065 u+0072 u+002D U+0057 u+0061 u+0079 u+002D
866*ae771770SStanislav Sedov       u+305D u+308C u+305E u+308C u+306E u+5834 u+6240
867*ae771770SStanislav Sedov       Punycode: Hello-Another-Way--fc4qua05auwb3674vfr0b
868*ae771770SStanislav Sedov
869*ae771770SStanislav Sedov   (O) <hitotsu><yane><no><shita>2
870*ae771770SStanislav Sedov       u+3072 u+3068 u+3064 u+5C4B u+6839 u+306E u+4E0B u+0032
871*ae771770SStanislav Sedov       Punycode: 2-u9tlzr9756bt3uc0v
872*ae771770SStanislav Sedov
873*ae771770SStanislav Sedov   (P) Maji<de>Koi<suru>5<byou><mae>
874*ae771770SStanislav Sedov       U+004D u+0061 u+006A u+0069 u+3067 U+004B u+006F u+0069 u+3059
875*ae771770SStanislav Sedov       u+308B u+0035 u+79D2 u+524D
876*ae771770SStanislav Sedov       Punycode: MajiKoi5-783gue6qz075azm5e
877*ae771770SStanislav Sedov
878*ae771770SStanislav Sedov   (Q) <pafii>de<runba>
879*ae771770SStanislav Sedov       u+30D1 u+30D5 u+30A3 u+30FC u+0064 u+0065 u+30EB u+30F3 u+30D0
880*ae771770SStanislav Sedov       Punycode: de-jg4avhby1noc0d
881*ae771770SStanislav Sedov
882*ae771770SStanislav Sedov   (R) <sono><supiido><de>
883*ae771770SStanislav Sedov       u+305D u+306E u+30B9 u+30D4 u+30FC u+30C9 u+3067
884*ae771770SStanislav Sedov       Punycode: d9juau41awczczp
885*ae771770SStanislav Sedov
886*ae771770SStanislav Sedov   The last example is an ASCII string that breaks the existing rules
887*ae771770SStanislav Sedov   for host name labels.  (It is not a realistic example for IDNA,
888*ae771770SStanislav Sedov   because IDNA never encodes pure ASCII labels.)
889*ae771770SStanislav Sedov
890*ae771770SStanislav Sedov   (S) -> $1.00 <-
891*ae771770SStanislav Sedov       u+002D u+003E u+0020 u+0024 u+0031 u+002E u+0030 u+0030 u+0020
892*ae771770SStanislav Sedov       u+003C u+002D
893*ae771770SStanislav Sedov       Punycode: -> $1.00 <--
894*ae771770SStanislav Sedov
895*ae771770SStanislav Sedov
896*ae771770SStanislav Sedov
897*ae771770SStanislav Sedov
898*ae771770SStanislav SedovCostello                    Standards Track                    [Page 16]
899*ae771770SStanislav Sedov
900*ae771770SStanislav SedovRFC 3492                     IDNA Punycode                    March 2003
901*ae771770SStanislav Sedov
902*ae771770SStanislav Sedov
903*ae771770SStanislav Sedov7.2 Decoding traces
904*ae771770SStanislav Sedov
905*ae771770SStanislav Sedov   In the following traces, the evolving state of the decoder is shown
906*ae771770SStanislav Sedov   as a sequence of hexadecimal values, representing the code points in
907*ae771770SStanislav Sedov   the extended string.  An asterisk appears just after the most
908*ae771770SStanislav Sedov   recently inserted code point, indicating both n (the value preceeding
909*ae771770SStanislav Sedov   the asterisk) and i (the position of the value just after the
910*ae771770SStanislav Sedov   asterisk).  Other numerical values are decimal.
911*ae771770SStanislav Sedov
912*ae771770SStanislav Sedov   Decoding trace of example B from section 7.1:
913*ae771770SStanislav Sedov
914*ae771770SStanislav Sedov   n is 128, i is 0, bias is 72
915*ae771770SStanislav Sedov   input is "ihqwcrb4cv8a8dqg056pqjye"
916*ae771770SStanislav Sedov   there is no delimiter, so extended string starts empty
917*ae771770SStanislav Sedov   delta "ihq" decodes to 19853
918*ae771770SStanislav Sedov   bias becomes 21
919*ae771770SStanislav Sedov   4E0D *
920*ae771770SStanislav Sedov   delta "wc" decodes to 64
921*ae771770SStanislav Sedov   bias becomes 20
922*ae771770SStanislav Sedov   4E0D 4E2D *
923*ae771770SStanislav Sedov   delta "rb" decodes to 37
924*ae771770SStanislav Sedov   bias becomes 13
925*ae771770SStanislav Sedov   4E3A * 4E0D 4E2D
926*ae771770SStanislav Sedov   delta "4c" decodes to 56
927*ae771770SStanislav Sedov   bias becomes 17
928*ae771770SStanislav Sedov   4E3A 4E48 * 4E0D 4E2D
929*ae771770SStanislav Sedov   delta "v8a" decodes to 599
930*ae771770SStanislav Sedov   bias becomes 32
931*ae771770SStanislav Sedov   4E3A 4EC0 * 4E48 4E0D 4E2D
932*ae771770SStanislav Sedov   delta "8d" decodes to 130
933*ae771770SStanislav Sedov   bias becomes 23
934*ae771770SStanislav Sedov   4ED6 * 4E3A 4EC0 4E48 4E0D 4E2D
935*ae771770SStanislav Sedov   delta "qg" decodes to 154
936*ae771770SStanislav Sedov   bias becomes 25
937*ae771770SStanislav Sedov   4ED6 4EEC * 4E3A 4EC0 4E48 4E0D 4E2D
938*ae771770SStanislav Sedov   delta "056p" decodes to 46301
939*ae771770SStanislav Sedov   bias becomes 84
940*ae771770SStanislav Sedov   4ED6 4EEC 4E3A 4EC0 4E48 4E0D 4E2D 6587 *
941*ae771770SStanislav Sedov   delta "qjye" decodes to 88531
942*ae771770SStanislav Sedov   bias becomes 90
943*ae771770SStanislav Sedov   4ED6 4EEC 4E3A 4EC0 4E48 4E0D 8BF4 * 4E2D 6587
944*ae771770SStanislav Sedov
945*ae771770SStanislav Sedov
946*ae771770SStanislav Sedov
947*ae771770SStanislav Sedov
948*ae771770SStanislav Sedov
949*ae771770SStanislav Sedov
950*ae771770SStanislav Sedov
951*ae771770SStanislav Sedov
952*ae771770SStanislav Sedov
953*ae771770SStanislav Sedov
954*ae771770SStanislav SedovCostello                    Standards Track                    [Page 17]
955*ae771770SStanislav Sedov
956*ae771770SStanislav SedovRFC 3492                     IDNA Punycode                    March 2003
957*ae771770SStanislav Sedov
958*ae771770SStanislav Sedov
959*ae771770SStanislav Sedov   Decoding trace of example L from section 7.1:
960*ae771770SStanislav Sedov
961*ae771770SStanislav Sedov   n is 128, i is 0, bias is 72
962*ae771770SStanislav Sedov   input is "3B-ww4c5e180e575a65lsy2b"
963*ae771770SStanislav Sedov   literal portion is "3B-", so extended string starts as:
964*ae771770SStanislav Sedov   0033 0042
965*ae771770SStanislav Sedov   delta "ww4c" decodes to 62042
966*ae771770SStanislav Sedov   bias becomes 27
967*ae771770SStanislav Sedov   0033 0042 5148 *
968*ae771770SStanislav Sedov   delta "5e" decodes to 139
969*ae771770SStanislav Sedov   bias becomes 24
970*ae771770SStanislav Sedov   0033 0042 516B * 5148
971*ae771770SStanislav Sedov   delta "180e" decodes to 16683
972*ae771770SStanislav Sedov   bias becomes 67
973*ae771770SStanislav Sedov   0033 5E74 * 0042 516B 5148
974*ae771770SStanislav Sedov   delta "575a" decodes to 34821
975*ae771770SStanislav Sedov   bias becomes 82
976*ae771770SStanislav Sedov   0033 5E74 0042 516B 5148 751F *
977*ae771770SStanislav Sedov   delta "65l" decodes to 14592
978*ae771770SStanislav Sedov   bias becomes 67
979*ae771770SStanislav Sedov   0033 5E74 0042 7D44 * 516B 5148 751F
980*ae771770SStanislav Sedov   delta "sy2b" decodes to 42088
981*ae771770SStanislav Sedov   bias becomes 84
982*ae771770SStanislav Sedov   0033 5E74 0042 7D44 91D1 * 516B 5148 751F
983*ae771770SStanislav Sedov
984*ae771770SStanislav Sedov
985*ae771770SStanislav Sedov
986*ae771770SStanislav Sedov
987*ae771770SStanislav Sedov
988*ae771770SStanislav Sedov
989*ae771770SStanislav Sedov
990*ae771770SStanislav Sedov
991*ae771770SStanislav Sedov
992*ae771770SStanislav Sedov
993*ae771770SStanislav Sedov
994*ae771770SStanislav Sedov
995*ae771770SStanislav Sedov
996*ae771770SStanislav Sedov
997*ae771770SStanislav Sedov
998*ae771770SStanislav Sedov
999*ae771770SStanislav Sedov
1000*ae771770SStanislav Sedov
1001*ae771770SStanislav Sedov
1002*ae771770SStanislav Sedov
1003*ae771770SStanislav Sedov
1004*ae771770SStanislav Sedov
1005*ae771770SStanislav Sedov
1006*ae771770SStanislav Sedov
1007*ae771770SStanislav Sedov
1008*ae771770SStanislav Sedov
1009*ae771770SStanislav Sedov
1010*ae771770SStanislav SedovCostello                    Standards Track                    [Page 18]
1011*ae771770SStanislav Sedov
1012*ae771770SStanislav SedovRFC 3492                     IDNA Punycode                    March 2003
1013*ae771770SStanislav Sedov
1014*ae771770SStanislav Sedov
1015*ae771770SStanislav Sedov7.3 Encoding traces
1016*ae771770SStanislav Sedov
1017*ae771770SStanislav Sedov   In the following traces, code point values are hexadecimal, while
1018*ae771770SStanislav Sedov   other numerical values are decimal.
1019*ae771770SStanislav Sedov
1020*ae771770SStanislav Sedov   Encoding trace of example B from section 7.1:
1021*ae771770SStanislav Sedov
1022*ae771770SStanislav Sedov   bias is 72
1023*ae771770SStanislav Sedov   input is:
1024*ae771770SStanislav Sedov   4ED6 4EEC 4E3A 4EC0 4E48 4E0D 8BF4 4E2D 6587
1025*ae771770SStanislav Sedov   there are no basic code points, so no literal portion
1026*ae771770SStanislav Sedov   next code point to insert is 4E0D
1027*ae771770SStanislav Sedov   needed delta is 19853, encodes as "ihq"
1028*ae771770SStanislav Sedov   bias becomes 21
1029*ae771770SStanislav Sedov   next code point to insert is 4E2D
1030*ae771770SStanislav Sedov   needed delta is 64, encodes as "wc"
1031*ae771770SStanislav Sedov   bias becomes 20
1032*ae771770SStanislav Sedov   next code point to insert is 4E3A
1033*ae771770SStanislav Sedov   needed delta is 37, encodes as "rb"
1034*ae771770SStanislav Sedov   bias becomes 13
1035*ae771770SStanislav Sedov   next code point to insert is 4E48
1036*ae771770SStanislav Sedov   needed delta is 56, encodes as "4c"
1037*ae771770SStanislav Sedov   bias becomes 17
1038*ae771770SStanislav Sedov   next code point to insert is 4EC0
1039*ae771770SStanislav Sedov   needed delta is 599, encodes as "v8a"
1040*ae771770SStanislav Sedov   bias becomes 32
1041*ae771770SStanislav Sedov   next code point to insert is 4ED6
1042*ae771770SStanislav Sedov   needed delta is 130, encodes as "8d"
1043*ae771770SStanislav Sedov   bias becomes 23
1044*ae771770SStanislav Sedov   next code point to insert is 4EEC
1045*ae771770SStanislav Sedov   needed delta is 154, encodes as "qg"
1046*ae771770SStanislav Sedov   bias becomes 25
1047*ae771770SStanislav Sedov   next code point to insert is 6587
1048*ae771770SStanislav Sedov   needed delta is 46301, encodes as "056p"
1049*ae771770SStanislav Sedov   bias becomes 84
1050*ae771770SStanislav Sedov   next code point to insert is 8BF4
1051*ae771770SStanislav Sedov   needed delta is 88531, encodes as "qjye"
1052*ae771770SStanislav Sedov   bias becomes 90
1053*ae771770SStanislav Sedov   output is "ihqwcrb4cv8a8dqg056pqjye"
1054*ae771770SStanislav Sedov
1055*ae771770SStanislav Sedov
1056*ae771770SStanislav Sedov
1057*ae771770SStanislav Sedov
1058*ae771770SStanislav Sedov
1059*ae771770SStanislav Sedov
1060*ae771770SStanislav Sedov
1061*ae771770SStanislav Sedov
1062*ae771770SStanislav Sedov
1063*ae771770SStanislav Sedov
1064*ae771770SStanislav Sedov
1065*ae771770SStanislav Sedov
1066*ae771770SStanislav SedovCostello                    Standards Track                    [Page 19]
1067*ae771770SStanislav Sedov
1068*ae771770SStanislav SedovRFC 3492                     IDNA Punycode                    March 2003
1069*ae771770SStanislav Sedov
1070*ae771770SStanislav Sedov
1071*ae771770SStanislav Sedov   Encoding trace of example L from section 7.1:
1072*ae771770SStanislav Sedov
1073*ae771770SStanislav Sedov   bias is 72
1074*ae771770SStanislav Sedov   input is:
1075*ae771770SStanislav Sedov   0033 5E74 0042 7D44 91D1 516B 5148 751F
1076*ae771770SStanislav Sedov   basic code points (0033, 0042) are copied to literal portion: "3B-"
1077*ae771770SStanislav Sedov   next code point to insert is 5148
1078*ae771770SStanislav Sedov   needed delta is 62042, encodes as "ww4c"
1079*ae771770SStanislav Sedov   bias becomes 27
1080*ae771770SStanislav Sedov   next code point to insert is 516B
1081*ae771770SStanislav Sedov   needed delta is 139, encodes as "5e"
1082*ae771770SStanislav Sedov   bias becomes 24
1083*ae771770SStanislav Sedov   next code point to insert is 5E74
1084*ae771770SStanislav Sedov   needed delta is 16683, encodes as "180e"
1085*ae771770SStanislav Sedov   bias becomes 67
1086*ae771770SStanislav Sedov   next code point to insert is 751F
1087*ae771770SStanislav Sedov   needed delta is 34821, encodes as "575a"
1088*ae771770SStanislav Sedov   bias becomes 82
1089*ae771770SStanislav Sedov   next code point to insert is 7D44
1090*ae771770SStanislav Sedov   needed delta is 14592, encodes as "65l"
1091*ae771770SStanislav Sedov   bias becomes 67
1092*ae771770SStanislav Sedov   next code point to insert is 91D1
1093*ae771770SStanislav Sedov   needed delta is 42088, encodes as "sy2b"
1094*ae771770SStanislav Sedov   bias becomes 84
1095*ae771770SStanislav Sedov   output is "3B-ww4c5e180e575a65lsy2b"
1096*ae771770SStanislav Sedov
1097*ae771770SStanislav Sedov8. Security Considerations
1098*ae771770SStanislav Sedov
1099*ae771770SStanislav Sedov   Users expect each domain name in DNS to be controlled by a single
1100*ae771770SStanislav Sedov   authority.  If a Unicode string intended for use as a domain label
1101*ae771770SStanislav Sedov   could map to multiple ACE labels, then an internationalized domain
1102*ae771770SStanislav Sedov   name could map to multiple ASCII domain names, each controlled by a
1103*ae771770SStanislav Sedov   different authority, some of which could be spoofs that hijack
1104*ae771770SStanislav Sedov   service requests intended for another.  Therefore Punycode is
1105*ae771770SStanislav Sedov   designed so that each Unicode string has a unique encoding.
1106*ae771770SStanislav Sedov
1107*ae771770SStanislav Sedov   However, there can still be multiple Unicode representations of the
1108*ae771770SStanislav Sedov   "same" text, for various definitions of "same".  This problem is
1109*ae771770SStanislav Sedov   addressed to some extent by the Unicode standard under the topic of
1110*ae771770SStanislav Sedov   canonicalization, and this work is leveraged for domain names by
1111*ae771770SStanislav Sedov   Nameprep [NAMEPREP].
1112*ae771770SStanislav Sedov
1113*ae771770SStanislav Sedov
1114*ae771770SStanislav Sedov
1115*ae771770SStanislav Sedov
1116*ae771770SStanislav Sedov
1117*ae771770SStanislav Sedov
1118*ae771770SStanislav Sedov
1119*ae771770SStanislav Sedov
1120*ae771770SStanislav Sedov
1121*ae771770SStanislav Sedov
1122*ae771770SStanislav SedovCostello                    Standards Track                    [Page 20]
1123*ae771770SStanislav Sedov
1124*ae771770SStanislav SedovRFC 3492                     IDNA Punycode                    March 2003
1125*ae771770SStanislav Sedov
1126*ae771770SStanislav Sedov
1127*ae771770SStanislav Sedov9. References
1128*ae771770SStanislav Sedov
1129*ae771770SStanislav Sedov9.1 Normative References
1130*ae771770SStanislav Sedov
1131*ae771770SStanislav Sedov   [RFC2119]    Bradner, S., "Key words for use in RFCs to Indicate
1132*ae771770SStanislav Sedov                Requirement Levels", BCP 14, RFC 2119, March 1997.
1133*ae771770SStanislav Sedov
1134*ae771770SStanislav Sedov9.2 Informative References
1135*ae771770SStanislav Sedov
1136*ae771770SStanislav Sedov   [RFC952]     Harrenstien, K., Stahl, M. and E. Feinler, "DOD Internet
1137*ae771770SStanislav Sedov                Host Table Specification", RFC 952, October 1985.
1138*ae771770SStanislav Sedov
1139*ae771770SStanislav Sedov   [RFC1034]    Mockapetris, P., "Domain Names - Concepts and
1140*ae771770SStanislav Sedov                Facilities", STD 13, RFC 1034, November 1987.
1141*ae771770SStanislav Sedov
1142*ae771770SStanislav Sedov   [IDNA]       Faltstrom, P., Hoffman, P. and A. Costello,
1143*ae771770SStanislav Sedov                "Internationalizing Domain Names in Applications
1144*ae771770SStanislav Sedov                (IDNA)", RFC 3490, March 2003.
1145*ae771770SStanislav Sedov
1146*ae771770SStanislav Sedov   [NAMEPREP]   Hoffman, P. and  M. Blanchet, "Nameprep: A Stringprep
1147*ae771770SStanislav Sedov                Profile for Internationalized Domain Names (IDN)", RFC
1148*ae771770SStanislav Sedov                3491, March 2003.
1149*ae771770SStanislav Sedov
1150*ae771770SStanislav Sedov   [ASCII]      Cerf, V., "ASCII format for Network Interchange", RFC
1151*ae771770SStanislav Sedov                20, October 1969.
1152*ae771770SStanislav Sedov
1153*ae771770SStanislav Sedov   [PROVINCIAL] Kaplan, M., "The 'anyone can be provincial!' page",
1154*ae771770SStanislav Sedov                http://www.trigeminal.com/samples/provincial.html.
1155*ae771770SStanislav Sedov
1156*ae771770SStanislav Sedov   [UNICODE]    The Unicode Consortium, "The Unicode Standard",
1157*ae771770SStanislav Sedov                http://www.unicode.org/unicode/standard/standard.html.
1158*ae771770SStanislav Sedov
1159*ae771770SStanislav Sedov
1160*ae771770SStanislav Sedov
1161*ae771770SStanislav Sedov
1162*ae771770SStanislav Sedov
1163*ae771770SStanislav Sedov
1164*ae771770SStanislav Sedov
1165*ae771770SStanislav Sedov
1166*ae771770SStanislav Sedov
1167*ae771770SStanislav Sedov
1168*ae771770SStanislav Sedov
1169*ae771770SStanislav Sedov
1170*ae771770SStanislav Sedov
1171*ae771770SStanislav Sedov
1172*ae771770SStanislav Sedov
1173*ae771770SStanislav Sedov
1174*ae771770SStanislav Sedov
1175*ae771770SStanislav Sedov
1176*ae771770SStanislav Sedov
1177*ae771770SStanislav Sedov
1178*ae771770SStanislav SedovCostello                    Standards Track                    [Page 21]
1179*ae771770SStanislav Sedov
1180*ae771770SStanislav SedovRFC 3492                     IDNA Punycode                    March 2003
1181*ae771770SStanislav Sedov
1182*ae771770SStanislav Sedov
1183*ae771770SStanislav SedovA. Mixed-case annotation
1184*ae771770SStanislav Sedov
1185*ae771770SStanislav Sedov   In order to use Punycode to represent case-insensitive strings,
1186*ae771770SStanislav Sedov   higher layers need to case-fold the strings prior to Punycode
1187*ae771770SStanislav Sedov   encoding.  The encoded string can use mixed case as an annotation
1188*ae771770SStanislav Sedov   telling how to convert the folded string into a mixed-case string for
1189*ae771770SStanislav Sedov   display purposes.  Note, however, that mixed-case annotation is not
1190*ae771770SStanislav Sedov   used by the ToASCII and ToUnicode operations specified in [IDNA], and
1191*ae771770SStanislav Sedov   therefore implementors of IDNA can disregard this appendix.
1192*ae771770SStanislav Sedov
1193*ae771770SStanislav Sedov   Basic code points can use mixed case directly, because the decoder
1194*ae771770SStanislav Sedov   copies them verbatim, leaving lowercase code points lowercase, and
1195*ae771770SStanislav Sedov   leaving uppercase code points uppercase.  Each non-basic code point
1196*ae771770SStanislav Sedov   is represented by a delta, which is represented by a sequence of
1197*ae771770SStanislav Sedov   basic code points, the last of which provides the annotation.  If it
1198*ae771770SStanislav Sedov   is uppercase, it is a suggestion to map the non-basic code point to
1199*ae771770SStanislav Sedov   uppercase (if possible); if it is lowercase, it is a suggestion to
1200*ae771770SStanislav Sedov   map the non-basic code point to lowercase (if possible).
1201*ae771770SStanislav Sedov
1202*ae771770SStanislav Sedov   These annotations do not alter the code points returned by decoders;
1203*ae771770SStanislav Sedov   the annotations are returned separately, for the caller to use or
1204*ae771770SStanislav Sedov   ignore.  Encoders can accept annotations in addition to code points,
1205*ae771770SStanislav Sedov   but the annotations do not alter the output, except to influence the
1206*ae771770SStanislav Sedov   uppercase/lowercase form of ASCII letters.
1207*ae771770SStanislav Sedov
1208*ae771770SStanislav Sedov   Punycode encoders and decoders need not support these annotations,
1209*ae771770SStanislav Sedov   and higher layers need not use them.
1210*ae771770SStanislav Sedov
1211*ae771770SStanislav SedovB. Disclaimer and license
1212*ae771770SStanislav Sedov
1213*ae771770SStanislav Sedov   Regarding this entire document or any portion of it (including the
1214*ae771770SStanislav Sedov   pseudocode and C code), the author makes no guarantees and is not
1215*ae771770SStanislav Sedov   responsible for any damage resulting from its use.  The author grants
1216*ae771770SStanislav Sedov   irrevocable permission to anyone to use, modify, and distribute it in
1217*ae771770SStanislav Sedov   any way that does not diminish the rights of anyone else to use,
1218*ae771770SStanislav Sedov   modify, and distribute it, provided that redistributed derivative
1219*ae771770SStanislav Sedov   works do not contain misleading author or version information.
1220*ae771770SStanislav Sedov   Derivative works need not be licensed under similar terms.
1221*ae771770SStanislav Sedov
1222*ae771770SStanislav Sedov
1223*ae771770SStanislav Sedov
1224*ae771770SStanislav Sedov
1225*ae771770SStanislav Sedov
1226*ae771770SStanislav Sedov
1227*ae771770SStanislav Sedov
1228*ae771770SStanislav Sedov
1229*ae771770SStanislav Sedov
1230*ae771770SStanislav Sedov
1231*ae771770SStanislav Sedov
1232*ae771770SStanislav Sedov
1233*ae771770SStanislav Sedov
1234*ae771770SStanislav SedovCostello                    Standards Track                    [Page 22]
1235*ae771770SStanislav Sedov
1236*ae771770SStanislav SedovRFC 3492                     IDNA Punycode                    March 2003
1237*ae771770SStanislav Sedov
1238*ae771770SStanislav Sedov
1239*ae771770SStanislav SedovC. Punycode sample implementation
1240*ae771770SStanislav Sedov
1241*ae771770SStanislav Sedov/*
1242*ae771770SStanislav Sedovpunycode.c from RFC 3492
1243*ae771770SStanislav Sedovhttp://www.nicemice.net/idn/
1244*ae771770SStanislav SedovAdam M. Costello
1245*ae771770SStanislav Sedovhttp://www.nicemice.net/amc/
1246*ae771770SStanislav Sedov
1247*ae771770SStanislav SedovThis is ANSI C code (C89) implementing Punycode (RFC 3492).
1248*ae771770SStanislav Sedov
1249*ae771770SStanislav Sedov*/
1250*ae771770SStanislav Sedov
1251*ae771770SStanislav Sedov
1252*ae771770SStanislav Sedov/************************************************************/
1253*ae771770SStanislav Sedov/* Public interface (would normally go in its own .h file): */
1254*ae771770SStanislav Sedov
1255*ae771770SStanislav Sedov#include <limits.h>
1256*ae771770SStanislav Sedov
1257*ae771770SStanislav Sedovenum punycode_status {
1258*ae771770SStanislav Sedov  punycode_success,
1259*ae771770SStanislav Sedov  punycode_bad_input,   /* Input is invalid.                       */
1260*ae771770SStanislav Sedov  punycode_big_output,  /* Output would exceed the space provided. */
1261*ae771770SStanislav Sedov  punycode_overflow     /* Input needs wider integers to process.  */
1262*ae771770SStanislav Sedov};
1263*ae771770SStanislav Sedov
1264*ae771770SStanislav Sedov#if UINT_MAX >= (1 << 26) - 1
1265*ae771770SStanislav Sedovtypedef unsigned int punycode_uint;
1266*ae771770SStanislav Sedov#else
1267*ae771770SStanislav Sedovtypedef unsigned long punycode_uint;
1268*ae771770SStanislav Sedov#endif
1269*ae771770SStanislav Sedov
1270*ae771770SStanislav Sedovenum punycode_status punycode_encode(
1271*ae771770SStanislav Sedov  punycode_uint input_length,
1272*ae771770SStanislav Sedov  const punycode_uint input[],
1273*ae771770SStanislav Sedov  const unsigned char case_flags[],
1274*ae771770SStanislav Sedov  punycode_uint *output_length,
1275*ae771770SStanislav Sedov  char output[] );
1276*ae771770SStanislav Sedov
1277*ae771770SStanislav Sedov    /* punycode_encode() converts Unicode to Punycode.  The input     */
1278*ae771770SStanislav Sedov    /* is represented as an array of Unicode code points (not code    */
1279*ae771770SStanislav Sedov    /* units; surrogate pairs are not allowed), and the output        */
1280*ae771770SStanislav Sedov    /* will be represented as an array of ASCII code points.  The     */
1281*ae771770SStanislav Sedov    /* output string is *not* null-terminated; it will contain        */
1282*ae771770SStanislav Sedov    /* zeros if and only if the input contains zeros.  (Of course     */
1283*ae771770SStanislav Sedov    /* the caller can leave room for a terminator and add one if      */
1284*ae771770SStanislav Sedov    /* needed.)  The input_length is the number of code points in     */
1285*ae771770SStanislav Sedov    /* the input.  The output_length is an in/out argument: the       */
1286*ae771770SStanislav Sedov    /* caller passes in the maximum number of code points that it     */
1287*ae771770SStanislav Sedov
1288*ae771770SStanislav Sedov
1289*ae771770SStanislav Sedov
1290*ae771770SStanislav SedovCostello                    Standards Track                    [Page 23]
1291*ae771770SStanislav Sedov
1292*ae771770SStanislav SedovRFC 3492                     IDNA Punycode                    March 2003
1293*ae771770SStanislav Sedov
1294*ae771770SStanislav Sedov
1295*ae771770SStanislav Sedov    /* can receive, and on successful return it will contain the      */
1296*ae771770SStanislav Sedov    /* number of code points actually output.  The case_flags array   */
1297*ae771770SStanislav Sedov    /* holds input_length boolean values, where nonzero suggests that */
1298*ae771770SStanislav Sedov    /* the corresponding Unicode character be forced to uppercase     */
1299*ae771770SStanislav Sedov    /* after being decoded (if possible), and zero suggests that      */
1300*ae771770SStanislav Sedov    /* it be forced to lowercase (if possible).  ASCII code points    */
1301*ae771770SStanislav Sedov    /* are encoded literally, except that ASCII letters are forced    */
1302*ae771770SStanislav Sedov    /* to uppercase or lowercase according to the corresponding       */
1303*ae771770SStanislav Sedov    /* uppercase flags.  If case_flags is a null pointer then ASCII   */
1304*ae771770SStanislav Sedov    /* letters are left as they are, and other code points are        */
1305*ae771770SStanislav Sedov    /* treated as if their uppercase flags were zero.  The return     */
1306*ae771770SStanislav Sedov    /* value can be any of the punycode_status values defined above   */
1307*ae771770SStanislav Sedov    /* except punycode_bad_input; if not punycode_success, then       */
1308*ae771770SStanislav Sedov    /* output_size and output might contain garbage.                  */
1309*ae771770SStanislav Sedov
1310*ae771770SStanislav Sedovenum punycode_status punycode_decode(
1311*ae771770SStanislav Sedov  punycode_uint input_length,
1312*ae771770SStanislav Sedov  const char input[],
1313*ae771770SStanislav Sedov  punycode_uint *output_length,
1314*ae771770SStanislav Sedov  punycode_uint output[],
1315*ae771770SStanislav Sedov  unsigned char case_flags[] );
1316*ae771770SStanislav Sedov
1317*ae771770SStanislav Sedov    /* punycode_decode() converts Punycode to Unicode.  The input is  */
1318*ae771770SStanislav Sedov    /* represented as an array of ASCII code points, and the output   */
1319*ae771770SStanislav Sedov    /* will be represented as an array of Unicode code points.  The   */
1320*ae771770SStanislav Sedov    /* input_length is the number of code points in the input.  The   */
1321*ae771770SStanislav Sedov    /* output_length is an in/out argument: the caller passes in      */
1322*ae771770SStanislav Sedov    /* the maximum number of code points that it can receive, and     */
1323*ae771770SStanislav Sedov    /* on successful return it will contain the actual number of      */
1324*ae771770SStanislav Sedov    /* code points output.  The case_flags array needs room for at    */
1325*ae771770SStanislav Sedov    /* least output_length values, or it can be a null pointer if the */
1326*ae771770SStanislav Sedov    /* case information is not needed.  A nonzero flag suggests that  */
1327*ae771770SStanislav Sedov    /* the corresponding Unicode character be forced to uppercase     */
1328*ae771770SStanislav Sedov    /* by the caller (if possible), while zero suggests that it be    */
1329*ae771770SStanislav Sedov    /* forced to lowercase (if possible).  ASCII code points are      */
1330*ae771770SStanislav Sedov    /* output already in the proper case, but their flags will be set */
1331*ae771770SStanislav Sedov    /* appropriately so that applying the flags would be harmless.    */
1332*ae771770SStanislav Sedov    /* The return value can be any of the punycode_status values      */
1333*ae771770SStanislav Sedov    /* defined above; if not punycode_success, then output_length,    */
1334*ae771770SStanislav Sedov    /* output, and case_flags might contain garbage.  On success, the */
1335*ae771770SStanislav Sedov    /* decoder will never need to write an output_length greater than */
1336*ae771770SStanislav Sedov    /* input_length, because of how the encoding is defined.          */
1337*ae771770SStanislav Sedov
1338*ae771770SStanislav Sedov/**********************************************************/
1339*ae771770SStanislav Sedov/* Implementation (would normally go in its own .c file): */
1340*ae771770SStanislav Sedov
1341*ae771770SStanislav Sedov#include <string.h>
1342*ae771770SStanislav Sedov
1343*ae771770SStanislav Sedov
1344*ae771770SStanislav Sedov
1345*ae771770SStanislav Sedov
1346*ae771770SStanislav SedovCostello                    Standards Track                    [Page 24]
1347*ae771770SStanislav Sedov
1348*ae771770SStanislav SedovRFC 3492                     IDNA Punycode                    March 2003
1349*ae771770SStanislav Sedov
1350*ae771770SStanislav Sedov
1351*ae771770SStanislav Sedov/*** Bootstring parameters for Punycode ***/
1352*ae771770SStanislav Sedov
1353*ae771770SStanislav Sedovenum { base = 36, tmin = 1, tmax = 26, skew = 38, damp = 700,
1354*ae771770SStanislav Sedov       initial_bias = 72, initial_n = 0x80, delimiter = 0x2D };
1355*ae771770SStanislav Sedov
1356*ae771770SStanislav Sedov/* basic(cp) tests whether cp is a basic code point: */
1357*ae771770SStanislav Sedov#define basic(cp) ((punycode_uint)(cp) < 0x80)
1358*ae771770SStanislav Sedov
1359*ae771770SStanislav Sedov/* delim(cp) tests whether cp is a delimiter: */
1360*ae771770SStanislav Sedov#define delim(cp) ((cp) == delimiter)
1361*ae771770SStanislav Sedov
1362*ae771770SStanislav Sedov/* decode_digit(cp) returns the numeric value of a basic code */
1363*ae771770SStanislav Sedov/* point (for use in representing integers) in the range 0 to */
1364*ae771770SStanislav Sedov/* base-1, or base if cp is does not represent a value.       */
1365*ae771770SStanislav Sedov
1366*ae771770SStanislav Sedovstatic punycode_uint decode_digit(punycode_uint cp)
1367*ae771770SStanislav Sedov{
1368*ae771770SStanislav Sedov  return  cp - 48 < 10 ? cp - 22 :  cp - 65 < 26 ? cp - 65 :
1369*ae771770SStanislav Sedov          cp - 97 < 26 ? cp - 97 :  base;
1370*ae771770SStanislav Sedov}
1371*ae771770SStanislav Sedov
1372*ae771770SStanislav Sedov/* encode_digit(d,flag) returns the basic code point whose value      */
1373*ae771770SStanislav Sedov/* (when used for representing integers) is d, which needs to be in   */
1374*ae771770SStanislav Sedov/* the range 0 to base-1.  The lowercase form is used unless flag is  */
1375*ae771770SStanislav Sedov/* nonzero, in which case the uppercase form is used.  The behavior   */
1376*ae771770SStanislav Sedov/* is undefined if flag is nonzero and digit d has no uppercase form. */
1377*ae771770SStanislav Sedov
1378*ae771770SStanislav Sedovstatic char encode_digit(punycode_uint d, int flag)
1379*ae771770SStanislav Sedov{
1380*ae771770SStanislav Sedov  return d + 22 + 75 * (d < 26) - ((flag != 0) << 5);
1381*ae771770SStanislav Sedov  /*  0..25 map to ASCII a..z or A..Z */
1382*ae771770SStanislav Sedov  /* 26..35 map to ASCII 0..9         */
1383*ae771770SStanislav Sedov}
1384*ae771770SStanislav Sedov
1385*ae771770SStanislav Sedov/* flagged(bcp) tests whether a basic code point is flagged */
1386*ae771770SStanislav Sedov/* (uppercase).  The behavior is undefined if bcp is not a  */
1387*ae771770SStanislav Sedov/* basic code point.                                        */
1388*ae771770SStanislav Sedov
1389*ae771770SStanislav Sedov#define flagged(bcp) ((punycode_uint)(bcp) - 65 < 26)
1390*ae771770SStanislav Sedov
1391*ae771770SStanislav Sedov/* encode_basic(bcp,flag) forces a basic code point to lowercase */
1392*ae771770SStanislav Sedov/* if flag is zero, uppercase if flag is nonzero, and returns    */
1393*ae771770SStanislav Sedov/* the resulting code point.  The code point is unchanged if it  */
1394*ae771770SStanislav Sedov/* is caseless.  The behavior is undefined if bcp is not a basic */
1395*ae771770SStanislav Sedov/* code point.                                                   */
1396*ae771770SStanislav Sedov
1397*ae771770SStanislav Sedovstatic char encode_basic(punycode_uint bcp, int flag)
1398*ae771770SStanislav Sedov{
1399*ae771770SStanislav Sedov
1400*ae771770SStanislav Sedov
1401*ae771770SStanislav Sedov
1402*ae771770SStanislav SedovCostello                    Standards Track                    [Page 25]
1403*ae771770SStanislav Sedov
1404*ae771770SStanislav SedovRFC 3492                     IDNA Punycode                    March 2003
1405*ae771770SStanislav Sedov
1406*ae771770SStanislav Sedov
1407*ae771770SStanislav Sedov  bcp -= (bcp - 97 < 26) << 5;
1408*ae771770SStanislav Sedov  return bcp + ((!flag && (bcp - 65 < 26)) << 5);
1409*ae771770SStanislav Sedov}
1410*ae771770SStanislav Sedov
1411*ae771770SStanislav Sedov/*** Platform-specific constants ***/
1412*ae771770SStanislav Sedov
1413*ae771770SStanislav Sedov/* maxint is the maximum value of a punycode_uint variable: */
1414*ae771770SStanislav Sedovstatic const punycode_uint maxint = -1;
1415*ae771770SStanislav Sedov/* Because maxint is unsigned, -1 becomes the maximum value. */
1416*ae771770SStanislav Sedov
1417*ae771770SStanislav Sedov/*** Bias adaptation function ***/
1418*ae771770SStanislav Sedov
1419*ae771770SStanislav Sedovstatic punycode_uint adapt(
1420*ae771770SStanislav Sedov  punycode_uint delta, punycode_uint numpoints, int firsttime )
1421*ae771770SStanislav Sedov{
1422*ae771770SStanislav Sedov  punycode_uint k;
1423*ae771770SStanislav Sedov
1424*ae771770SStanislav Sedov  delta = firsttime ? delta / damp : delta >> 1;
1425*ae771770SStanislav Sedov  /* delta >> 1 is a faster way of doing delta / 2 */
1426*ae771770SStanislav Sedov  delta += delta / numpoints;
1427*ae771770SStanislav Sedov
1428*ae771770SStanislav Sedov  for (k = 0;  delta > ((base - tmin) * tmax) / 2;  k += base) {
1429*ae771770SStanislav Sedov    delta /= base - tmin;
1430*ae771770SStanislav Sedov  }
1431*ae771770SStanislav Sedov
1432*ae771770SStanislav Sedov  return k + (base - tmin + 1) * delta / (delta + skew);
1433*ae771770SStanislav Sedov}
1434*ae771770SStanislav Sedov
1435*ae771770SStanislav Sedov/*** Main encode function ***/
1436*ae771770SStanislav Sedov
1437*ae771770SStanislav Sedovenum punycode_status punycode_encode(
1438*ae771770SStanislav Sedov  punycode_uint input_length,
1439*ae771770SStanislav Sedov  const punycode_uint input[],
1440*ae771770SStanislav Sedov  const unsigned char case_flags[],
1441*ae771770SStanislav Sedov  punycode_uint *output_length,
1442*ae771770SStanislav Sedov  char output[] )
1443*ae771770SStanislav Sedov{
1444*ae771770SStanislav Sedov  punycode_uint n, delta, h, b, out, max_out, bias, j, m, q, k, t;
1445*ae771770SStanislav Sedov
1446*ae771770SStanislav Sedov  /* Initialize the state: */
1447*ae771770SStanislav Sedov
1448*ae771770SStanislav Sedov  n = initial_n;
1449*ae771770SStanislav Sedov  delta = out = 0;
1450*ae771770SStanislav Sedov  max_out = *output_length;
1451*ae771770SStanislav Sedov  bias = initial_bias;
1452*ae771770SStanislav Sedov
1453*ae771770SStanislav Sedov  /* Handle the basic code points: */
1454*ae771770SStanislav Sedov
1455*ae771770SStanislav Sedov
1456*ae771770SStanislav Sedov
1457*ae771770SStanislav Sedov
1458*ae771770SStanislav SedovCostello                    Standards Track                    [Page 26]
1459*ae771770SStanislav Sedov
1460*ae771770SStanislav SedovRFC 3492                     IDNA Punycode                    March 2003
1461*ae771770SStanislav Sedov
1462*ae771770SStanislav Sedov
1463*ae771770SStanislav Sedov  for (j = 0;  j < input_length;  ++j) {
1464*ae771770SStanislav Sedov    if (basic(input[j])) {
1465*ae771770SStanislav Sedov      if (max_out - out < 2) return punycode_big_output;
1466*ae771770SStanislav Sedov      output[out++] =
1467*ae771770SStanislav Sedov        case_flags ?  encode_basic(input[j], case_flags[j]) : input[j];
1468*ae771770SStanislav Sedov    }
1469*ae771770SStanislav Sedov    /* else if (input[j] < n) return punycode_bad_input; */
1470*ae771770SStanislav Sedov    /* (not needed for Punycode with unsigned code points) */
1471*ae771770SStanislav Sedov  }
1472*ae771770SStanislav Sedov
1473*ae771770SStanislav Sedov  h = b = out;
1474*ae771770SStanislav Sedov
1475*ae771770SStanislav Sedov  /* h is the number of code points that have been handled, b is the  */
1476*ae771770SStanislav Sedov  /* number of basic code points, and out is the number of characters */
1477*ae771770SStanislav Sedov  /* that have been output.                                           */
1478*ae771770SStanislav Sedov
1479*ae771770SStanislav Sedov  if (b > 0) output[out++] = delimiter;
1480*ae771770SStanislav Sedov
1481*ae771770SStanislav Sedov  /* Main encoding loop: */
1482*ae771770SStanislav Sedov
1483*ae771770SStanislav Sedov  while (h < input_length) {
1484*ae771770SStanislav Sedov    /* All non-basic code points < n have been     */
1485*ae771770SStanislav Sedov    /* handled already.  Find the next larger one: */
1486*ae771770SStanislav Sedov
1487*ae771770SStanislav Sedov    for (m = maxint, j = 0;  j < input_length;  ++j) {
1488*ae771770SStanislav Sedov      /* if (basic(input[j])) continue; */
1489*ae771770SStanislav Sedov      /* (not needed for Punycode) */
1490*ae771770SStanislav Sedov      if (input[j] >= n && input[j] < m) m = input[j];
1491*ae771770SStanislav Sedov    }
1492*ae771770SStanislav Sedov
1493*ae771770SStanislav Sedov    /* Increase delta enough to advance the decoder's    */
1494*ae771770SStanislav Sedov    /* <n,i> state to <m,0>, but guard against overflow: */
1495*ae771770SStanislav Sedov
1496*ae771770SStanislav Sedov    if (m - n > (maxint - delta) / (h + 1)) return punycode_overflow;
1497*ae771770SStanislav Sedov    delta += (m - n) * (h + 1);
1498*ae771770SStanislav Sedov    n = m;
1499*ae771770SStanislav Sedov
1500*ae771770SStanislav Sedov    for (j = 0;  j < input_length;  ++j) {
1501*ae771770SStanislav Sedov      /* Punycode does not need to check whether input[j] is basic: */
1502*ae771770SStanislav Sedov      if (input[j] < n /* || basic(input[j]) */ ) {
1503*ae771770SStanislav Sedov        if (++delta == 0) return punycode_overflow;
1504*ae771770SStanislav Sedov      }
1505*ae771770SStanislav Sedov
1506*ae771770SStanislav Sedov      if (input[j] == n) {
1507*ae771770SStanislav Sedov        /* Represent delta as a generalized variable-length integer: */
1508*ae771770SStanislav Sedov
1509*ae771770SStanislav Sedov        for (q = delta, k = base;  ;  k += base) {
1510*ae771770SStanislav Sedov          if (out >= max_out) return punycode_big_output;
1511*ae771770SStanislav Sedov
1512*ae771770SStanislav Sedov
1513*ae771770SStanislav Sedov
1514*ae771770SStanislav SedovCostello                    Standards Track                    [Page 27]
1515*ae771770SStanislav Sedov
1516*ae771770SStanislav SedovRFC 3492                     IDNA Punycode                    March 2003
1517*ae771770SStanislav Sedov
1518*ae771770SStanislav Sedov
1519*ae771770SStanislav Sedov          t = k <= bias /* + tmin */ ? tmin :     /* +tmin not needed */
1520*ae771770SStanislav Sedov              k >= bias + tmax ? tmax : k - bias;
1521*ae771770SStanislav Sedov          if (q < t) break;
1522*ae771770SStanislav Sedov          output[out++] = encode_digit(t + (q - t) % (base - t), 0);
1523*ae771770SStanislav Sedov          q = (q - t) / (base - t);
1524*ae771770SStanislav Sedov        }
1525*ae771770SStanislav Sedov
1526*ae771770SStanislav Sedov        output[out++] = encode_digit(q, case_flags && case_flags[j]);
1527*ae771770SStanislav Sedov        bias = adapt(delta, h + 1, h == b);
1528*ae771770SStanislav Sedov        delta = 0;
1529*ae771770SStanislav Sedov        ++h;
1530*ae771770SStanislav Sedov      }
1531*ae771770SStanislav Sedov    }
1532*ae771770SStanislav Sedov
1533*ae771770SStanislav Sedov    ++delta, ++n;
1534*ae771770SStanislav Sedov  }
1535*ae771770SStanislav Sedov
1536*ae771770SStanislav Sedov  *output_length = out;
1537*ae771770SStanislav Sedov  return punycode_success;
1538*ae771770SStanislav Sedov}
1539*ae771770SStanislav Sedov
1540*ae771770SStanislav Sedov/*** Main decode function ***/
1541*ae771770SStanislav Sedov
1542*ae771770SStanislav Sedovenum punycode_status punycode_decode(
1543*ae771770SStanislav Sedov  punycode_uint input_length,
1544*ae771770SStanislav Sedov  const char input[],
1545*ae771770SStanislav Sedov  punycode_uint *output_length,
1546*ae771770SStanislav Sedov  punycode_uint output[],
1547*ae771770SStanislav Sedov  unsigned char case_flags[] )
1548*ae771770SStanislav Sedov{
1549*ae771770SStanislav Sedov  punycode_uint n, out, i, max_out, bias,
1550*ae771770SStanislav Sedov                 b, j, in, oldi, w, k, digit, t;
1551*ae771770SStanislav Sedov
1552*ae771770SStanislav Sedov  /* Initialize the state: */
1553*ae771770SStanislav Sedov
1554*ae771770SStanislav Sedov  n = initial_n;
1555*ae771770SStanislav Sedov  out = i = 0;
1556*ae771770SStanislav Sedov  max_out = *output_length;
1557*ae771770SStanislav Sedov  bias = initial_bias;
1558*ae771770SStanislav Sedov
1559*ae771770SStanislav Sedov  /* Handle the basic code points:  Let b be the number of input code */
1560*ae771770SStanislav Sedov  /* points before the last delimiter, or 0 if there is none, then    */
1561*ae771770SStanislav Sedov  /* copy the first b code points to the output.                      */
1562*ae771770SStanislav Sedov
1563*ae771770SStanislav Sedov  for (b = j = 0;  j < input_length;  ++j) if (delim(input[j])) b = j;
1564*ae771770SStanislav Sedov  if (b > max_out) return punycode_big_output;
1565*ae771770SStanislav Sedov
1566*ae771770SStanislav Sedov  for (j = 0;  j < b;  ++j) {
1567*ae771770SStanislav Sedov
1568*ae771770SStanislav Sedov
1569*ae771770SStanislav Sedov
1570*ae771770SStanislav SedovCostello                    Standards Track                    [Page 28]
1571*ae771770SStanislav Sedov
1572*ae771770SStanislav SedovRFC 3492                     IDNA Punycode                    March 2003
1573*ae771770SStanislav Sedov
1574*ae771770SStanislav Sedov
1575*ae771770SStanislav Sedov    if (case_flags) case_flags[out] = flagged(input[j]);
1576*ae771770SStanislav Sedov    if (!basic(input[j])) return punycode_bad_input;
1577*ae771770SStanislav Sedov    output[out++] = input[j];
1578*ae771770SStanislav Sedov  }
1579*ae771770SStanislav Sedov
1580*ae771770SStanislav Sedov  /* Main decoding loop:  Start just after the last delimiter if any  */
1581*ae771770SStanislav Sedov  /* basic code points were copied; start at the beginning otherwise. */
1582*ae771770SStanislav Sedov
1583*ae771770SStanislav Sedov  for (in = b > 0 ? b + 1 : 0;  in < input_length;  ++out) {
1584*ae771770SStanislav Sedov
1585*ae771770SStanislav Sedov    /* in is the index of the next character to be consumed, and */
1586*ae771770SStanislav Sedov    /* out is the number of code points in the output array.     */
1587*ae771770SStanislav Sedov
1588*ae771770SStanislav Sedov    /* Decode a generalized variable-length integer into delta,  */
1589*ae771770SStanislav Sedov    /* which gets added to i.  The overflow checking is easier   */
1590*ae771770SStanislav Sedov    /* if we increase i as we go, then subtract off its starting */
1591*ae771770SStanislav Sedov    /* value at the end to obtain delta.                         */
1592*ae771770SStanislav Sedov
1593*ae771770SStanislav Sedov    for (oldi = i, w = 1, k = base;  ;  k += base) {
1594*ae771770SStanislav Sedov      if (in >= input_length) return punycode_bad_input;
1595*ae771770SStanislav Sedov      digit = decode_digit(input[in++]);
1596*ae771770SStanislav Sedov      if (digit >= base) return punycode_bad_input;
1597*ae771770SStanislav Sedov      if (digit > (maxint - i) / w) return punycode_overflow;
1598*ae771770SStanislav Sedov      i += digit * w;
1599*ae771770SStanislav Sedov      t = k <= bias /* + tmin */ ? tmin :     /* +tmin not needed */
1600*ae771770SStanislav Sedov          k >= bias + tmax ? tmax : k - bias;
1601*ae771770SStanislav Sedov      if (digit < t) break;
1602*ae771770SStanislav Sedov      if (w > maxint / (base - t)) return punycode_overflow;
1603*ae771770SStanislav Sedov      w *= (base - t);
1604*ae771770SStanislav Sedov    }
1605*ae771770SStanislav Sedov
1606*ae771770SStanislav Sedov    bias = adapt(i - oldi, out + 1, oldi == 0);
1607*ae771770SStanislav Sedov
1608*ae771770SStanislav Sedov    /* i was supposed to wrap around from out+1 to 0,   */
1609*ae771770SStanislav Sedov    /* incrementing n each time, so we'll fix that now: */
1610*ae771770SStanislav Sedov
1611*ae771770SStanislav Sedov    if (i / (out + 1) > maxint - n) return punycode_overflow;
1612*ae771770SStanislav Sedov    n += i / (out + 1);
1613*ae771770SStanislav Sedov    i %= (out + 1);
1614*ae771770SStanislav Sedov
1615*ae771770SStanislav Sedov    /* Insert n at position i of the output: */
1616*ae771770SStanislav Sedov
1617*ae771770SStanislav Sedov    /* not needed for Punycode: */
1618*ae771770SStanislav Sedov    /* if (decode_digit(n) <= base) return punycode_invalid_input; */
1619*ae771770SStanislav Sedov    if (out >= max_out) return punycode_big_output;
1620*ae771770SStanislav Sedov
1621*ae771770SStanislav Sedov    if (case_flags) {
1622*ae771770SStanislav Sedov      memmove(case_flags + i + 1, case_flags + i, out - i);
1623*ae771770SStanislav Sedov
1624*ae771770SStanislav Sedov
1625*ae771770SStanislav Sedov
1626*ae771770SStanislav SedovCostello                    Standards Track                    [Page 29]
1627*ae771770SStanislav Sedov
1628*ae771770SStanislav SedovRFC 3492                     IDNA Punycode                    March 2003
1629*ae771770SStanislav Sedov
1630*ae771770SStanislav Sedov
1631*ae771770SStanislav Sedov      /* Case of last character determines uppercase flag: */
1632*ae771770SStanislav Sedov      case_flags[i] = flagged(input[in - 1]);
1633*ae771770SStanislav Sedov    }
1634*ae771770SStanislav Sedov
1635*ae771770SStanislav Sedov    memmove(output + i + 1, output + i, (out - i) * sizeof *output);
1636*ae771770SStanislav Sedov    output[i++] = n;
1637*ae771770SStanislav Sedov  }
1638*ae771770SStanislav Sedov
1639*ae771770SStanislav Sedov  *output_length = out;
1640*ae771770SStanislav Sedov  return punycode_success;
1641*ae771770SStanislav Sedov}
1642*ae771770SStanislav Sedov
1643*ae771770SStanislav Sedov/******************************************************************/
1644*ae771770SStanislav Sedov/* Wrapper for testing (would normally go in a separate .c file): */
1645*ae771770SStanislav Sedov
1646*ae771770SStanislav Sedov#include <assert.h>
1647*ae771770SStanislav Sedov#include <stdio.h>
1648*ae771770SStanislav Sedov#include <stdlib.h>
1649*ae771770SStanislav Sedov#include <string.h>
1650*ae771770SStanislav Sedov
1651*ae771770SStanislav Sedov/* For testing, we'll just set some compile-time limits rather than */
1652*ae771770SStanislav Sedov/* use malloc(), and set a compile-time option rather than using a  */
1653*ae771770SStanislav Sedov/* command-line option.                                             */
1654*ae771770SStanislav Sedov
1655*ae771770SStanislav Sedovenum {
1656*ae771770SStanislav Sedov  unicode_max_length = 256,
1657*ae771770SStanislav Sedov  ace_max_length = 256
1658*ae771770SStanislav Sedov};
1659*ae771770SStanislav Sedov
1660*ae771770SStanislav Sedovstatic void usage(char **argv)
1661*ae771770SStanislav Sedov{
1662*ae771770SStanislav Sedov  fprintf(stderr,
1663*ae771770SStanislav Sedov    "\n"
1664*ae771770SStanislav Sedov    "%s -e reads code points and writes a Punycode string.\n"
1665*ae771770SStanislav Sedov    "%s -d reads a Punycode string and writes code points.\n"
1666*ae771770SStanislav Sedov    "\n"
1667*ae771770SStanislav Sedov    "Input and output are plain text in the native character set.\n"
1668*ae771770SStanislav Sedov    "Code points are in the form u+hex separated by whitespace.\n"
1669*ae771770SStanislav Sedov    "Although the specification allows Punycode strings to contain\n"
1670*ae771770SStanislav Sedov    "any characters from the ASCII repertoire, this test code\n"
1671*ae771770SStanislav Sedov    "supports only the printable characters, and needs the Punycode\n"
1672*ae771770SStanislav Sedov    "string to be followed by a newline.\n"
1673*ae771770SStanislav Sedov    "The case of the u in u+hex is the force-to-uppercase flag.\n"
1674*ae771770SStanislav Sedov    , argv[0], argv[0]);
1675*ae771770SStanislav Sedov  exit(EXIT_FAILURE);
1676*ae771770SStanislav Sedov}
1677*ae771770SStanislav Sedov
1678*ae771770SStanislav Sedovstatic void fail(const char *msg)
1679*ae771770SStanislav Sedov
1680*ae771770SStanislav Sedov
1681*ae771770SStanislav Sedov
1682*ae771770SStanislav SedovCostello                    Standards Track                    [Page 30]
1683*ae771770SStanislav Sedov
1684*ae771770SStanislav SedovRFC 3492                     IDNA Punycode                    March 2003
1685*ae771770SStanislav Sedov
1686*ae771770SStanislav Sedov
1687*ae771770SStanislav Sedov{
1688*ae771770SStanislav Sedov  fputs(msg,stderr);
1689*ae771770SStanislav Sedov  exit(EXIT_FAILURE);
1690*ae771770SStanislav Sedov}
1691*ae771770SStanislav Sedov
1692*ae771770SStanislav Sedovstatic const char too_big[] =
1693*ae771770SStanislav Sedov  "input or output is too large, recompile with larger limits\n";
1694*ae771770SStanislav Sedovstatic const char invalid_input[] = "invalid input\n";
1695*ae771770SStanislav Sedovstatic const char overflow[] = "arithmetic overflow\n";
1696*ae771770SStanislav Sedovstatic const char io_error[] = "I/O error\n";
1697*ae771770SStanislav Sedov
1698*ae771770SStanislav Sedov/* The following string is used to convert printable */
1699*ae771770SStanislav Sedov/* characters between ASCII and the native charset:  */
1700*ae771770SStanislav Sedov
1701*ae771770SStanislav Sedovstatic const char print_ascii[] =
1702*ae771770SStanislav Sedov  "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n"
1703*ae771770SStanislav Sedov  "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n"
1704*ae771770SStanislav Sedov  " !\"#$%&'()*+,-./"
1705*ae771770SStanislav Sedov  "0123456789:;<=>?"
1706*ae771770SStanislav Sedov  "@ABCDEFGHIJKLMNO"
1707*ae771770SStanislav Sedov  "PQRSTUVWXYZ[\\]^_"
1708*ae771770SStanislav Sedov  "`abcdefghijklmno"
1709*ae771770SStanislav Sedov  "pqrstuvwxyz{|}~\n";
1710*ae771770SStanislav Sedov
1711*ae771770SStanislav Sedovint main(int argc, char **argv)
1712*ae771770SStanislav Sedov{
1713*ae771770SStanislav Sedov  enum punycode_status status;
1714*ae771770SStanislav Sedov  int r;
1715*ae771770SStanislav Sedov  unsigned int input_length, output_length, j;
1716*ae771770SStanislav Sedov  unsigned char case_flags[unicode_max_length];
1717*ae771770SStanislav Sedov
1718*ae771770SStanislav Sedov  if (argc != 2) usage(argv);
1719*ae771770SStanislav Sedov  if (argv[1][0] != '-') usage(argv);
1720*ae771770SStanislav Sedov  if (argv[1][2] != 0) usage(argv);
1721*ae771770SStanislav Sedov
1722*ae771770SStanislav Sedov  if (argv[1][1] == 'e') {
1723*ae771770SStanislav Sedov    punycode_uint input[unicode_max_length];
1724*ae771770SStanislav Sedov    unsigned long codept;
1725*ae771770SStanislav Sedov    char output[ace_max_length+1], uplus[3];
1726*ae771770SStanislav Sedov    int c;
1727*ae771770SStanislav Sedov
1728*ae771770SStanislav Sedov    /* Read the input code points: */
1729*ae771770SStanislav Sedov
1730*ae771770SStanislav Sedov    input_length = 0;
1731*ae771770SStanislav Sedov
1732*ae771770SStanislav Sedov    for (;;) {
1733*ae771770SStanislav Sedov      r = scanf("%2s%lx", uplus, &codept);
1734*ae771770SStanislav Sedov      if (ferror(stdin)) fail(io_error);
1735*ae771770SStanislav Sedov
1736*ae771770SStanislav Sedov
1737*ae771770SStanislav Sedov
1738*ae771770SStanislav SedovCostello                    Standards Track                    [Page 31]
1739*ae771770SStanislav Sedov
1740*ae771770SStanislav SedovRFC 3492                     IDNA Punycode                    March 2003
1741*ae771770SStanislav Sedov
1742*ae771770SStanislav Sedov
1743*ae771770SStanislav Sedov      if (r == EOF || r == 0) break;
1744*ae771770SStanislav Sedov
1745*ae771770SStanislav Sedov      if (r != 2 || uplus[1] != '+' || codept > (punycode_uint)-1) {
1746*ae771770SStanislav Sedov        fail(invalid_input);
1747*ae771770SStanislav Sedov      }
1748*ae771770SStanislav Sedov
1749*ae771770SStanislav Sedov      if (input_length == unicode_max_length) fail(too_big);
1750*ae771770SStanislav Sedov
1751*ae771770SStanislav Sedov      if (uplus[0] == 'u') case_flags[input_length] = 0;
1752*ae771770SStanislav Sedov      else if (uplus[0] == 'U') case_flags[input_length] = 1;
1753*ae771770SStanislav Sedov      else fail(invalid_input);
1754*ae771770SStanislav Sedov
1755*ae771770SStanislav Sedov      input[input_length++] = codept;
1756*ae771770SStanislav Sedov    }
1757*ae771770SStanislav Sedov
1758*ae771770SStanislav Sedov    /* Encode: */
1759*ae771770SStanislav Sedov
1760*ae771770SStanislav Sedov    output_length = ace_max_length;
1761*ae771770SStanislav Sedov    status = punycode_encode(input_length, input, case_flags,
1762*ae771770SStanislav Sedov                             &output_length, output);
1763*ae771770SStanislav Sedov    if (status == punycode_bad_input) fail(invalid_input);
1764*ae771770SStanislav Sedov    if (status == punycode_big_output) fail(too_big);
1765*ae771770SStanislav Sedov    if (status == punycode_overflow) fail(overflow);
1766*ae771770SStanislav Sedov    assert(status == punycode_success);
1767*ae771770SStanislav Sedov
1768*ae771770SStanislav Sedov    /* Convert to native charset and output: */
1769*ae771770SStanislav Sedov
1770*ae771770SStanislav Sedov    for (j = 0;  j < output_length;  ++j) {
1771*ae771770SStanislav Sedov      c = output[j];
1772*ae771770SStanislav Sedov      assert(c >= 0 && c <= 127);
1773*ae771770SStanislav Sedov      if (print_ascii[c] == 0) fail(invalid_input);
1774*ae771770SStanislav Sedov      output[j] = print_ascii[c];
1775*ae771770SStanislav Sedov    }
1776*ae771770SStanislav Sedov
1777*ae771770SStanislav Sedov    output[j] = 0;
1778*ae771770SStanislav Sedov    r = puts(output);
1779*ae771770SStanislav Sedov    if (r == EOF) fail(io_error);
1780*ae771770SStanislav Sedov    return EXIT_SUCCESS;
1781*ae771770SStanislav Sedov  }
1782*ae771770SStanislav Sedov
1783*ae771770SStanislav Sedov  if (argv[1][1] == 'd') {
1784*ae771770SStanislav Sedov    char input[ace_max_length+2], *p, *pp;
1785*ae771770SStanislav Sedov    punycode_uint output[unicode_max_length];
1786*ae771770SStanislav Sedov
1787*ae771770SStanislav Sedov    /* Read the Punycode input string and convert to ASCII: */
1788*ae771770SStanislav Sedov
1789*ae771770SStanislav Sedov    fgets(input, ace_max_length+2, stdin);
1790*ae771770SStanislav Sedov    if (ferror(stdin)) fail(io_error);
1791*ae771770SStanislav Sedov
1792*ae771770SStanislav Sedov
1793*ae771770SStanislav Sedov
1794*ae771770SStanislav SedovCostello                    Standards Track                    [Page 32]
1795*ae771770SStanislav Sedov
1796*ae771770SStanislav SedovRFC 3492                     IDNA Punycode                    March 2003
1797*ae771770SStanislav Sedov
1798*ae771770SStanislav Sedov
1799*ae771770SStanislav Sedov    if (feof(stdin)) fail(invalid_input);
1800*ae771770SStanislav Sedov    input_length = strlen(input) - 1;
1801*ae771770SStanislav Sedov    if (input[input_length] != '\n') fail(too_big);
1802*ae771770SStanislav Sedov    input[input_length] = 0;
1803*ae771770SStanislav Sedov
1804*ae771770SStanislav Sedov    for (p = input;  *p != 0;  ++p) {
1805*ae771770SStanislav Sedov      pp = strchr(print_ascii, *p);
1806*ae771770SStanislav Sedov      if (pp == 0) fail(invalid_input);
1807*ae771770SStanislav Sedov      *p = pp - print_ascii;
1808*ae771770SStanislav Sedov    }
1809*ae771770SStanislav Sedov
1810*ae771770SStanislav Sedov    /* Decode: */
1811*ae771770SStanislav Sedov
1812*ae771770SStanislav Sedov    output_length = unicode_max_length;
1813*ae771770SStanislav Sedov    status = punycode_decode(input_length, input, &output_length,
1814*ae771770SStanislav Sedov                             output, case_flags);
1815*ae771770SStanislav Sedov    if (status == punycode_bad_input) fail(invalid_input);
1816*ae771770SStanislav Sedov    if (status == punycode_big_output) fail(too_big);
1817*ae771770SStanislav Sedov    if (status == punycode_overflow) fail(overflow);
1818*ae771770SStanislav Sedov    assert(status == punycode_success);
1819*ae771770SStanislav Sedov
1820*ae771770SStanislav Sedov    /* Output the result: */
1821*ae771770SStanislav Sedov
1822*ae771770SStanislav Sedov    for (j = 0;  j < output_length;  ++j) {
1823*ae771770SStanislav Sedov      r = printf("%s+%04lX\n",
1824*ae771770SStanislav Sedov                 case_flags[j] ? "U" : "u",
1825*ae771770SStanislav Sedov                 (unsigned long) output[j] );
1826*ae771770SStanislav Sedov      if (r < 0) fail(io_error);
1827*ae771770SStanislav Sedov    }
1828*ae771770SStanislav Sedov
1829*ae771770SStanislav Sedov    return EXIT_SUCCESS;
1830*ae771770SStanislav Sedov  }
1831*ae771770SStanislav Sedov
1832*ae771770SStanislav Sedov  usage(argv);
1833*ae771770SStanislav Sedov  return EXIT_SUCCESS;  /* not reached, but quiets compiler warning */
1834*ae771770SStanislav Sedov}
1835*ae771770SStanislav Sedov
1836*ae771770SStanislav Sedov
1837*ae771770SStanislav Sedov
1838*ae771770SStanislav Sedov
1839*ae771770SStanislav Sedov
1840*ae771770SStanislav Sedov
1841*ae771770SStanislav Sedov
1842*ae771770SStanislav Sedov
1843*ae771770SStanislav Sedov
1844*ae771770SStanislav Sedov
1845*ae771770SStanislav Sedov
1846*ae771770SStanislav Sedov
1847*ae771770SStanislav Sedov
1848*ae771770SStanislav Sedov
1849*ae771770SStanislav Sedov
1850*ae771770SStanislav SedovCostello                    Standards Track                    [Page 33]
1851*ae771770SStanislav Sedov
1852*ae771770SStanislav SedovRFC 3492                     IDNA Punycode                    March 2003
1853*ae771770SStanislav Sedov
1854*ae771770SStanislav Sedov
1855*ae771770SStanislav SedovAuthor's Address
1856*ae771770SStanislav Sedov
1857*ae771770SStanislav Sedov   Adam M. Costello
1858*ae771770SStanislav Sedov   University of California, Berkeley
1859*ae771770SStanislav Sedov   http://www.nicemice.net/amc/
1860*ae771770SStanislav Sedov
1861*ae771770SStanislav Sedov
1862*ae771770SStanislav Sedov
1863*ae771770SStanislav Sedov
1864*ae771770SStanislav Sedov
1865*ae771770SStanislav Sedov
1866*ae771770SStanislav Sedov
1867*ae771770SStanislav Sedov
1868*ae771770SStanislav Sedov
1869*ae771770SStanislav Sedov
1870*ae771770SStanislav Sedov
1871*ae771770SStanislav Sedov
1872*ae771770SStanislav Sedov
1873*ae771770SStanislav Sedov
1874*ae771770SStanislav Sedov
1875*ae771770SStanislav Sedov
1876*ae771770SStanislav Sedov
1877*ae771770SStanislav Sedov
1878*ae771770SStanislav Sedov
1879*ae771770SStanislav Sedov
1880*ae771770SStanislav Sedov
1881*ae771770SStanislav Sedov
1882*ae771770SStanislav Sedov
1883*ae771770SStanislav Sedov
1884*ae771770SStanislav Sedov
1885*ae771770SStanislav Sedov
1886*ae771770SStanislav Sedov
1887*ae771770SStanislav Sedov
1888*ae771770SStanislav Sedov
1889*ae771770SStanislav Sedov
1890*ae771770SStanislav Sedov
1891*ae771770SStanislav Sedov
1892*ae771770SStanislav Sedov
1893*ae771770SStanislav Sedov
1894*ae771770SStanislav Sedov
1895*ae771770SStanislav Sedov
1896*ae771770SStanislav Sedov
1897*ae771770SStanislav Sedov
1898*ae771770SStanislav Sedov
1899*ae771770SStanislav Sedov
1900*ae771770SStanislav Sedov
1901*ae771770SStanislav Sedov
1902*ae771770SStanislav Sedov
1903*ae771770SStanislav Sedov
1904*ae771770SStanislav Sedov
1905*ae771770SStanislav Sedov
1906*ae771770SStanislav SedovCostello                    Standards Track                    [Page 34]
1907*ae771770SStanislav Sedov
1908*ae771770SStanislav SedovRFC 3492                     IDNA Punycode                    March 2003
1909*ae771770SStanislav Sedov
1910*ae771770SStanislav Sedov
1911*ae771770SStanislav SedovFull Copyright Statement
1912*ae771770SStanislav Sedov
1913*ae771770SStanislav Sedov   Copyright (C) The Internet Society (2003).  All Rights Reserved.
1914*ae771770SStanislav Sedov
1915*ae771770SStanislav Sedov   This document and translations of it may be copied and furnished to
1916*ae771770SStanislav Sedov   others, and derivative works that comment on or otherwise explain it
1917*ae771770SStanislav Sedov   or assist in its implementation may be prepared, copied, published
1918*ae771770SStanislav Sedov   and distributed, in whole or in part, without restriction of any
1919*ae771770SStanislav Sedov   kind, provided that the above copyright notice and this paragraph are
1920*ae771770SStanislav Sedov   included on all such copies and derivative works.  However, this
1921*ae771770SStanislav Sedov   document itself may not be modified in any way, such as by removing
1922*ae771770SStanislav Sedov   the copyright notice or references to the Internet Society or other
1923*ae771770SStanislav Sedov   Internet organizations, except as needed for the purpose of
1924*ae771770SStanislav Sedov   developing Internet standards in which case the procedures for
1925*ae771770SStanislav Sedov   copyrights defined in the Internet Standards process must be
1926*ae771770SStanislav Sedov   followed, or as required to translate it into languages other than
1927*ae771770SStanislav Sedov   English.
1928*ae771770SStanislav Sedov
1929*ae771770SStanislav Sedov   The limited permissions granted above are perpetual and will not be
1930*ae771770SStanislav Sedov   revoked by the Internet Society or its successors or assigns.
1931*ae771770SStanislav Sedov
1932*ae771770SStanislav Sedov   This document and the information contained herein is provided on an
1933*ae771770SStanislav Sedov   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
1934*ae771770SStanislav Sedov   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
1935*ae771770SStanislav Sedov   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
1936*ae771770SStanislav Sedov   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
1937*ae771770SStanislav Sedov   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
1938*ae771770SStanislav Sedov
1939*ae771770SStanislav SedovAcknowledgement
1940*ae771770SStanislav Sedov
1941*ae771770SStanislav Sedov   Funding for the RFC Editor function is currently provided by the
1942*ae771770SStanislav Sedov   Internet Society.
1943*ae771770SStanislav Sedov
1944*ae771770SStanislav Sedov
1945*ae771770SStanislav Sedov
1946*ae771770SStanislav Sedov
1947*ae771770SStanislav Sedov
1948*ae771770SStanislav Sedov
1949*ae771770SStanislav Sedov
1950*ae771770SStanislav Sedov
1951*ae771770SStanislav Sedov
1952*ae771770SStanislav Sedov
1953*ae771770SStanislav Sedov
1954*ae771770SStanislav Sedov
1955*ae771770SStanislav Sedov
1956*ae771770SStanislav Sedov
1957*ae771770SStanislav Sedov
1958*ae771770SStanislav Sedov
1959*ae771770SStanislav Sedov
1960*ae771770SStanislav Sedov
1961*ae771770SStanislav Sedov
1962*ae771770SStanislav SedovCostello                    Standards Track                    [Page 35]
1963*ae771770SStanislav Sedov
1964