#lang scribble/doc
@(require "utils.rkt")

@bc-title[#:tag "im:encodings"]{String Encodings}

The @cpp{scheme_utf8_decode} function decodes a @cpp{char} array as
UTF-8 into either a UCS-4 @cpp{mzchar} array or a UTF-16 @cpp{short}
array. The @cpp{scheme_utf8_encode} function encodes either a UCS-4
@cpp{mzchar} array or a UTF-16 @cpp{short} array into a UTF-8
@cpp{char} array.

These functions can be used to check or measure an encoding or
decoding without actually producing the resulting decoding or encoding,
and variants of the functions provide control over the handling of
decoding errors.

@function[(int scheme_utf8_decode
           [const-unsigned-char* s]
           [int start]
           [int end]
           [mzchar* us]
           [int dstart]
           [int dend]
           [intptr_t* ipos]
           [char utf16]
           [int permissive])]{

Decodes a byte array as UTF-8 to produce either Unicode code points
 into @var{us} (when @var{utf16} is zero) or UTF-16 code units into
 @var{us} cast to @cpp{short*} (when @var{utf16} is non-zero). No nul
 terminator is added to @var{us}.

The result is non-negative when all of the given bytes are decoded,
 and the result is the length of the decoding (in @cpp{mzchar}s or
 @cpp{short}s). A @cpp{-2} result indicates an invalid encoding
 sequence in the given bytes (possibly because the range to decode
 ended mid-encoding), and a @cpp{-3} result indicates that decoding
 stopped because not enough room was available in the result string.

The @var{start} and @var{end} arguments specify a range of @var{s} to
 be decoded. If @var{end} is negative, @cpp{strlen(@var{s})} is used
 as the end.

If @var{us} is @cpp{NULL}, then the decoding is not written, but
 the result is valid as if the decoding were written. The
 @var{dstart} and @var{dend} arguments specify a target range in
 @var{us} (in @cpp{mzchar} or @cpp{short} units) for the decoding; a
 negative value for @var{dend} indicates that any number of units can
 be written to @var{us}, which is normally sensible only when @var{us}
 is @cpp{NULL} for measuring the length of the decoding.

If @var{ipos} is non-@cpp{NULL}, it is filled with the first undecoded
 index within @var{s}. If the function result is non-negative, then
 @cpp{*@var{ipos}} is set to the ending index (which is @var{end} if
 non-negative, @cpp{strlen(@var{s})} otherwise). If the result is
 @cpp{-1} or @cpp{-2}, then @cpp{*@var{ipos}} effectively indicates
 how many bytes were decoded before decoding stopped.

If @var{permissive} is non-zero, its value is used as the decoding of
 each byte that is not part of a valid UTF-8 encoding, including the
 bytes of an encoding that is cut off by the end of the input. Thus,
 the function result can be @cpp{-1} or @cpp{-2} only if
 @var{permissive} is @cpp{0}.

On Windows, when @var{utf16} is non-zero, decoding supports a natural
 extension of UTF-8 that can produce unpaired UTF-16 surrogates in the
 result.

This function does not allocate or trigger garbage collection.}
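
The result codes and the @cpp{NULL}-for-measuring convention described above can be illustrated with a small standalone decoder. The following is only a sketch under stated assumptions: @cpp{sketch_utf8_decode} is a hypothetical name, it handles only the UCS-4 case, and it omits the @var{utf16} and @var{permissive} arguments. It does not reflect how Racket implements @cpp{scheme_utf8_decode}.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in that mimics the documented result codes; this
   is NOT Racket's implementation. It decodes s[start..end) into code
   points at us[dstart..dend), returning the number of code points, -2
   for an invalid or truncated sequence, or -3 when us runs out of
   room. A NULL us only measures. */
static int sketch_utf8_decode(const unsigned char *s, int start, int end,
                              uint32_t *us, int dstart, int dend,
                              intptr_t *ipos)
{
  int i = start, j = dstart, err = 0;
  assert(s != NULL && start >= 0);
  if (end < 0) end = (int)strlen((const char *)s);
  while (i < end) {
    uint32_t v, min;
    int extra, k;
    unsigned char b = s[i];
    if (b < 0x80)                { v = b;        extra = 0; min = 0; }
    else if ((b & 0xE0) == 0xC0) { v = b & 0x1F; extra = 1; min = 0x80; }
    else if ((b & 0xF0) == 0xE0) { v = b & 0x0F; extra = 2; min = 0x800; }
    else if ((b & 0xF8) == 0xF0) { v = b & 0x07; extra = 3; min = 0x10000; }
    else { err = -2; break; }                  /* not a lead byte */
    if (i + extra >= end) { err = -2; break; } /* range ends mid-encoding */
    for (k = 1; k <= extra && (s[i + k] & 0xC0) == 0x80; k++)
      v = (v << 6) | (s[i + k] & 0x3F);
    if (k <= extra || v < min || v > 0x10FFFF
        || (v >= 0xD800 && v <= 0xDFFF)) {
      err = -2;   /* bad continuation byte, overlong form, or bad range */
      break;
    }
    if (dend >= 0 && j >= dend) { err = -3; break; } /* target is full */
    if (us) us[j] = v;
    j++;
    i += extra + 1;
  }
  if (ipos) *ipos = i;   /* first undecoded index */
  return err ? err : j - dstart;
}
```

A caller of such a function would typically measure first with a @cpp{NULL} target and a negative @var{dend}, allocate that many elements, and then decode into the new buffer.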

@function[(int scheme_utf8_decode_offset_prefix
           [const-unsigned-char* s]
           [int start]
           [int end]
           [mzchar* us]
           [int dstart]
           [int dend]
           [intptr_t* ipos]
           [char utf16]
           [int permissive])]{

Like @cpp{scheme_utf8_decode}, but returns @cpp{-1} if the input ends
in the middle of a UTF-8 encoding even if @var{permissive} is
non-zero.

@history[#:added "6.0.1.13"]}


@function[(int scheme_utf8_decode_as_prefix
           [const-unsigned-char* s]
           [int start]
           [int end]
           [mzchar* us]
           [int dstart]
           [int dend]
           [intptr_t* ipos]
           [char utf16]
           [int permissive])]{

Like @cpp{scheme_utf8_decode}, but the result is always the number
 of decoded @cpp{mzchar}s or @cpp{short}s. If a decoding error is
 encountered, the result is still the size of the decoding up until
 the error.}

@function[(int scheme_utf8_decode_all
           [const-unsigned-char* s]
           [int len]
           [mzchar* us]
           [int permissive])]{

Like @cpp{scheme_utf8_decode}, but with fewer arguments. The
 decoding produces UCS-4 @cpp{mzchar}s. If the buffer @var{us} is
 non-@cpp{NULL}, it is assumed to be long enough to hold the decoding
 (which cannot be longer than the length of the input, though it may
 be shorter). If @var{len} is negative, @cpp{strlen(@var{s})} is used
 as the input length.}


@function[(int scheme_utf8_decode_prefix
           [const-unsigned-char* s]
           [int len]
           [mzchar* us]
           [int permissive])]{

Like @cpp{scheme_utf8_decode}, but with fewer arguments. The
 decoding produces UCS-4 @cpp{mzchar}s. The buffer @var{us}
 @bold{must} be non-@cpp{NULL}, and it is assumed to be long enough to hold the
 decoding (which cannot be longer than the length of the input, though
 it may be shorter). If @var{len} is negative, @cpp{strlen(@var{s})}
 is used as the input length.

In addition to the possible results of @cpp{scheme_utf8_decode}, the
 result can be @cpp{-1} to indicate that the input ended with a partial
 (valid) encoding. A @cpp{-1} result is possible even when
 @var{permissive} is non-zero.}
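
The distinction between an invalid encoding and a partial (valid) encoding matters mostly when bytes arrive in pieces, because a partial encoding may be completed by later input. As a rough illustration, the hypothetical helper below (not part of the Racket API) reports how many bytes at the end of a buffer begin a UTF-8 sequence without finishing it. It inspects only lead and continuation bit patterns and, as a simplifying assumption, does not check for overlong or out-of-range encodings.

```c
#include <assert.h>

/* Hypothetical helper, not part of Racket's API: returns the number of
   bytes at the end of s[0..len) that start a UTF-8 sequence but stop
   before its final byte, or 0 if the input does not end mid-sequence. */
static int sketch_incomplete_tail(const unsigned char *s, int len)
{
  int back;
  assert(len >= 0);
  /* A lead byte is followed by at most 3 continuation bytes, so an
     incomplete sequence is at most 3 bytes long. */
  for (back = 1; back <= 3 && back <= len; back++) {
    unsigned char b = s[len - back];
    int need;
    if ((b & 0xC0) == 0x80) continue;     /* continuation: keep looking */
    if ((b & 0xE0) == 0xC0) need = 2;
    else if ((b & 0xF0) == 0xE0) need = 3;
    else if ((b & 0xF8) == 0xF0) need = 4;
    else return 0;                        /* ASCII or not a lead byte */
    return (back < need) ? back : 0;
  }
  return 0;
}
```

A reader loop could hold back that many bytes and prepend them to the next chunk before decoding again.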

@function[(mzchar* scheme_utf8_decode_to_buffer
           [const-unsigned-char* s]
           [int len]
           [mzchar* buf]
           [int blen])]{

Like @cpp{scheme_utf8_decode_all} with @var{permissive} as @cpp{0},
 but if @var{buf} is not large enough (as indicated by @var{blen}) to
 hold the result, a new buffer is allocated. Unlike other functions,
 this one adds a nul terminator to the decoding result. The function
 result is either @var{buf} (if it was big enough) or a buffer
 allocated with @cpp{scheme_malloc_atomic}.}

@function[(mzchar* scheme_utf8_decode_to_buffer_len
           [const-unsigned-char* s]
           [int len]
           [mzchar* buf]
           [int blen]
           [intptr_t* ulen])]{

Like @cpp{scheme_utf8_decode_to_buffer}, but the length of the
 result (not including the terminator) is placed into @var{ulen} if
 @var{ulen} is non-@cpp{NULL}.}

@function[(int scheme_utf8_decode_count
           [const-unsigned-char* s]
           [int start]
           [int end]
           [int* state]
           [int might_continue]
           [int permissive])]{

Like @cpp{scheme_utf8_decode}, but without producing the decoded
 @cpp{mzchar}s, and always returning the number of decoded
 @cpp{mzchar}s up until a decoding error (if any). If
 @var{might_continue} is non-zero, then a partial valid encoding at
 the end of the input is not decoded when @var{permissive} is also
 non-zero.

If @var{state} is non-@cpp{NULL}, it holds information about partial
 encodings; it should be set to zero for an initial call, and then
 passed back to @cpp{scheme_utf8_decode_count} along with bytes that
 extend the given input (i.e., without any unused partial
 encodings). Typically, this mode makes sense only when
 @var{might_continue} and @var{permissive} are non-zero.}
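
The idea of carrying a state value between calls can be sketched with a standalone counter. In the hypothetical function below, the state is simply the number of continuation bytes still owed by a sequence that began in an earlier chunk. That representation is an assumption made for illustration; the meaning of Racket's @var{state} value is not specified here. The sketch also assumes valid UTF-8 input and does no error handling.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch, not Racket's implementation: counts the code
   points completed in s[start..end). *state carries the number of
   continuation bytes still owed by a sequence that began in an earlier
   chunk; initialize it to 0 before the first call. The input is
   assumed to be valid UTF-8. */
static int sketch_utf8_count(const unsigned char *s, int start, int end,
                             int *state)
{
  int pending, count = 0, i;
  assert(state != NULL);
  pending = *state;
  for (i = start; i < end; i++) {
    unsigned char b = s[i];
    if (pending > 0) {
      if (--pending == 0) count++;   /* sequence completed in this chunk */
    } else if (b < 0x80) count++;    /* ASCII */
    else if ((b & 0xE0) == 0xC0) pending = 1;
    else if ((b & 0xF0) == 0xE0) pending = 2;
    else pending = 3;
  }
  *state = pending;
  return count;
}
```

Summing the results over consecutive chunks gives the same total as counting the concatenated input, even when a chunk boundary splits a multi-byte sequence.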


@function[(int scheme_utf8_encode
           [const-mzchar* us]
           [int start]
           [int end]
           [unsigned-char* s]
           [int dstart]
           [char utf16])]{

Encodes the given UCS-4 array of @cpp{mzchar}s (if @var{utf16} is
 zero) or UTF-16 array of @cpp{short}s (if @var{utf16} is non-zero)
 into @var{s}. The @var{end} argument must be no less than
 @var{start}.

The array @var{s} is assumed to be long enough to contain the
 encoding, but no encoding is written if @var{s} is @cpp{NULL}. The
 @var{dstart} argument indicates a starting place in @var{s} to hold
 the encoding. No nul terminator is added to @var{s}.

The result is the number of bytes produced for the encoding (or that
 would be produced if @var{s} were non-@cpp{NULL}). Encoding never
 fails.

On Windows, when @var{utf16} is non-zero, encoding supports unpaired
 surrogates in the input UTF-16 code-unit sequence, in which case
 encoding generates a natural extension of UTF-8 that encodes unpaired
 surrogates.

This function does not allocate or trigger garbage collection.}
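
The encoding direction and its measuring mode can be sketched in the same standalone style. The function below is a hypothetical stand-in that covers only the UCS-4 case (the @var{utf16} argument is omitted) and assumes every input value is a valid code point. It is not Racket's implementation of @cpp{scheme_utf8_encode}.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in, NOT Racket's code: encodes us[start..end) as
   UTF-8 into s beginning at index dstart and returns the number of
   bytes produced. When s is NULL, nothing is written and the result is
   the number of bytes that would be produced. */
static int sketch_utf8_encode(const uint32_t *us, int start, int end,
                              unsigned char *s, int dstart)
{
  int i, j = dstart;
  assert(end >= start);
  for (i = start; i < end; i++) {
    uint32_t c = us[i];
    if (c < 0x80) {
      if (s) s[j] = (unsigned char)c;
      j += 1;
    } else if (c < 0x800) {
      if (s) {
        s[j]     = (unsigned char)(0xC0 | (c >> 6));
        s[j + 1] = (unsigned char)(0x80 | (c & 0x3F));
      }
      j += 2;
    } else if (c < 0x10000) {
      if (s) {
        s[j]     = (unsigned char)(0xE0 | (c >> 12));
        s[j + 1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        s[j + 2] = (unsigned char)(0x80 | (c & 0x3F));
      }
      j += 3;
    } else {
      if (s) {
        s[j]     = (unsigned char)(0xF0 | (c >> 18));
        s[j + 1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
        s[j + 2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        s[j + 3] = (unsigned char)(0x80 | (c & 0x3F));
      }
      j += 4;
    }
  }
  return j - dstart;
}
```

Because no terminator is written, a caller that needs a C string would reserve one extra byte and store the nul itself.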

@function[(int scheme_utf8_encode_all
           [const-mzchar* us]
           [int len]
           [unsigned-char* s])]{

Like @cpp{scheme_utf8_encode} with @cpp{0} for @var{start},
 @var{len} for @var{end}, @cpp{0} for @var{dstart} and @cpp{0} for
 @var{utf16}.}


@function[(char* scheme_utf8_encode_to_buffer
           [const-mzchar* s]
           [int len]
           [char* buf]
           [int blen])]{

Like @cpp{scheme_utf8_encode_all}, but the length of @var{buf} is
 given, and if it is not long enough to hold the encoding, a buffer is
 allocated. A nul terminator is added to the encoded array. The result
 is either @var{buf} or an array allocated with
 @cpp{scheme_malloc_atomic}.}

@function[(char* scheme_utf8_encode_to_buffer_len
           [const-mzchar* s]
           [int len]
           [char* buf]
           [int blen]
           [intptr_t* rlen])]{

Like @cpp{scheme_utf8_encode_to_buffer}, but the length of the
 resulting encoding (not including a nul terminator) is reported in
 @var{rlen} if it is non-@cpp{NULL}.}
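
The "use the caller's buffer if it fits, otherwise allocate" pattern behind the @cpp{_to_buffer} functions can be sketched as follows. Everything named @cpp{sketch_...} is hypothetical, @cpp{malloc} stands in for @cpp{scheme_malloc_atomic}, and the sketch assumes that @var{blen} must include room for the nul terminator, which the description above does not state explicitly.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Lead-byte prefixes indexed by encoded length (index 0 is unused). */
static const unsigned char sketch_lead[] = { 0, 0x00, 0xC0, 0xE0, 0xF0 };

/* Number of UTF-8 bytes needed for one (assumed valid) code point. */
static int sketch_utf8_size(uint32_t c)
{
  return c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
}

/* Hypothetical sketch, NOT Racket's code: encodes us[0..len) into buf
   when blen has room for the encoding plus a nul terminator, and into
   a malloc'ed buffer otherwise (malloc stands in for
   scheme_malloc_atomic). Returns the buffer used, or NULL if
   allocation fails; a returned buffer other than buf must be freed by
   the caller in this sketch. */
static char *sketch_encode_to_buffer(const uint32_t *us, int len,
                                     char *buf, int blen, intptr_t *rlen)
{
  int i, k, j = 0, need = 0;
  assert(len >= 0);
  for (i = 0; i < len; i++) need += sketch_utf8_size(us[i]);
  if (!buf || blen < need + 1) {
    buf = malloc((size_t)need + 1);
    if (!buf) return NULL;
  }
  for (i = 0; i < len; i++) {
    uint32_t c = us[i];
    int n = sketch_utf8_size(c);
    buf[j] = (char)(sketch_lead[n] | (c >> (6 * (n - 1))));
    for (k = 1; k < n; k++)
      buf[j + k] = (char)(0x80 | ((c >> (6 * (n - 1 - k))) & 0x3F));
    j += n;
  }
  buf[j] = '\0';
  if (rlen) *rlen = j;   /* length without the terminator */
  return buf;
}
```

Comparing the result pointer against the buffer that was passed in tells the caller whether a fresh allocation was made.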


@function[(unsigned-short* scheme_ucs4_to_utf16
           [const-mzchar* text]
           [int start]
           [int end]
           [unsigned-short* buf]
           [int bufsize]
           [intptr_t* ulen]
           [int term_size])]{

Converts a UCS-4 encoding (the indicated range of @var{text}) to a
 UTF-16 encoding. The @var{end} argument must be no less than
 @var{start}.

A result buffer is allocated if @var{buf} is not long enough (as
 indicated by @var{bufsize}). If @var{ulen} is non-@cpp{NULL}, it is
 filled with the length of the UTF-16 encoding. The @var{term_size}
 argument indicates a number of @cpp{short}s to reserve at the end of
 the result buffer for a terminator (but no terminator is actually
 written).}
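
The core of a UCS-4 to UTF-16 conversion is the surrogate-pair split for code points above @cpp{0xFFFF}. The sketch below shows only that step; the hypothetical @cpp{sketch_ucs4_to_utf16} takes a caller-supplied output array and leaves out the buffer allocation and @var{term_size} handling described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, NOT Racket's code: converts text[start..end) to
   UTF-16 code units in out and returns the number of units. When out
   is NULL, nothing is written and the result is the required size. */
static int sketch_ucs4_to_utf16(const uint32_t *text, int start, int end,
                                uint16_t *out)
{
  int i, j = 0;
  assert(end >= start);
  for (i = start; i < end; i++) {
    uint32_t c = text[i];
    if (c < 0x10000) {
      if (out) out[j] = (uint16_t)c;
      j += 1;
    } else {
      c -= 0x10000;   /* 20 bits remain, split 10/10 */
      if (out) {
        out[j]     = (uint16_t)(0xD800 | (c >> 10));    /* high surrogate */
        out[j + 1] = (uint16_t)(0xDC00 | (c & 0x3FF));  /* low surrogate */
      }
      j += 2;
    }
  }
  return j;
}
```

Since each code point yields at most two units, a buffer of twice the input length (plus any terminator space) is always sufficient.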

@function[(mzchar* scheme_utf16_to_ucs4
           [const-unsigned-short* text]
           [int start]
           [int end]
           [mzchar* buf]
           [int bufsize]
           [intptr_t* ulen]
           [int term_size])]{

Converts a UTF-16 encoding (the indicated range of @var{text}) to a
 UCS-4 encoding. The @var{end} argument must be no less than
 @var{start}.

A result buffer is allocated if @var{buf} is not long enough (as
 indicated by @var{bufsize}). If @var{ulen} is non-@cpp{NULL}, it is
 filled with the length of the UCS-4 encoding. The @var{term_size}
 argument indicates a number of @cpp{mzchar}s to reserve at the end of
 the result buffer for a terminator (but no terminator is actually
 written).}
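
The reverse direction combines each high/low surrogate pair into one code point. In the hypothetical sketch below, an unpaired surrogate is copied through unchanged; that choice is an assumption made for illustration, since the treatment of unpaired surrogates by @cpp{scheme_utf16_to_ucs4} is not described here. Buffer allocation and @var{term_size} are again left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, NOT Racket's code: converts UTF-16 code units in
   text[start..end) to code points in out and returns their count. When
   out is NULL, nothing is written. Unpaired surrogates are passed
   through as-is (an assumption of this sketch). */
static int sketch_utf16_to_ucs4(const uint16_t *text, int start, int end,
                                uint32_t *out)
{
  int i = start, j = 0;
  assert(end >= start);
  while (i < end) {
    uint32_t c = text[i++];
    if (c >= 0xD800 && c <= 0xDBFF && i < end
        && text[i] >= 0xDC00 && text[i] <= 0xDFFF) {
      /* Combine the 10 bits from each half of the pair. */
      c = 0x10000 + ((c - 0xD800) << 10) + (uint32_t)(text[i] - 0xDC00);
      i++;
    }
    if (out) out[j] = c;
    j++;
  }
  return j;
}
```

The output never has more elements than the input, so an array as long as the input range is always large enough.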