#lang scribble/doc
@(require "utils.rkt")

@bc-title[#:tag "im:encodings"]{String Encodings}

The @cpp{scheme_utf8_decode} function decodes a @cpp{char} array as
UTF-8 into either a UCS-4 @cpp{mzchar} array or a UTF-16 @cpp{short}
array. The @cpp{scheme_utf8_encode} function encodes either a UCS-4
@cpp{mzchar} array or a UTF-16 @cpp{short} array into a UTF-8
@cpp{char} array.

These functions can be used to check or measure an encoding or
decoding without actually producing the resulting decoding or encoding,
and variations of the functions provide control over the handling of
decoding errors.

@function[(int scheme_utf8_decode
           [const-unsigned-char* s]
           [int start]
           [int end]
           [mzchar* us]
           [int dstart]
           [int dend]
           [intptr_t* ipos]
           [char utf16]
           [int permissive])]{

Decodes a byte array as UTF-8 to produce either Unicode code points
 into @var{us} (when @var{utf16} is zero) or UTF-16 code units into
 @var{us} cast to @cpp{short*} (when @var{utf16} is non-zero). No nul
 terminator is added to @var{us}.

The result is non-negative when all of the given bytes are decoded,
 and the result is the length of the decoding (in @cpp{mzchar}s or
 @cpp{short}s). A @cpp{-2} result indicates an invalid encoding
 sequence in the given bytes (possibly because the range to decode
 ended mid-encoding), and a @cpp{-3} result indicates that decoding
 stopped because not enough room was available in the result string.

The @var{start} and @var{end} arguments specify a range of @var{s} to
 be decoded. If @var{end} is negative, @cpp{strlen(@var{s})} is used
 as the end.

If @var{us} is @cpp{NULL}, then decoded bytes are not produced, but
 the result is valid as if decoded bytes were written.
The @var{dstart} and @var{dend} arguments specify a target range in
 @var{us} (in @cpp{mzchar} or @cpp{short} units) for the decoding; a
 negative value for @var{dend} indicates that any number of bytes can
 be written to @var{us}, which is normally sensible only when @var{us}
 is @cpp{NULL} for measuring the length of the decoding.

If @var{ipos} is non-@cpp{NULL}, it is filled with the first undecoded
 index within @var{s}. If the function result is non-negative, then
 @cpp{*@var{ipos}} is set to the ending index (which is @var{end} if
 non-negative, @cpp{strlen(@var{s})} otherwise). If the result is
 @cpp{-1} or @cpp{-2}, then @cpp{*@var{ipos}} effectively indicates
 how many bytes were decoded before decoding stopped.

If @var{permissive} is non-zero, its value is used as the decoding of
 any byte that is not part of a valid UTF-8 encoding, and also when the
 input ends in the middle of an encoding. Thus, the function
 result can be @cpp{-1} or @cpp{-2} only if @var{permissive} is @cpp{0}.

On Windows, when @var{utf16} is non-zero, decoding supports a natural
 extension of UTF-8 that can produce unpaired UTF-16 surrogates in the
 result.

This function does not allocate or trigger garbage collection.}

@function[(int scheme_utf8_decode_offset_prefix
           [const-unsigned-char* s]
           [int start]
           [int end]
           [mzchar* us]
           [int dstart]
           [int dend]
           [intptr_t* ipos]
           [char utf16]
           [int permissive])]{

Like @cpp{scheme_utf8_decode}, but returns @cpp{-1} if the input ends
in the middle of a UTF-8 encoding even if @var{permissive} is
non-zero.

@history[#:added "6.0.1.13"]}


@function[(int scheme_utf8_decode_as_prefix
           [const-unsigned-char* s]
           [int start]
           [int end]
           [mzchar* us]
           [int dstart]
           [int dend]
           [intptr_t* ipos]
           [char utf16]
           [int permissive])]{

Like @cpp{scheme_utf8_decode}, but the result is always the number
 of the decoded @cpp{mzchar}s or @cpp{short}s.
If a decoding error is
 encountered, the result is still the size of the decoding up until
 the error.}

@function[(int scheme_utf8_decode_all
           [const-unsigned-char* s]
           [int len]
           [mzchar* us]
           [int permissive])]{

Like @cpp{scheme_utf8_decode}, but with fewer arguments. The
 decoding produces UCS-4 @cpp{mzchar}s. If the buffer @var{us} is
 non-@cpp{NULL}, it is assumed to be long enough to hold the decoding
 (which cannot be longer than the length of the input, though it may
 be shorter). If @var{len} is negative, @cpp{strlen(@var{s})} is used
 as the input length.}


@function[(int scheme_utf8_decode_prefix
           [const-unsigned-char* s]
           [int len]
           [mzchar* us]
           [int permissive])]{

Like @cpp{scheme_utf8_decode}, but with fewer arguments. The
 decoding produces UCS-4 @cpp{mzchar}s. The buffer @var{us}
 @bold{must} be non-@cpp{NULL}, and it is assumed to be long enough to hold the
 decoding (which cannot be longer than the length of the input, though
 it may be shorter). If @var{len} is negative, @cpp{strlen(@var{s})}
 is used as the input length.

In addition to the result of @cpp{scheme_utf8_decode}, the result
 can be @cpp{-1} to indicate that the input ended with a partial
 (valid) encoding. A @cpp{-1} result is possible even when
 @var{permissive} is non-zero.}

@function[(mzchar* scheme_utf8_decode_to_buffer
           [const-unsigned-char* s]
           [int len]
           [mzchar* buf]
           [int blen])]{

Like @cpp{scheme_utf8_decode_all} with @var{permissive} as @cpp{0},
 but if @var{buf} is not large enough (as indicated by @var{blen}) to
 hold the result, a new buffer is allocated. Unlike other functions,
 this one adds a nul terminator to the decoding result.
The function
 result is either @var{buf} (if it was big enough) or a buffer
 allocated with @cpp{scheme_malloc_atomic}.}

@function[(mzchar* scheme_utf8_decode_to_buffer_len
           [const-unsigned-char* s]
           [int len]
           [mzchar* buf]
           [int blen]
           [intptr_t* ulen])]{

Like @cpp{scheme_utf8_decode_to_buffer}, but the length of the
 result (not including the terminator) is placed into @var{ulen} if
 @var{ulen} is non-@cpp{NULL}.}

@function[(int scheme_utf8_decode_count
           [const-unsigned-char* s]
           [int start]
           [int end]
           [int* state]
           [int might_continue]
           [int permissive])]{

Like @cpp{scheme_utf8_decode}, but without producing the decoded
 @cpp{mzchar}s, and always returning the number of decoded
 @cpp{mzchar}s up until a decoding error (if any). If
 @var{might_continue} is non-zero, then a partial valid encoding at
 the end of the input is not decoded when @var{permissive} is also
 non-zero.

If @var{state} is non-@cpp{NULL}, it holds information about partial
 encodings; it should be set to zero for an initial call, and then
 passed back to @cpp{scheme_utf8_decode_count} along with bytes that
 extend the given input (i.e., without any unused partial
 encodings). Typically, this mode makes sense only when
 @var{might_continue} and @var{permissive} are non-zero.}


@function[(int scheme_utf8_encode
           [const-mzchar* us]
           [int start]
           [int end]
           [unsigned-char* s]
           [int dstart]
           [char utf16])]{

Encodes the given UCS-4 array of @cpp{mzchar}s (if @var{utf16} is
 zero) or UTF-16 array of @cpp{short}s (if @var{utf16} is non-zero)
 into @var{s}. The @var{end} argument must be no less than
 @var{start}.

The array @var{s} is assumed to be long enough to contain the
 encoding, but no encoding is written if @var{s} is @cpp{NULL}. The
 @var{dstart} argument indicates a starting place in @var{s} to hold
 the encoding.
No nul terminator is added to @var{s}.

The result is the number of bytes produced for the encoding (or that
 would be produced if @var{s} were non-@cpp{NULL}). Encoding never
 fails.

On Windows, when @var{utf16} is non-zero, encoding supports unpaired
 surrogates in the input UTF-16 code-unit sequence, in which case
 encoding generates a natural extension of UTF-8 that encodes unpaired
 surrogates.

This function does not allocate or trigger garbage collection.}

@function[(int scheme_utf8_encode_all
           [const-mzchar* us]
           [int len]
           [unsigned-char* s])]{

Like @cpp{scheme_utf8_encode} with @cpp{0} for @var{start},
 @var{len} for @var{end}, @cpp{0} for @var{dstart} and @cpp{0} for
 @var{utf16}.}


@function[(char* scheme_utf8_encode_to_buffer
           [const-mzchar* s]
           [int len]
           [char* buf]
           [int blen])]{

Like @cpp{scheme_utf8_encode_all}, but the length of @var{buf} is
 given, and if it is not long enough to hold the encoding, a buffer is
 allocated. A nul terminator is added to the encoded array. The result
 is either @var{buf} or an array allocated with
 @cpp{scheme_malloc_atomic}.}

@function[(char* scheme_utf8_encode_to_buffer_len
           [const-mzchar* s]
           [int len]
           [char* buf]
           [int blen]
           [intptr_t* rlen])]{

Like @cpp{scheme_utf8_encode_to_buffer}, but the length of the
 resulting encoding (not including a nul terminator) is reported in
 @var{rlen} if it is non-@cpp{NULL}.}


@function[(unsigned-short* scheme_ucs4_to_utf16
           [const-mzchar* text]
           [int start]
           [int end]
           [unsigned-short* buf]
           [int bufsize]
           [intptr_t* ulen]
           [int term_size])]{

Converts a UCS-4 encoding (the indicated range of @var{text}) to a
 UTF-16 encoding. The @var{end} argument must be no less than
 @var{start}.

A result buffer is allocated if @var{buf} is not long enough (as
 indicated by @var{bufsize}).
If @var{ulen} is non-@cpp{NULL}, it is
 filled with the length of the UTF-16 encoding. The @var{term_size}
 argument indicates a number of @cpp{short}s to reserve at the end of
 the result buffer for a terminator (but no terminator is actually
 written).}

@function[(mzchar* scheme_utf16_to_ucs4
           [const-unsigned-short* text]
           [int start]
           [int end]
           [mzchar* buf]
           [int bufsize]
           [intptr_t* ulen]
           [int term_size])]{

Converts a UTF-16 encoding (the indicated range of @var{text}) to a
 UCS-4 encoding. The @var{end} argument must be no less than
 @var{start}.

A result buffer is allocated if @var{buf} is not long enough (as
 indicated by @var{bufsize}). If @var{ulen} is non-@cpp{NULL}, it is
 filled with the length of the UCS-4 encoding. The @var{term_size}
 argument indicates a number of @cpp{mzchar}s to reserve at the end of
 the result buffer for a terminator (but no terminator is actually
 written).}
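
The conversions between UCS-4 and UTF-16 described above can change the
length of a string, which is why the converted length is reported
separately and why a result buffer may need to be allocated. The
following is a minimal, standalone sketch of the UCS-4 to UTF-16
direction as defined by the UTF-16 encoding form. It does not call or
reproduce the Racket implementation: the names
@cpp{sketch_utf16_length} and @cpp{sketch_ucs4_to_utf16} are
hypothetical, @cpp{uint32_t} and @cpp{uint16_t} stand in for
@cpp{mzchar} and @cpp{unsigned short}, and the input is assumed to
contain only valid code points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the Racket C API: counts the UTF-16
   code units needed to represent n UCS-4 code points. */
static size_t sketch_utf16_length(const uint32_t *text, size_t n)
{
  size_t len = 0;
  for (size_t i = 0; i < n; i++) {
    /* Code points above U+FFFF need a surrogate pair (two units). */
    len += (text[i] > 0xFFFF) ? 2 : 1;
  }
  return len;
}

/* Hypothetical helper, not part of the Racket C API: writes the UTF-16
   form of text into buf and returns the number of code units written.
   Assumes every input value is a valid code point and that buf has room
   for sketch_utf16_length(text, n) units. No terminator is written. */
static size_t sketch_ucs4_to_utf16(const uint32_t *text, size_t n,
                                   uint16_t *buf)
{
  size_t j = 0;
  assert(buf != NULL);
  for (size_t i = 0; i < n; i++) {
    uint32_t c = text[i];
    if (c > 0xFFFF) {
      c -= 0x10000;                                /* 20 bits remain */
      buf[j++] = (uint16_t)(0xD800 | (c >> 10));   /* high surrogate */
      buf[j++] = (uint16_t)(0xDC00 | (c & 0x3FF)); /* low surrogate */
    } else {
      buf[j++] = (uint16_t)c;                      /* single unit */
    }
  }
  return j;
}
```

For the three code points U+0041, U+03BB, and U+1F600, the sketch
reports four UTF-16 code units, because U+1F600 becomes the surrogate
pair @cpp{0xD83D} @cpp{0xDC00 | 0x200} (that is, @cpp{0xDE00}). A caller
that wants a terminated string would reserve one extra unit and write
the terminator itself, which is the role the text above gives to
@var{term_size}.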