.\"-
.\" Copyright (c) 2015 Matthew Dillon
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.Dd August 24, 2015
.Dt MBINTOWCR 3
.Os
.Sh NAME
.Nm mbintowcr ,
.Nm mbintowcr_l ,
.Nm utf8towcr ,
.Nm wcrtombin ,
.Nm wcrtombin_l ,
.Nm wcrtoutf8
.Nd "8-bit-clean wchar conversion w/escaping or validation"
.Sh LIBRARY
.Lb libc
.Sh SYNOPSIS
.In wchar.h
.Ft size_t
.Fo mbintowcr
.Fa "wchar_t * restrict dst" "const char * restrict src"
.Fa "size_t dlen" "size_t *slen" "int flags"
.Fc
.Ft size_t
.Fo utf8towcr
.Fa "wchar_t * restrict dst" "const char * restrict src"
.Fa "size_t dlen" "size_t *slen" "int flags"
.Fc
.Ft size_t
.Fo wcrtombin
.Fa "char * restrict dst" "const wchar_t * restrict src"
.Fa "size_t dlen" "size_t *slen" "int flags"
.Fc
.Ft size_t
.Fo wcrtoutf8
.Fa "char * restrict dst" "const wchar_t * restrict src"
.Fa "size_t dlen" "size_t *slen" "int flags"
.Fc
.In xlocale.h
.Ft size_t
.Fo mbintowcr_l
.Fa "wchar_t * restrict dst" "const char * restrict src"
.Fa "size_t dlen" "size_t *slen" "locale_t locale" "int flags"
.Fc
.Ft size_t
.Fo wcrtombin_l
.Fa "char * restrict dst" "const wchar_t * restrict src"
.Fa "size_t dlen" "size_t *slen" "locale_t locale" "int flags"
.Fc
.Sh DESCRIPTION
The
.Fn mbintowcr
and
.Fn wcrtombin
functions translate byte data into wide-char format and back again.
Under normal conditions (but not with all flags) these functions
guarantee that the round-trip will be 8-bit-clean.
Some care must be taken to properly specify the
.Dv WCSBIN_EOF
flag to properly handle trailing incomplete sequences at stream EOF.
.Pp
For the "C" locale these functions are 1:1 (do not convert UTF-8).
For UTF-8 locales these functions convert to/from UTF-8.
Most of the discussion below pertains to UTF-8 translations.
.Pp
The
.Fn utf8towcr
and
.Fn wcrtoutf8
functions do exactly the same thing as the above functions but are locked
to the UTF-8 locale.
That is, these functions work regardless of which locale has been selected
and also do not require any initial
.Fn setlocale
call to initialize.
Applications working explicitly in UTF-8 should use these versions.
.Pp
Any illegal sequences will be escaped using UTF-8B (U+DC80 - U+DCFF).
Illegal sequences include surrogate-space encodings, non-canonical encodings,
codings beyond 0x10FFFF, 5-byte and 6-byte codings (which are no longer legal),
and malformed codings.
Flags may be used to modify this behavior.
.Pp
The
.Fn mbintowcr
function takes generic 8-bit byte data as its input which the caller
expects to be loosely coded in UTF-8 and converts it to an array of
.Vt wchar_t ,
and returns the number of
.Vt wchar_t
that were converted.
The caller must set
.Fa *slen
to the number of bytes in the input buffer and the function will
set
.Fa *slen
on return to the number of bytes in the input buffer that were processed.
.Pp
Fewer bytes than specified might be processed due to the output buffer
reaching its limit or due to an incomplete sequence at the end of the input
buffer when the
.Dv WCSBIN_EOF
flag has not been specified.
.Pp
If processing a stream, the caller
typically copies any unprocessed data at the end of the buffer back to
the beginning and then continues loading the buffer from there.
Be sure to check for an incomplete translation at stream EOF and do a
final translation of the remainder with the
.Dv WCSBIN_EOF
flag set.
.Pp
This function will always generate escapes for illegal UTF-8 code sequences
and by default can produce a clean BYTE-WCHAR-BYTE conversion.
See the flags description later on.
.Pp
This function cannot return an error unless the
.Dv WCSBIN_STRICT
flag is set.
In case of error, any valid conversions are returned first and the caller
is expected to iterate.
The error is returned when it becomes the first element of the buffer.
.Pp
A
.Dv NULL
destination buffer may be specified in which case this function operates
identically except that nothing is written to the buffer.
This feature is typically used for validation with
.Dv WCSBIN_STRICT
and sometimes also used in combination with
.Dv WCSBIN_SURRO
(set if you want to allow surrogates).
.Pp
The
.Fn wcrtombin
function takes an array of
.Vt wchar_t
as its input which is usually expected to be well-formed and converts it
to an array of generic 8-bit byte data.
The caller must set
.Fa *slen
to the number of elements in the input buffer and the function will set
.Fa *slen
on return to the number of elements in the input buffer that were processed.
.Pp
Be sure to properly set the
.Dv WCSBIN_EOF
flag for the last buffer at stream EOF.
.Pp
This function can return an error regardless of the flags if a supplied
wchar code is out of range.
Some flags change the range of allowed wchar codes.
In case of error, any valid conversions are returned first and the
caller is expected to iterate.
The error is returned when it becomes the first element of the buffer.
.Pp
A
.Dv NULL
destination buffer may be specified in which case this function operates
identically except that nothing is written to the buffer.
This feature is typically used for validation with or without
.Dv WCSBIN_STRICT
and sometimes also used in combination with
.Dv WCSBIN_SURRO .
.Pp
One final note on the use of
.Dv WCSBIN_SURRO
for wchars-to-bytes.
If this flag
is not set, surrogates in the escape range will be de-escaped (giving us our
8-bit-clean round-trip), and other surrogates will be passed through as UTF-8
encodings.
In
.Dv WCSBIN_STRICT
mode this flag works slightly differently.
If not specified, no surrogates are allowed at all (escaped or otherwise),
and if specified, all surrogates are allowed and will never be de-escaped.
.Pp
The _l-suffixed versions of
.Fn mbintowcr
and
.Fn wcrtombin
take an explicit
.Fa locale
argument, whereas the
non-suffixed versions use the current global or per-thread locale.
.Sh UTF-8B ESCAPE SEQUENCES
Escaping is handled by converting one or more bytes in the byte sequence to
the UTF-8B escape wchar (U+DC80 - U+DCFF).
Most illegal sequences escape the first byte and then reprocess the remaining
bytes.
An illegal byte
sequence length (5 or 6 bytes), non-canonical encoding, or illegal wchar value
(beyond 0x10FFFF if not modified by flags) will escape all bytes in the
sequence as long as they were not malformed.
.Pp
When converting back to a byte-sequence, if not modified by flags, UTF-8B
escape wchars are converted back to their original bytes.
Other surrogate codes (U+D800 - U+DFFF which are normally illegal) will be
passed through and encoded as UTF-8.
.Sh FLAGS
.Bl -tag -width ".Dv WCSBIN_LONGCODES"
.It Dv WCSBIN_EOF
Indicate that the input buffer represents the last of the input stream.
This causes any partial sequences at the end of the input buffer to be
processed.
.It Dv WCSBIN_SURRO
This flag passes through any surrogate codes that are already UTF-8-encoded.
This is normally illegal but if you are processing a stream which has already
been UTF-8B escaped this flag will prevent the U+DC80 - U+DCFF codes from
being re-escaped bytes-to-wchars and will prevent decoding back to the
original bytes wchars-to-bytes.
This flag is sometimes used on input if the
caller expects the input stream to already be escaped, and not usually used
on output unless the caller explicitly wants to encode to an intermediate
illegal UTF-8 encoding that retains the escapes as escapes.
.Pp
This flag does not prevent additional escapes from being translated on
bytes-to-wchars
.Dv ( WCSBIN_STRICT
prevents escaping on bytes-to-wchars), but
will prevent de-escaping on wchars-to-bytes.
.Pp
This flag breaks round-trip 8-bit-clean operation since escape codes use
the surrogate space and will mix with surrogates that are passed through
on input by this flag in a way that cannot be distinguished.
.It Dv WCSBIN_LONGCODES
Specifying this flag in the bytes-to-wchars direction allows for decoding
of legacy 5-byte and 6-byte sequences as well as 4-byte sequences which
would normally be illegal.
These sequences are illegal and this flag should
not normally be used unless the caller explicitly wants to handle the legacy
case.
.Pp
Specifying this flag in the wchars-to-bytes direction allows normally illegal
wchars to be encoded.
Again, not recommended.
.Pp
This flag does not allow decoding non-canonical sequences.
Such sequences will still be escaped.
.It Dv WCSBIN_STRICT
This flag forces strict parsing in the bytes-to-wchars direction and will
cause
.Fn mbintowcr
to process short or return with an error once processing reaches the
illegal coding rather than escaping the illegal sequence.
This flag is usually specified only when the caller desires to validate
a UTF-8 buffer.
Remember that an error may also be present with return values greater than 0.
A partial sequence at the end of the buffer is not
considered to be an error unless
.Dv WCSBIN_EOF
is also specified.
.Pp
The caller is reminded that when using this feature for validation, a
short-return can happen rather than an error if the error is not at the
base of the source or if
.Dv WCSBIN_EOF
is not specified.
If the caller is not chaining buffers then
.Dv WCSBIN_EOF
should be specified and a simple check of whether
.Fa *slen
equals the original input buffer length on return is sufficient to determine
if an error occurred or not.
If the caller is chaining buffers
.Dv WCSBIN_EOF
is not specified and the caller must proceed with the copy-down / continued
buffer loading loop to distinguish between an incomplete buffer and an error.
.El
.Sh RETURN VALUES
The
.Fn mbintowcr ,
.Fn mbintowcr_l ,
.Fn utf8towcr ,
.Fn wcrtombin ,
.Fn wcrtombin_l
and
.Fn wcrtoutf8
functions return the number of output elements generated and set
.Fa *slen
to the number of input elements converted.
If an error occurs but the output buffer has already been populated,
a short return will occur and the next iteration where the error is
the first element will return the error.
The caller is responsible for processing any error conditions before
continuing.
.Pp
The
.Fn mbintowcr ,
.Fn mbintowcr_l
and
.Fn utf8towcr
functions can return a (size_t)-1 error if
.Dv WCSBIN_STRICT
is specified, and otherwise cannot.
.Pp
The
.Fn wcrtombin ,
.Fn wcrtombin_l
and
.Fn wcrtoutf8
functions can return a (size_t)-1 error if given an illegal wchar code,
as modified by
.Fa flags .
Any wchar code >= 0x80000000U always causes an error to be returned.
.Sh ERRORS
If an error is returned, errno will be set to
.Er EILSEQ .
.Sh SEE ALSO
.Xr mbtowc 3 ,
.Xr multibyte 3 ,
.Xr setlocale 3 ,
.Xr wcrtomb 3 ,
.Xr xlocale 3
.Sh STANDARDS
The
.Fn mbintowcr ,
.Fn mbintowcr_l ,
.Fn utf8towcr ,
.Fn wcrtombin ,
.Fn wcrtombin_l
and
.Fn wcrtoutf8
functions are non-standard extensions to libc.
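.Sh EXAMPLES
The UTF-8B escaping described in this page can be illustrated without
calling libc at all.
The following is a minimal, portable sketch of the idea and is not the
libc implementation.
All of the
.Li sketch_
names are hypothetical, code points are held in
.Vt uint32_t
rather than
.Vt wchar_t ,
no flags are modeled, and the whole input is assumed to be one complete
buffer.
As a simplification, every illegal sequence escapes only its first byte and
then reprocesses the rest, which still leaves every byte of a non-canonical
or surrogate encoding escaped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: decode one UTF-8 sequence from s (n >= 1 bytes).
 * Returns the sequence length, or 0 if the bytes are malformed, truncated,
 * non-canonical, a surrogate encoding, a 5/6-byte form, or beyond 0x10FFFF.
 */
size_t
sketch_decode1(const unsigned char *s, size_t n, uint32_t *cp)
{
	static const uint32_t minval[5] = { 0, 0, 0x80, 0x800, 0x10000 };
	size_t len, i;
	uint32_t c;

	if (s[0] < 0x80) {
		*cp = s[0];
		return 1;
	} else if ((s[0] & 0xE0) == 0xC0) {
		len = 2; c = s[0] & 0x1F;
	} else if ((s[0] & 0xF0) == 0xE0) {
		len = 3; c = s[0] & 0x0F;
	} else if ((s[0] & 0xF8) == 0xF0) {
		len = 4; c = s[0] & 0x07;
	} else {
		return 0;		/* stray continuation or 5/6-byte lead */
	}
	if (n < len)
		return 0;		/* truncated at end of buffer */
	for (i = 1; i < len; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return 0;	/* malformed continuation byte */
		c = (c << 6) | (s[i] & 0x3F);
	}
	if (c < minval[len] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
		return 0;
	*cp = c;
	return len;
}

/* Bytes to code points; dst needs room for slen elements. */
size_t
sketch_bytes_to_codes(const unsigned char *src, size_t slen, uint32_t *dst)
{
	size_t i = 0, o = 0, len;

	while (i < slen) {
		len = sketch_decode1(src + i, slen - i, &dst[o]);
		if (len == 0) {
			/* Escape one byte: 0x80-0xFF becomes U+DC80-U+DCFF. */
			dst[o] = 0xDC00 | src[i];
			len = 1;
		}
		o++;
		i += len;
	}
	return o;
}

/* Code points (<= 0x10FFFF) to bytes; dst needs room for 4 * slen bytes. */
size_t
sketch_codes_to_bytes(const uint32_t *src, size_t slen, unsigned char *dst)
{
	size_t i, o = 0;
	uint32_t c;

	for (i = 0; i < slen; i++) {
		c = src[i];
		if (c >= 0xDC80 && c <= 0xDCFF) {
			dst[o++] = (unsigned char)(c & 0xFF);	/* de-escape */
		} else if (c < 0x80) {
			dst[o++] = (unsigned char)c;
		} else if (c < 0x800) {
			dst[o++] = (unsigned char)(0xC0 | (c >> 6));
			dst[o++] = (unsigned char)(0x80 | (c & 0x3F));
		} else if (c < 0x10000) {
			/* Other surrogates pass through as 3-byte codes. */
			dst[o++] = (unsigned char)(0xE0 | (c >> 12));
			dst[o++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
			dst[o++] = (unsigned char)(0x80 | (c & 0x3F));
		} else {
			dst[o++] = (unsigned char)(0xF0 | (c >> 18));
			dst[o++] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
			dst[o++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
			dst[o++] = (unsigned char)(0x80 | (c & 0x3F));
		}
	}
	return o;
}

/* Returns 1 if in[0..len) survives bytes -> codes -> bytes unchanged. */
int
sketch_roundtrip_ok(const unsigned char *in, size_t len)
{
	uint32_t codes[64];
	unsigned char out[64];	/* enough here: output never exceeds input */
	size_t n, m;

	assert(len <= 64);
	n = sketch_bytes_to_codes(in, len, codes);
	m = sketch_codes_to_bytes(codes, n, out);
	return m == len && memcmp(in, out, len) == 0;
}
```

Passing arbitrary bytes, for example a valid sequence followed by an overlong
encoding and a lone 0xFF, to the hypothetical
.Li sketch_roundtrip_ok
should return 1, which is the 8-bit-clean property in miniature.
Real applications should call the libc functions documented above, which
also handle the flags, chained buffers and locale selection that this sketch
leaves out.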