.\"-
.\" Copyright (c) 2015 Matthew Dillon
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.Dd August 24, 2015
.Dt MBINTOWCR 3
.Os
.Sh NAME
.Nm mbintowcr ,
.Nm mbintowcr_l ,
.Nm utf8towcr ,
.Nm wcrtombin ,
.Nm wcrtombin_l ,
.Nm wcrtoutf8
.Nd "8-bit-clean wchar conversion w/escaping or validation"
.Sh LIBRARY
.Lb libc
.Sh SYNOPSIS
.In wchar.h
.Ft size_t
.Fo mbintowcr
.Fa "wchar_t * restrict dst" "const char * restrict src"
.Fa "size_t dlen" "size_t *slen" "int flags"
.Fc
.Ft size_t
.Fo utf8towcr
.Fa "wchar_t * restrict dst" "const char * restrict src"
.Fa "size_t dlen" "size_t *slen" "int flags"
.Fc
.Ft size_t
.Fo wcrtombin
.Fa "char * restrict dst" "const wchar_t * restrict src"
.Fa "size_t dlen" "size_t *slen" "int flags"
.Fc
.Ft size_t
.Fo wcrtoutf8
.Fa "char * restrict dst" "const wchar_t * restrict src"
.Fa "size_t dlen" "size_t *slen" "int flags"
.Fc
.In xlocale.h
.Ft size_t
.Fo mbintowcr_l
.Fa "wchar_t * restrict dst" "const char * restrict src"
.Fa "size_t dlen" "size_t *slen" "locale_t locale" "int flags"
.Fc
.Ft size_t
.Fo wcrtombin_l
.Fa "char * restrict dst" "const wchar_t * restrict src"
.Fa "size_t dlen" "size_t *slen" "locale_t locale" "int flags"
.Fc
.Sh DESCRIPTION
The
.Fn mbintowcr
and
.Fn wcrtombin
functions translate byte data into wide-char format and back again.
Under normal conditions (but not with all flags) these functions
guarantee that the round-trip will be 8-bit-clean.
Care must be taken to specify the
.Dv WCSBIN_EOF
flag so that trailing incomplete sequences at stream EOF are handled properly.
.Pp
For the "C" locale these functions are 1:1 (do not convert UTF-8).
For UTF-8 locales these functions convert to/from UTF-8.
Most of the discussion below pertains to UTF-8 translations.
.Pp
The
.Fn utf8towcr
and
.Fn wcrtoutf8
functions do exactly the same thing as the above functions but are locked
to the UTF-8 locale.
That is, these functions work regardless of which locale has been selected
and also do not require any initial
.Fn setlocale
call to initialize.
Applications working explicitly in UTF-8 should use these versions.
.Pp
Any illegal sequences will be escaped using UTF-8B (U+DC80 - U+DCFF).
Illegal sequences include surrogate-space encodings, non-canonical encodings,
codings beyond 0x10FFFF, 5-byte and 6-byte codings (which are no longer legal),
and malformed codings.
Flags may be used to modify this behavior.
.Pp
The
.Fn mbintowcr
function takes generic 8-bit byte data as its input which the caller
expects to be loosely coded in UTF-8 and converts it to an array of
.Vt wchar_t ,
and returns the number of
.Vt wchar_t
that were converted.
The caller must set
.Fa *slen
to the number of bytes in the input buffer and the function will
set
.Fa *slen
on return to the number of bytes in the input buffer that were processed.
.Pp
Fewer bytes than specified might be processed due to the output buffer
reaching its limit or due to an incomplete sequence at the end of the input
buffer when the
.Dv WCSBIN_EOF
flag has not been specified.
.Pp
If processing a stream, the caller
typically copies any unprocessed data at the end of the buffer back to
the beginning and then continues loading the buffer from there.
Be sure to check for an incomplete translation at stream EOF and do a
final translation of the remainder with the
.Dv WCSBIN_EOF
flag set.
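.Pp
The copy-down loop described above can be sketched roughly as follows.
This is an illustration only, not the library's own code:
.Fn convert_stream ,
.Fn demo_towcr
and the
.Fa eof_flag
parameter are hypothetical names introduced here.
The converter is passed as a function pointer so that the sketch also builds
on systems that lack these functions;
.Fn demo_towcr
is merely a 1:1 stand-in and never leaves a partial sequence behind.
.Bd -literal -offset indent
```c
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Signature shared by mbintowcr() and utf8towcr(); see SYNOPSIS. */
typedef size_t (*towcr_fn)(wchar_t *, const char *, size_t, size_t *, int);

/*
 * Stand-in converter (hypothetical): widens bytes 1:1 and ignores flags,
 * so the loop can be exercised where the real functions are unavailable.
 */
static size_t
demo_towcr(wchar_t *dst, const char *src, size_t dlen, size_t *slen, int flags)
{
	size_t i, n = (*slen < dlen) ? *slen : dlen;

	(void)flags;
	for (i = 0; i < n; i++) {
		if (dst != NULL)
			dst[i] = (unsigned char)src[i];
	}
	*slen = n;
	return (n);
}

/*
 * Hypothetical helper: convert all of fp and return the number of
 * wchar_t produced, or (size_t)-1 on a conversion error.
 * eof_flag is the value to OR in for the final call (WCSBIN_EOF).
 */
static size_t
convert_stream(FILE *fp, towcr_fn conv, int flags, int eof_flag)
{
	char ibuf[4096];
	wchar_t obuf[4096];
	size_t have = 0, total = 0;

	for (;;) {
		size_t n, r, slen;
		int eof;

		/* Append new data after any bytes left from the last pass. */
		n = fread(ibuf + have, 1, sizeof(ibuf) - have, fp);
		eof = (n == 0);
		have += n;
		if (have == 0)
			break;

		slen = have;
		r = conv(obuf, ibuf, sizeof(obuf) / sizeof(obuf[0]), &slen,
		    eof ? (flags | eof_flag) : flags);
		if (r == (size_t)-1)
			return (r);
		total += r;		/* obuf[0..r) would be consumed here */

		/* Copy the unprocessed tail down to the start of the buffer. */
		memmove(ibuf, ibuf + slen, have - slen);
		have -= slen;
		if (eof && (have == 0 || slen == 0))
			break;
	}
	return (total);
}
```
.Ed
.Pp
On a system that provides these functions, the call would presumably be
.Li "convert_stream(fp, utf8towcr, 0, WCSBIN_EOF)" .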
.Pp
This function will always generate escapes for illegal UTF-8 code sequences
and by default can produce a clean BYTE-WCHAR-BYTE conversion.
See the flags description later on.
.Pp
This function cannot return an error unless the
.Dv WCSBIN_STRICT
flag is set.
In case of error, any valid conversions are returned first and the caller
is expected to iterate.
The error is returned when it becomes the first element of the buffer.
.Pp
A
.Dv NULL
destination buffer may be specified in which case this function operates
identically except for actually trying to fill the buffer.
This feature is typically used for validation with
.Dv WCSBIN_STRICT
and sometimes also used in combination with
.Dv WCSBIN_SURRO
(set if you want to allow surrogates).
.Pp
The
.Fn wcrtombin
function takes an array of
.Vt wchar_t
as its input which is usually expected to be well-formed and converts it
to an array of generic 8-bit byte data.
The caller must set
.Fa *slen
to the number of elements in the input buffer and the function will set
.Fa *slen
on return to the number of elements in the input buffer that were processed.
.Pp
Be sure to properly set the
.Dv WCSBIN_EOF
flag for the last buffer at stream EOF.
.Pp
This function can return an error regardless of the flags if a supplied
wchar code is out of range.
Some flags change the range of allowed wchar codes.
In case of error, any valid conversions are returned first and the
caller is expected to iterate.
The error is returned when it becomes the first element of the buffer.
.Pp
A
.Dv NULL
destination buffer may be specified in which case this function operates
identically except for actually trying to fill the buffer.
This feature is typically used for validation with or without
.Dv WCSBIN_STRICT
and sometimes also used in combination with
.Dv WCSBIN_SURRO .
.Pp
One final note on the use of
.Dv WCSBIN_SURRO
for wchars-to-bytes.
If this flag
is not set, surrogates in the escape range will be de-escaped (giving us our
8-bit-clean round-trip), and other surrogates will be passed through as UTF-8
encodings.
In
.Dv WCSBIN_STRICT
mode this flag works slightly differently.
If not specified, no surrogates are allowed at all (escaped or otherwise),
and if specified, all surrogates are allowed and will never be de-escaped.
.Pp
The _l-suffixed versions of
.Fn mbintowcr
and
.Fn wcrtombin
take an explicit
.Fa locale
argument, whereas the
non-suffixed versions use the current global or per-thread locale.
.Sh UTF-8B ESCAPE SEQUENCES
Escaping is handled by converting one or more bytes in the byte sequence to
the UTF-8B escape wchar (U+DC80 - U+DCFF).
Most illegal sequences escape the first byte and then reprocess the remaining
bytes.
An illegal byte
sequence length (5 or 6 bytes), non-canonical encoding, or illegal wchar value
(beyond 0x10FFFF if not modified by flags) will escape all bytes in the
sequence as long as they were not malformed.
.Pp
When converting back to a byte-sequence, if not modified by flags, UTF-8B
escape wchars are converted back to their original bytes.
Other surrogate codes (U+D800 - U+DFFF which are normally illegal) will be
passed through and encoded as UTF-8.
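.Pp
The escape range above can be illustrated with a few helpers.
This is a sketch that assumes the conventional UTF-8B mapping, in which an
undecodable byte 0xNN (always 0x80 or above) becomes the wchar U+DCNN.
The helper names are hypothetical and are not part of libc.
.Bd -literal -offset indent
```c
#include <stdbool.h>
#include <wchar.h>

/* Escape one undecodable byte (0x80-0xFF) as U+DC80-U+DCFF. */
static wchar_t
utf8b_escape(unsigned char byte)
{
	return ((wchar_t)(0xDC00 + byte));
}

/* True if wc lies in the UTF-8B escape range. */
static bool
utf8b_is_escape(wchar_t wc)
{
	return (wc >= 0xDC80 && wc <= 0xDCFF);
}

/* Recover the original byte from an escape wchar. */
static unsigned char
utf8b_unescape(wchar_t wc)
{
	return ((unsigned char)(wc - 0xDC00));
}
```
.Ed
.Pp
Under this assumption a stray byte 0xC3 would surface as U+DCC3 in the
.Vt wchar_t
output and would be restored to 0xC3 on the way back.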
.Sh FLAGS
.Bl -tag -width ".Dv WCSBIN_LONGCODES"
.It Dv WCSBIN_EOF
Indicate that the input buffer represents the last of the input stream.
This causes any partial sequences at the end of the input buffer to be
processed.
.It Dv WCSBIN_SURRO
This flag passes through any surrogate codes that are already UTF-8-encoded.
This is normally illegal but if you are processing a stream which has already
been UTF-8B escaped this flag will prevent the U+DC80 - U+DCFF codes from
being re-escaped in the bytes-to-wchars direction and will prevent decoding
back to the original bytes in the wchars-to-bytes direction.
This flag is sometimes used on input if the
caller expects the input stream to already be escaped, and not usually used
on output unless the caller explicitly wants to encode to an intermediate
illegal UTF-8 encoding that retains the escapes as escapes.
.Pp
This flag does not prevent additional escapes from being translated on
bytes-to-wchars
.Dv ( WCSBIN_STRICT
prevents escaping on bytes-to-wchars), but
will prevent de-escaping on wchars-to-bytes.
.Pp
This flag breaks round-trip 8-bit-clean operation since escape codes use
the surrogate space and will mix with surrogates that are passed through
on input by this flag in a way that cannot be distinguished.
.It Dv WCSBIN_LONGCODES
Specifying this flag in the bytes-to-wchars direction allows for decoding
of legacy 5-byte and 6-byte sequences as well as 4-byte sequences which
would normally be illegal.
These sequences are illegal and this flag should
not normally be used unless the caller explicitly wants to handle the legacy
case.
.Pp
Specifying this flag in the wchars-to-bytes direction allows normally illegal
wchars to be encoded.
Again, not recommended.
.Pp
This flag does not allow decoding non-canonical sequences.
Such sequences will still be escaped.
.It Dv WCSBIN_STRICT
This flag forces strict parsing in the bytes-to-wchars direction and will
cause
.Fn mbintowcr
to process short or return with an error once processing reaches the
illegal coding rather than escaping the illegal sequence.
This flag is usually specified only when the caller desires to validate
a UTF-8 buffer.
Remember that an error may also be present with return values greater than 0.
A partial sequence at the end of the buffer is not
considered to be an error unless
.Dv WCSBIN_EOF
is also specified.
.Pp
The caller is reminded that when using this feature for validation, a
short-return can happen rather than an error if the error is not at the
base of the source or if
.Dv WCSBIN_EOF
is not specified.
If the caller is not chaining buffers then
.Dv WCSBIN_EOF
should be specified and a simple check of whether
.Fa *slen
equals the original input buffer length on return is sufficient to determine
if an error occurred or not.
If the caller is chaining buffers,
.Dv WCSBIN_EOF
is not specified and the caller must proceed with the copy-down / continued
buffer loading loop to distinguish between an incomplete buffer and an error.
.El
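.Pp
The single-buffer validation check described under
.Dv WCSBIN_STRICT
might look like the sketch below.
.Fn utf8_buffer_is_valid
is a hypothetical helper name.
Passing
.Fa len
as
.Fa dlen
together with a
.Dv NULL
destination is an assumption: it can never be too small because every output
element consumes at least one input byte.
The preprocessor guard only lets the fragment compile where the
.Dv WCSBIN_*
flags are not defined.
.Bd -literal -offset indent
```c
#include <stddef.h>
#include <wchar.h>

/*
 * Hypothetical helper: returns 1 if buf[0..len) is strictly valid UTF-8,
 * 0 if it is not, and -1 where the WCSBIN_* flags are not available.
 */
static int
utf8_buffer_is_valid(const char *buf, size_t len)
{
#if defined(WCSBIN_STRICT) && defined(WCSBIN_EOF)
	size_t slen = len;
	size_t r;

	/* NULL dst: validate only, nothing is stored. */
	r = utf8towcr(NULL, buf, len, &slen, WCSBIN_STRICT | WCSBIN_EOF);
	if (r == (size_t)-1)
		return (0);
	/* A short return means an error lies further into the buffer. */
	return (slen == len);
#else
	(void)buf;
	(void)len;
	return (-1);
#endif
}
```
.Ed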
.Sh RETURN VALUES
The
.Fn mbintowcr ,
.Fn mbintowcr_l ,
.Fn utf8towcr ,
.Fn wcrtombin ,
.Fn wcrtombin_l
and
.Fn wcrtoutf8
functions return the number of output elements generated and set
.Fa *slen
to the number of input elements converted.
If an error occurs but the output buffer has already been populated,
a short return will occur and the next iteration where the error is
the first element will return the error.
The caller is responsible for processing any error conditions before
continuing.
.Pp
The
.Fn mbintowcr ,
.Fn mbintowcr_l
and
.Fn utf8towcr
functions can return a (size_t)-1 error if
.Dv WCSBIN_STRICT
is specified, and otherwise cannot.
.Pp
The
.Fn wcrtombin ,
.Fn wcrtombin_l
and
.Fn wcrtoutf8
functions can return a (size_t)-1 error if given an illegal wchar code,
as modified by
.Fa flags .
Any wchar code >= 0x80000000U always causes an error to be returned.
.Sh ERRORS
If an error is returned,
.Va errno
will be set to
.Er EILSEQ .
.Sh SEE ALSO
.Xr mbtowc 3 ,
.Xr multibyte 3 ,
.Xr setlocale 3 ,
.Xr wcrtomb 3 ,
.Xr xlocale 3
.Sh STANDARDS
The
.Fn mbintowcr ,
.Fn mbintowcr_l ,
.Fn utf8towcr ,
.Fn wcrtombin ,
.Fn wcrtombin_l
and
.Fn wcrtoutf8
functions are non-standard extensions to libc.
348