.\" $OpenBSD: mbrtowc.3,v 1.7 2023/09/12 08:33:37 jsg Exp $
.\" $NetBSD: mbrtowc.3,v 1.5 2003/09/08 17:54:31 wiz Exp $
.\"
.\" Copyright (c)2023 Ingo Schwarze <schwarze@openbsd.org>
.\" Copyright (c)2010 Stefan Sperling <stsp@openbsd.org>
.\" Copyright (c)2002 Citrus Project,
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.Dd $Mdocdate: September 12 2023 $
.Dt MBRTOWC 3
.Os
.Sh NAME
.Nm mbrtowc ,
.Nm mbrtoc32
.Nd convert a multibyte character to a wide character (restartable)
.Sh SYNOPSIS
.In wchar.h
.Ft size_t
.Fo mbrtowc
.Fa "wchar_t * restrict wc"
.Fa "const char * restrict s"
.Fa "size_t n"
.Fa "mbstate_t * restrict mbs"
.Fc
.In uchar.h
.Ft size_t
.Fo mbrtoc32
.Fa "char32_t * restrict wc"
.Fa "const char * restrict s"
.Fa "size_t n"
.Fa "mbstate_t * restrict mbs"
.Fc
.Sh DESCRIPTION
The
.Fn mbrtowc
and
.Fn mbrtoc32
functions examine at most
.Fa n
bytes of the multibyte character byte string pointed to by
.Fa s ,
convert those bytes to a wide character, and store the wide character into
.Pf * Fa wc
if
.Fa wc
is not
.Dv NULL
and
.Fa s
points to a valid character.
.Pp
Conversion happens in accordance with the conversion state
.Pf * Fa mbs ,
which must be initialized to zero before the application's first call to
.Fn mbrtowc
or
.Fn mbrtoc32 .
If the previous call did not return
.Po Vt size_t Pc Ns \-1 ,
.Fa mbs
can safely be reused without reinitialization.
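.Pp
The following is a minimal sketch of this initialization rule, not a
definitive implementation.
The helper name
.Fn mbs_to_wide
is hypothetical and not part of libc.
It zeroes a local state object once and then reuses it for every call
that does not report an error:

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper, not part of libc: convert the NUL-terminated
 * multibyte string src into at most dstlen wide characters, including
 * the terminating NUL.  Returns the number of wide characters stored
 * before the NUL, or (size_t)-1 on an encoding error or if the string
 * ends in an incomplete character.
 */
size_t
mbs_to_wide(wchar_t *dst, size_t dstlen, const char *src)
{
	mbstate_t mbs;
	size_t len, n;

	if (dstlen == 0)
		return (size_t)-1;

	/* The state must be zeroed before the first call. */
	memset(&mbs, 0, sizeof(mbs));

	for (n = 0; n + 1 < dstlen; n++) {
		len = mbrtowc(&dst[n], src, MB_CUR_MAX, &mbs);
		if (len == 0)
			return n;	/* dst[n] now holds the wide NUL */
		if (len == (size_t)-1 || len == (size_t)-2)
			return (size_t)-1;
		/* No error was returned, so mbs is reused as it is. */
		src += len;
	}
	dst[n] = L'\0';		/* destination full: truncate */
	return n;
}
```

The sketch relies on
.Fa src
being NUL-terminated, which is why it can pass
.Dv MB_CUR_MAX
as the byte limit.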
.Pp
The input encoding that
.Fn mbrtowc
and
.Fn mbrtoc32
use for
.Fa s
is determined by the
.Dv LC_CTYPE
category of the current locale.
If the locale is changed without reinitialization of
.Pf * Fa mbs ,
the behaviour is undefined.
.Pp
Unlike
.Xr mbtowc 3 ,
.Fn mbrtowc
and
.Fn mbrtoc32
accept an incomplete byte sequence pointed to by
.Fa s
which does not form a complete character but is potentially part of
a valid character.
In this case, both functions consume all such bytes.
The conversion state saved in
.Pf * Fa mbs
will be used to restart the suspended conversion during the next call.
.Pp
On systems other than
.Ox
that support state-dependent encodings,
.Fa s
may point to a special sequence of bytes called a
.Dq shift sequence .
Shift sequences switch between character code sets available within an
encoding scheme.
One encoding scheme using shift sequences is ISO/IEC 2022-JP, which
can switch e.g. from ASCII (which uses one byte per character) to
JIS X 0208 (which uses two bytes per character).
Shift sequence bytes correspond to no individual wide character, so
.Fn mbrtowc
and
.Fn mbrtoc32
treat them as if they were part of the subsequent multibyte character.
Therefore they do contribute to the number of bytes in the multibyte character.
.Pp
The following arguments cause special processing:
.Bl -tag -width 012345678901
.It Fa wc No == Dv NULL
The conversion from a multibyte character to a wide character is performed
and the conversion state may be affected, but the resulting wide character
is discarded.
This can be used to find out how many bytes are contained in the
multibyte character pointed to by
.Fa s .
.It Fa s No == Dv NULL
The arguments
.Fa wc
and
.Fa n
are ignored and starting or continuing the conversion with an empty string
is attempted, discarding the conversion result.
If conversion succeeds, this call always returns zero.
Unlike
.Xr mbtowc 3 ,
the value returned does not indicate whether the current encoding of
the locale is state-dependent, i.e. uses shift sequences.
.It Fa mbs No == Dv NULL
.Fn mbrtowc
and
.Fn mbrtoc32
each use their own internal state object instead of the
.Fa mbs
argument.
Both internal state objects are initialized at startup time of the program,
and no other libc function ever changes either of them.
.Pp
If
.Fn mbrtowc
or
.Fn mbrtoc32
is called with a
.Dv NULL
.Fa mbs
argument and that call returns
.Po Vt size_t Pc Ns \-1 ,
the internal conversion state of the respective function becomes
permanently undefined and there is no way to reset it to any defined state.
Consequently, after such a mishap, it is not safe
to call the same function with a
.Dv NULL
.Fa mbs
argument ever again until the program is terminated.
.El
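.Pp
The
.Fa wc No == Dv NULL
case can be sketched as follows.
The helper name
.Fn first_char_len
is hypothetical.
It passes its own zeroed state object rather than a
.Dv NULL
.Fa mbs ,
so an error cannot damage the internal state described above:

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper, not part of libc: return the number of bytes in
 * the first multibyte character of the NUL-terminated string s, 0 if s
 * is empty, or the (size_t)-1 or (size_t)-2 result of mbrtowc().
 */
size_t
first_char_len(const char *s)
{
	mbstate_t mbs;

	memset(&mbs, 0, sizeof(mbs));
	/* wc == NULL: convert, but discard the wide character. */
	return mbrtowc(NULL, s, MB_CUR_MAX, &mbs);
}
```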
.Sh RETURN VALUES
.Bl -tag -width 012345678901
.It 0
The bytes pointed to by
.Fa s
form a terminating NUL character.
If
.Fa wc
is not
.Dv NULL ,
a NUL wide character has been stored in the wchar_t object pointed to by
.Fa wc .
.It positive
.Fa s
points to a valid character, and the value returned is the number of
bytes completing the character.
If
.Fa wc
is not
.Dv NULL ,
the corresponding wide character has been stored in the wchar_t object
pointed to by
.Fa wc .
.It Po Vt size_t Pc Ns \-1
.Fa s
points to an illegal byte sequence which does not form a valid multibyte
character in the current locale, or
.Fa mbs
points to an invalid or uninitialized object.
.Va errno
is set to
.Er EILSEQ
or
.Er EINVAL ,
respectively.
The conversion state object pointed to by
.Fa mbs
is left in an undefined state and must be reinitialized before being
used again.
.Pp
Because applications using
.Fn mbrtowc
or
.Fn mbrtoc32
are shielded from the specifics of the multibyte character encoding scheme,
it is impossible to repair byte sequences containing encoding errors.
Such byte sequences must be treated as invalid and potentially malicious input.
Applications must stop processing the byte string pointed to by
.Fa s
and either discard any wide characters already converted, or cope with
truncated input.
.It Po Vt size_t Pc Ns \-2
.Fa s
points to an incomplete byte sequence of length
.Fa n
which has been consumed and contains part of a valid multibyte character.
The character may be completed by calling the same function again with
.Fa s
pointing to one or more subsequent bytes of the multibyte character and
.Fa mbs
pointing to the conversion state object used during conversion of the
incomplete byte sequence.
.It Po Vt size_t Pc Ns \-3
The next character resulting from a previous call has been stored into
.Fa wc ,
without consuming any additional bytes from
.Fa s .
This never happens for
.Fn mbrtowc ,
and on
.Ox ,
it never happens for
.Fn mbrtoc32
either.
.El
.Sh ERRORS
.Fn mbrtowc
and
.Fn mbrtoc32
cause an error in the following cases:
.Bl -tag -width Er
.It Bq Er EILSEQ
.Fa s
points to an invalid multibyte character.
.It Bq Er EINVAL
.Fa mbs
points to an invalid or uninitialized
.Vt mbstate_t
object.
.El
.Sh SEE ALSO
.Xr mbrlen 3 ,
.Xr mbtowc 3 ,
.Xr setlocale 3 ,
.Xr wcrtomb 3
.Sh STANDARDS
.Fn mbrtowc
conforms to
.St -isoC-amd1 .
The restrict qualifier was added at
.St -isoC-99 .
.Pp
.Fn mbrtoc32
conforms to
.St -isoC-2011 .
.Sh HISTORY
.Fn mbrtowc
has been available since
.Ox 3.8
and has provided support for UTF-8 since
.Ox 4.8 .
.Pp
.Fn mbrtoc32
has been available since
.Ox 7.4 .
.Sh CAVEATS
.Fn mbrtowc
and
.Fn mbrtoc32
are not suitable for programs that care about internals of the character
encoding scheme used by the byte string pointed to by
.Fa s .
.Pp
It is possible that these functions
fail because of locale configuration errors.
An
.Dq invalid
character sequence may simply be encoded in a different encoding than that
of the current locale.
.Pp
The special cases for
.Fa s No == Dv NULL
and
.Fa mbs No == Dv NULL
do not make any sense.
Instead of passing
.Dv NULL
for
.Fa mbs ,
.Xr mbtowc 3
can be used.
.Pp
Earlier versions of this man page implied that calling
.Fn mbrtowc
with a
.Dv NULL
.Fa s
argument would always set
.Fa mbs
to the initial conversion state.
But this is true only if the previous call to
.Fn mbrtowc
using
.Fa mbs
did not return
.Po Vt size_t Pc Ns \-1
or
.Po Vt size_t Pc Ns \-2 .
It is recommended to zero the
.Vt mbstate_t
object instead.
333