1.\" $OpenBSD: mbrtowc.3,v 1.3 2010/12/05 14:59:49 stsp Exp $
2.\" $NetBSD: mbrtowc.3,v 1.5 2003/09/08 17:54:31 wiz Exp $
3.\"
4.\" Copyright (c)2002 Citrus Project,
5.\" All rights reserved.
6.\"
7.\" Redistribution and use in source and binary forms, with or without
8.\" modification, are permitted provided that the following conditions
9.\" are met:
10.\" 1. Redistributions of source code must retain the above copyright
11.\"    notice, this list of conditions and the following disclaimer.
12.\" 2. Redistributions in binary form must reproduce the above copyright
13.\"    notice, this list of conditions and the following disclaimer in the
14.\"    documentation and/or other materials provided with the distribution.
15.\"
16.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
17.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
18.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
19.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
20.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
21.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
22.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
23.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
24.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
25.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
26.\" SUCH DAMAGE.
27.\"
.Dd $Mdocdate: December 5 2010 $
.Dt MBRTOWC 3
.Os
.Sh NAME
.Nm mbrtowc
.Nd converts a multibyte character to a wide character (restartable)
.Sh SYNOPSIS
.Fd #include <wchar.h>
.Ft size_t
.Fn mbrtowc "wchar_t * restrict wc" "const char * restrict s" "size_t n" \
"mbstate_t * restrict mbs"
.Sh DESCRIPTION
The
.Fn mbrtowc
function examines at most
.Fa n
bytes of the multibyte character byte string pointed to by
.Fa s ,
converts those bytes to a wide character, and stores the wide character
in the wchar_t object pointed to by
.Fa wc
if
.Fa wc
is not
.Dv NULL
and
.Fa s
points to a valid character.
.Pp
Conversion happens in accordance with the conversion state described
by the mbstate_t object pointed to by
.Fa mbs .
The mbstate_t object must be initialized to zero before the application's
first call to
.Fn mbrtowc .
If the previous call to
.Fn mbrtowc
did not return (size_t)-1, the mbstate_t object can safely be reused
without reinitialization.
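.Pp
As a minimal sketch of the initialization requirement above, the following
helper zeroes a conversion state with
.Xr memset 3
and checks the result with
.Xr mbsinit 3 .
The name
.Fn fresh_state
is hypothetical and used only for illustration.
.Bd -literal -offset indent
```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: put *mbs into the initial conversion state. */
int
fresh_state(mbstate_t *mbs)
{
	memset(mbs, 0, sizeof(*mbs));
	/* mbsinit() returns non-zero for an initial conversion state. */
	return mbsinit(mbs) != 0;
}
```
.Ed
.Pp
The same helper could be called again after
.Fn mbrtowc
has returned (size_t)-1, because the state is undefined at that point.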
.Pp
The behaviour of
.Fn mbrtowc
is affected by the
.Dv LC_CTYPE
category of the current locale.
If the locale is changed without reinitialization of the mbstate_t object
pointed to by
.Fa mbs ,
the behaviour of
.Fn mbrtowc
is undefined.
.Pp
Unlike
.Xr mbtowc 3 ,
.Fn mbrtowc
will accept an incomplete byte sequence pointed to by
.Fa s
which does not form a complete character but is potentially part of
a valid character.
In this case,
.Fn mbrtowc
consumes all such bytes.
The conversion state saved in the mbstate_t object pointed to by
.Fa mbs
will be used to restart the suspended conversion during the next
call to
.Fn mbrtowc .
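.Pp
The restart behaviour can be sketched by handing
.Fn mbrtowc
a single byte per call and letting the mbstate_t object carry the partial
character between calls.
The helper name
.Fn feed_bytewise
and its return convention are assumptions made for this illustration;
they are not part of any library.
.Bd -literal -offset indent
```c
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: convert the first character of the n-byte
 * buffer s, handing mbrtowc() a single byte per call.
 * Returns the number of bytes used, 0 for a NUL character,
 * or (size_t)-1 on an encoding error or if the buffer ends too early.
 */
size_t
feed_bytewise(wchar_t *wc, const char *s, size_t n)
{
	mbstate_t mbs;
	size_t i, r;

	memset(&mbs, 0, sizeof(mbs));
	for (i = 0; i < n; i++) {
		r = mbrtowc(wc, s + i, 1, &mbs);
		if (r == (size_t)-2)
			continue;	/* byte consumed, character incomplete */
		if (r == (size_t)-1)
			return r;	/* mbs is now undefined */
		return r == 0 ? 0 : i + 1;
	}
	return (size_t)-1;		/* ran out of input mid-character */
}
```
.Ed
.Pp
For a single-byte character the loop finishes on its first iteration;
for a longer sequence every call but the last returns (size_t)-2.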
.Pp
In state-dependent encodings,
.Fa s
may point to a special sequence of bytes called a
.Dq shift sequence .
Shift sequences switch between character code sets available within an
encoding scheme.
One encoding scheme using shift sequences is ISO/IEC 2022-JP, which
can switch e.g. from ASCII (which uses one byte per character) to
JIS X 0208 (which uses two bytes per character).
Shift sequence bytes correspond to no individual wide character, so
.Fn mbrtowc
treats them as if they were part of the subsequent multibyte character.
Therefore they do contribute to the number of bytes in the multibyte character.
.Pp
Special cases in interpretation of arguments are as follows:
.Bl -tag -width 012345678901
.It "wc == NULL "
The conversion from a multibyte character to a wide character is performed
and the conversion state may be affected, but the resulting wide character
is discarded.
.Pp
This can be used to find out how many bytes are contained in the
multibyte character pointed to by
.Fa s .
.It "s == NULL "
.Fn mbrtowc
ignores
.Fa wc
and
.Fa n ,
and behaves equivalently to
.Bd -literal -offset indent
mbrtowc(NULL, "", 1, mbs);
.Ed
.Pp
which attempts to use the mbstate_t object pointed to by
.Fa mbs
to start or continue conversion using the empty string as input,
and discards the conversion result.
.Pp
If conversion succeeds, this call always returns zero.
Unlike
.Xr mbtowc 3 ,
the value returned does not indicate whether the current encoding of
the locale is state-dependent, i.e. uses shift sequences.
.It "mbs == NULL "
.Fn mbrtowc
uses its own internal state object to keep the conversion state,
instead of an mbstate_t object pointed to by
.Fa mbs .
This internal conversion state is initialized once at program startup.
It is not safe to call
.Fn mbrtowc
again with a
.Dv NULL
.Fa mbs
argument if
.Fn mbrtowc
returned (size_t)-1 because at this point the internal conversion state
is undefined.
.Pp
Calling any other function in
.Em libc
never changes the internal
conversion state object of
.Fn mbrtowc .
.El
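.Pp
The
.Dv NULL
.Fa wc
case lends itself to measuring a string without storing any wide
characters.
The following is a sketch only;
.Fn count_chars
is a hypothetical name, and the choice to report both invalid and
truncated input as (size_t)-1 is an assumption made for brevity.
.Bd -literal -offset indent
```c
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: count the characters in the NUL-terminated
 * multibyte string s without storing any wide character.
 * Returns (size_t)-1 if s is invalid or ends inside a character.
 */
size_t
count_chars(const char *s)
{
	mbstate_t mbs;
	size_t left, count, r;

	memset(&mbs, 0, sizeof(mbs));
	left = strlen(s) + 1;		/* include the terminating NUL */
	count = 0;
	while ((r = mbrtowc(NULL, s, left, &mbs)) != 0) {
		if (r == (size_t)-1 || r == (size_t)-2)
			return (size_t)-1;
		s += r;			/* r bytes formed one character */
		left -= r;
		count++;
	}
	return count;
}
```
.Ed
.Pp
The sketch passes its own mbstate_t object rather than a
.Dv NULL
.Fa mbs ,
so it does not depend on the internal state object.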
.Sh RETURN VALUES
.Bl -tag -width 012345678901
.It 0
The bytes pointed to by
.Fa s
form a terminating NUL character.
If
.Fa wc
is not
.Dv NULL ,
a NUL wide character has been stored in the wchar_t object pointed to by
.Fa wc .
.It positive
.Fa s
points to a valid character, and the value returned is the number of
bytes completing the character.
If
.Fa wc
is not
.Dv NULL ,
the corresponding wide character has been stored in the wchar_t object
pointed to by
.Fa wc .
.It (size_t)-1
.Fa s
points to an illegal byte sequence which does not form a valid multibyte
character in the current locale.
.Fn mbrtowc
sets
.Va errno
to EILSEQ.
The conversion state object pointed to by
.Fa mbs
is left in an undefined state and must be reinitialized before being
used again.
.Pp
Because applications using
.Fn mbrtowc
are shielded from the specifics of the multibyte character encoding scheme,
it is impossible to repair byte sequences containing encoding errors.
Such byte sequences must be treated as invalid and potentially malicious input.
Applications must stop processing the byte string pointed to by
.Fa s
and either discard any wide characters already converted, or cope with
truncated input.
.It (size_t)-2
.Fa s
points to an incomplete byte sequence of length
.Fa n
which has been consumed and contains part of a valid multibyte character.
.Fn mbrtowc
sets
.Va errno
to EILSEQ.
The character may be completed by calling
.Fn mbrtowc
again with
.Fa s
pointing to one or more subsequent bytes of the multibyte character and
.Fa mbs
pointing to the conversion state object used during conversion of the
incomplete byte sequence.
.El
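.Sh EXAMPLES
The return values above can be handled together in one conversion loop.
The following is a sketch under stated assumptions, not a reference
implementation:
.Fn convert_buf
is a hypothetical helper, and it treats a buffer that ends in the middle
of a character the same way as invalid input.
.Bd -literal -offset indent
```c
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: convert the n-byte buffer src into at most
 * max wide characters stored in dst, stopping after a NUL character.
 * Returns the number of wide characters stored (counting the NUL if
 * one was converted), or (size_t)-1 if the input is invalid or
 * truncated.
 */
size_t
convert_buf(wchar_t *dst, size_t max, const char *src, size_t n)
{
	mbstate_t mbs;
	size_t out, r;

	memset(&mbs, 0, sizeof(mbs));
	out = 0;
	while (out < max && n > 0) {
		r = mbrtowc(&dst[out], src, n, &mbs);
		if (r == (size_t)-1)
			return r;	/* invalid sequence; stop processing */
		if (r == (size_t)-2)
			return (size_t)-1;	/* buffer ends mid-character */
		out++;
		if (r == 0)
			break;		/* NUL stored; end of string */
		src += r;
		n -= r;
	}
	return out;
}
```
.Ed
.Pp
A caller that receives further input later could instead keep the
mbstate_t object after a (size_t)-2 return and resume with the next
bytes.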
.Sh ERRORS
The
.Fn mbrtowc
function may cause an error in the following cases:
.Bl -tag -width Er
.It Bq Er EILSEQ
.Fa s
points to an invalid or incomplete multibyte character.
.It Bq Er EINVAL
.Fa mbs
points to an invalid or uninitialized mbstate_t object.
.El
.Sh SEE ALSO
.Xr mbrlen 3 ,
.Xr mbtowc 3 ,
.Xr setlocale 3
.Sh STANDARDS
The
.Fn mbrtowc
function conforms to
.\" .St -isoC-amd1 .
ISO/IEC 9899/AMD1:1995
.Pq Dq ISO C90, Amendment 1 .
The restrict qualifier is added at
.\" .St -isoC99 .
ISO/IEC 9899:1999
.Pq Dq ISO C99 .
.Sh CAVEATS
.Fn mbrtowc
is not suitable for programs that care about internals of the character
encoding scheme used by the byte string pointed to by
.Fa s .
.Pp
It is possible that
.Fn mbrtowc
fails because of locale configuration errors.
An
.Dq invalid
character sequence may simply be encoded in a different encoding than that
of the current locale.
.Pp
The special cases for
.Fa s
== NULL and
.Fa mbs
== NULL do not make any sense.
Instead of passing
.Dv NULL
for
.Fa mbs ,
.Xr mbtowc 3
can be used.
.Pp
Earlier versions of this man page implied that calling
.Fn mbrtowc
with a
.Dv NULL
.Fa s
argument would always set
.Fa mbs
to the initial conversion state.
But this is true only if the previous call to
.Fn mbrtowc
using
.Fa mbs
did not return (size_t)-1 or (size_t)-2.
It is recommended to zero the mbstate_t object instead.
292It is recommended to zero the mbstate_t object instead.
293