.\" $OpenBSD: mbtowc.3,v 1.8 2023/11/11 01:38:23 schwarze Exp $
.\" $NetBSD: mbtowc.3,v 1.5 2003/09/08 17:54:31 wiz Exp $
.\"
.\" Copyright (c) 2016, 2023 Ingo Schwarze <schwarze@openbsd.org>
.\" Copyright (c) 2010, 2015 Stefan Sperling <stsp@openbsd.org>
.\" Copyright (c) 2002 Citrus Project,
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.Dd $Mdocdate: November 11 2023 $
.Dt MBTOWC 3
.Os
.Sh NAME
.Nm mbtowc
.Nd converts a multibyte character to a wide character
.Sh SYNOPSIS
.In stdlib.h
.Ft int
.Fn mbtowc "wchar_t * restrict pwc" "const char * restrict s" "size_t n"
.Sh DESCRIPTION
The
.Fn mbtowc
function converts the multibyte character pointed to by
.Fa s
to a wide character, and stores it in the wchar_t object pointed to by
.Fa pwc .
This function may inspect at most
.Fa n
bytes of the array pointed to by
.Fa s .
.Pp
Unlike
.Xr mbrtowc 3 ,
the first
.Fa n
bytes pointed to by
.Fa s
need to form an entire multibyte character.
Otherwise, this function returns an error and the internal state will
be undefined.
.Pp
If a call to
.Fn mbtowc
results in an undefined internal state, parsing of the string starting at
.Fa s
cannot continue, not even at a later byte, and
.Fn mbtowc
must be called with
.Fa s
set to
.Dv NULL
to reset the internal state before it can safely be used again
on a different string.
.Pp
The behaviour of
.Fn mbtowc
is affected by the
.Dv LC_CTYPE
category of the current locale.
Calling any other function in
.Em libc
never changes the internal
state of
.Fn mbtowc ,
except for calling
.Xr setlocale 3
with the
.Dv LC_CTYPE
category set to a different locale.
Such
.Xr setlocale 3
calls cause the internal state of this function to be undefined.
.Pp
In state-dependent encodings such as ISO/IEC 2022-JP,
.Fa s
may point to a special sequence of bytes that changes the shift state.
Because such byte sequences do not correspond to any individual wide character,
.Fn mbtowc
treats them as if they were part of the subsequent multibyte character.
.Pp
The following special cases apply to the arguments:
.Bl -tag -width 012345678901
.It s == NULL
.Fn mbtowc
initializes its own internal state to the initial state, and
determines whether the current encoding is state-dependent.
.Fn mbtowc
returns 0 if the encoding is state-independent,
otherwise non-zero.
.Fa pwc
is ignored.
.It pwc == NULL
.Fn mbtowc
behaves just as if
.Fa pwc
were not
.Dv NULL ,
including modifications to internal state,
except that the result of the conversion is discarded.
This can be used to determine the size of the wide character
representation of a multibyte string.
Another use case is a check for illegal or incomplete multibyte sequences.
.It n == 0
In this case,
the first
.Fa n
bytes of the array pointed to by
.Fa s
never form a complete character and
.Fn mbtowc
always fails.
.El
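.Pp
As a minimal sketch of the cases where
.Fa s
or
.Fa pwc
is
.Dv NULL ,
the hypothetical helper
.Fn count_chars
below, which is not part of libc, counts the characters in a
NUL-terminated multibyte string without storing any wide character.
It assumes that the whole string is already in memory and that the
caller has selected the intended locale.
.Bd -literal
```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not part of libc: count the multibyte
 * characters in the NUL-terminated string s.
 * Returns -1 on an invalid or incomplete sequence.
 */
int
count_chars(const char *s)
{
	size_t	 left;
	int	 len, count;

	assert(s != NULL);
	left = strlen(s);
	count = 0;

	/* s == NULL: reset the internal state; the result is unused. */
	(void)mbtowc(NULL, NULL, 0);

	while (left > 0) {
		/* pwc == NULL: validate and measure, discard the result. */
		len = mbtowc(NULL, s, left);
		if (len == -1) {
			/* The state is undefined now, so reset it. */
			(void)mbtowc(NULL, NULL, 0);
			return -1;
		}
		if (len == 0)
			break;
		s += len;
		left -= (size_t)len;
		count++;
	}
	return count;
}
```
.Ed
.Pp
Passing the number of remaining bytes as
.Fa n
rather than
.Dv MB_CUR_MAX
keeps the sketch from depending on what follows the terminating null byte.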
.Sh RETURN VALUES
Normally,
.Fn mbtowc
returns:
.Bl -tag -width 012345678901
.It 0
.Fa s
points to a null byte
.Pq Sq \e0 .
.It positive
Number of bytes for the valid multibyte character pointed to by
.Fa s .
There are no cases where the value returned is greater than
the value of the
.Dv MB_CUR_MAX
macro.
.It -1
.Fa s
points to an invalid or an incomplete multibyte character.
.Va errno
is set to indicate the error.
.El
.Pp
When
.Fa s
is
.Dv NULL ,
.Fn mbtowc
returns:
.Bl -tag -width 0123456789
.It 0
The current encoding is state-independent.
.It non-zero
The current encoding is state-dependent.
.El
.Sh EXAMPLES
The following program parses a UTF-8 string and reports encoding errors:
.Bd -literal
#include <limits.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
	char	 s[LINE_MAX];
	wchar_t	 wc;
	int	 i, len;

	setlocale(LC_CTYPE, "C.UTF-8");
	/* Treat end of file or a read error as an empty string. */
	if (fgets(s, sizeof(s), stdin) == NULL)
		*s = '\e0';
	for (i = 0, len = 1; len != 0; i += len) {
		switch (len = mbtowc(&wc, s + i, MB_CUR_MAX)) {
		case 0:
			printf("byte %d end of string 0x00\en", i);
			break;
		case -1:
			printf("byte %d invalid 0x%0.2hhx\en", i, s[i]);
			/* Skip the offending byte and try the next one. */
			len = 1;
			break;
		default:
			/* len is the number of bytes that were consumed. */
			printf("byte %d U+%0.4X %lc\en", i, wc, wc);
			break;
		}
	}
	return 0;
}
.Ed
.Pp
Recovering from encoding errors and continuing to parse the rest of the
string as shown above is only possible for state-independent character
encodings.
For full generality, the error handling can be modified
to reset the internal state.
In that case, the rest of the string has to be skipped
if the encoding is state-dependent:
.Bd -literal
		case -1:
			printf("byte %d invalid 0x%0.2hhx\en", i, s[i]);
			/* 1 if state-independent, 0 (stop) if state-dependent */
			len = !mbtowc(NULL, NULL, MB_CUR_MAX);
			break;
.Ed
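.Pp
The same loop can be packaged as a function.
The following is only a sketch: the helper
.Fn decode_string
and its parameters are hypothetical and not part of libc.
It stores the decoded wide characters in a caller-supplied buffer,
silently truncates when the buffer is too small,
and resets the internal state before returning an error.
.Bd -literal
```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not part of libc: decode the NUL-terminated
 * multibyte string src into dst, which has room for dstlen wide
 * characters including the terminating null wide character.
 * Returns the number of wide characters stored, or -1 on error.
 */
int
decode_string(wchar_t *dst, size_t dstlen, const char *src)
{
	size_t	 left, count;
	int	 len;

	assert(dst != NULL && src != NULL && dstlen > 0);
	left = strlen(src);
	count = 0;

	/* Start from the initial conversion state. */
	(void)mbtowc(NULL, NULL, 0);

	while (left > 0 && count < dstlen - 1) {
		len = mbtowc(&dst[count], src, left);
		if (len == -1) {
			/* The state is undefined now, so reset it. */
			(void)mbtowc(NULL, NULL, 0);
			return -1;
		}
		if (len == 0)
			break;
		src += len;
		left -= (size_t)len;
		count++;
	}
	dst[count] = 0;		/* terminating null wide character */
	return (int)count;
}
```
.Ed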
.Sh ERRORS
.Fn mbtowc
will set
.Va errno
in the following cases:
.Bl -tag -width Er
.It Bq Er EILSEQ
.Fa s
points to an invalid or incomplete multibyte character.
.El
.Sh SEE ALSO
.Xr mblen 3 ,
.Xr mbrtowc 3 ,
.Xr setlocale 3
.Sh STANDARDS
The
.Fn mbtowc
function conforms to
.St -ansiC .
The restrict qualifier was added in
.St -isoC-99 .
Setting
.Va errno
is an
.St -p1003.1-2008
extension.
.Sh CAVEATS
On error, callers of
.Fn mbtowc
cannot tell whether the multibyte character was invalid or incomplete.
To treat incomplete data differently from invalid data the
.Xr mbrtowc 3
function can be used instead.
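.Pp
As a minimal sketch of that approach, the hypothetical helper
.Fn classify
below, which is not part of libc, relies on
.Xr mbrtowc 3
returning (size_t)-2 for an incomplete character
and (size_t)-1 for an invalid one.
The enumeration names are made up for this illustration,
and the sketch assumes that
.Fa s
starts in the initial conversion state.
.Bd -literal
```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical result codes, invented for this sketch. */
enum mb_status { MB_COMPLETE, MB_INCOMPLETE, MB_INVALID };

/*
 * Hypothetical helper, not part of libc: classify the first
 * multibyte character in the n bytes pointed to by s.
 */
enum mb_status
classify(const char *s, size_t n)
{
	mbstate_t	 mbs;
	size_t		 rv;

	assert(s != NULL);
	/* A zeroed mbstate_t describes the initial conversion state. */
	memset(&mbs, 0, sizeof(mbs));
	rv = mbrtowc(NULL, s, n, &mbs);
	if (rv == (size_t)-2)
		return MB_INCOMPLETE;	/* more bytes are needed */
	if (rv == (size_t)-1)
		return MB_INVALID;	/* errno is set to EILSEQ */
	return MB_COMPLETE;
}
```
.Ed
.Pp
A caller reading from a stream could append more input on
MB_INCOMPLETE and report an error only on MB_INVALID.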