1# utf8.h #
2
3[![Build status](https://ci.appveyor.com/api/projects/status/phfjjahhs9j4gxvs?svg=true)](https://ci.appveyor.com/project/sheredom/utf8-h)
4
5[![Build status](https://api.travis-ci.org/repositories/sheredom/utf8.h.svg)](https://travis-ci.org/sheredom/utf8.h)
6
A simple single-header solution for supporting utf8 strings in C and C++.
8
9Functions provided from the C header string.h but with a utf8* prefix instead of the str* prefix:
10
11[API function docs](#api-function-docs)
12
13string.h | utf8.h | complete
14---------|--------|---------
15strcat | utf8cat | ✔
16strchr | utf8chr | ✔
17strcmp | utf8cmp | ✔
18strcoll | utf8coll |
19strcpy | utf8cpy | ✔
20strcspn | utf8cspn | ✔
21strdup | utf8dup | ✔
22strfry | utf8fry |
23strlen | utf8len | ✔
24strncat | utf8ncat | ✔
25strncmp | utf8ncmp | ✔
26strncpy | utf8ncpy | ✔
27strndup | utf8ndup | ✔
28strpbrk | utf8pbrk | ✔
29strrchr | utf8rchr | ✔
30strsep | utf8sep |
31strspn | utf8spn | ✔
32strstr | utf8str | ✔
33strtok | utf8tok |
34strxfrm | utf8xfrm |
35
36Functions provided from the C header strings.h but with a utf8* prefix instead of the str* prefix:
37
38strings.h | utf8.h | complete
39----------|--------|---------
40strcasecmp | utf8casecmp | ~~✔~~
41strncasecmp | utf8ncasecmp | ~~✔~~
42strcasestr | utf8casestr | ~~✔~~
43
44Functions provided that are unique to utf8.h:
45
46utf8.h | complete
47-------|---------
48utf8codepoint | ✔
49utf8size | ✔
50utf8valid | ✔
51utf8codepointsize | ✔
52utf8catcodepoint | ✔
utf8isupper | ~~✔~~
54utf8islower | ~~✔~~
55utf8lwr | ~~✔~~
56utf8upr | ~~✔~~
57utf8lwrcodepoint | ~~✔~~
58utf8uprcodepoint | ~~✔~~
59
60## Usage ##
61
62Just include utf8.h in your code!
63
The currently supported compilers are gcc, clang and msvc.

The compiler versions currently tested are gcc 4.8.2, clang 3.5 and MSVC 18.0.21005.1.
67
68## Design ##
69
70The utf8.h API matches the string.h API as much as possible by design. There are a few major differences though.
71
I use void* instead of char* when passing around utf8 strings. My reasoning is that I really don't want people accidentally thinking they can use integer arithmetic on the pointer and always get a valid character like you would with an ASCII string. Having it as a void* forces a user to explicitly cast the utf8 string to char*, such that the onus is on them not to break the code!
73
74Anywhere in the string.h or strings.h documentation where it refers to 'bytes' I have changed that to utf8 codepoints. For instance, utf8len will return the number of utf8 codepoints in a utf8 string - which does not necessarily equate to the number of bytes.
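
To make the byte/codepoint distinction concrete, here is a minimal sketch in plain C. The helpers `sketch_codepoint_count` and `sketch_byte_size` are hypothetical names of mine, not part of utf8.h, and they assume the input is already valid utf8.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of utf8.h: counts the codepoints in a
   null-terminated string that is assumed to be valid utf8. */
static size_t sketch_codepoint_count(const char *s) {
  size_t count = 0;
  for (; *s != '\0'; ++s) {
    /* Continuation bytes look like 10xxxxxx; any other byte starts a
       new codepoint. */
    if (((unsigned char)*s & 0xC0) != 0x80) {
      ++count;
    }
  }
  return count;
}

/* Hypothetical helper: the size in bytes, including the null terminator. */
static size_t sketch_byte_size(const char *s) {
  return strlen(s) + 1;
}
```

For the string `"h\xC3\xA9llo"` (héllo) the first helper gives 5 codepoints while the second gives 7 bytes, which is why stepping through a utf8 string one byte at a time is not safe.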
75
76## API function docs ##
77
78```c
79int utf8casecmp(const void *src1, const void *src2);
80```
81Return less than 0, 0, greater than 0 if `src1 < src2`, `src1 == src2`,
82`src1 > src2` respectively, case insensitive.
83
84```c
85void *utf8cat(void *dst, const void *src);
86```
87Append the utf8 string `src` onto the utf8 string `dst`.
88
89```c
90void *utf8chr(const void *src, long chr);
91```
92Find the first match of the utf8 codepoint `chr` in the utf8 string `src`.
93
94```c
95int utf8cmp(const void *src1, const void *src2);
96```
97Return less than 0, 0, greater than 0 if `src1 < src2`,
98`src1 == src2`, `src1 > src2` respectively.
99
100```c
101void *utf8cpy(void *dst, const void *src);
102```
103Copy the utf8 string `src` onto the memory allocated in `dst`.
104
105```c
106size_t utf8cspn(const void *src, const void *reject);
107```
Number of utf8 codepoints at the start of the utf8 string `src` that consist
entirely of utf8 codepoints not from the utf8 string `reject`.
110
111```c
112void *utf8dup(const void *src);
113```
Duplicate the utf8 string `src` by getting its size, `malloc`ing a new buffer,
copying over the data, and returning that. Returns 0 if `malloc` failed.
116
117```c
118size_t utf8len(const void *str);
119```
120Number of utf8 codepoints in the utf8 string `str`,
121**excluding** the null terminating byte.
122
123```c
124int utf8ncasecmp(const void *src1, const void *src2, size_t n);
125```
126Return less than 0, 0, greater than 0 if `src1 < src2`, `src1 == src2`,
127`src1 > src2` respectively, case insensitive. Checking at most `n`
128bytes of each utf8 string.
129
130```c
131void *utf8ncat(void *dst, const void *src, size_t n);
132```
133Append the utf8 string `src` onto the utf8 string `dst`,
134writing at most `n+1` bytes. Can produce an invalid utf8
135string if `n` falls partway through a utf8 codepoint.
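
If you want to avoid cutting a codepoint in half, one option is to round `n` down to a codepoint boundary before calling a byte-limited function. This is only a sketch under my own assumptions: `sketch_whole_codepoint_prefix` is a hypothetical helper, not part of utf8.h, and it assumes `s` is null-terminated, valid utf8.

```c
#include <stddef.h>

/* Hypothetical helper, not part of utf8.h: returns the largest byte
   count <= n that ends on a codepoint boundary of `s`. */
static size_t sketch_whole_codepoint_prefix(const char *s, size_t n) {
  size_t end = 0; /* bytes known to form whole codepoints */
  size_t i = 0;
  while (i < n && s[i] != '\0') {
    ++i;
    /* A boundary sits at offset i when the byte there is not a
       continuation byte (10xxxxxx). The null terminator counts too. */
    if (((unsigned char)s[i] & 0xC0) != 0x80) {
      end = i;
    }
  }
  return end;
}
```

With `"a\xE2\x82\xAC"` (a€), a limit of 2 or 3 bytes is rounded down to 1, and a limit of 4 is kept as 4.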
136
137```c
138int utf8ncmp(const void *src1, const void *src2, size_t n);
139```
140Return less than 0, 0, greater than 0 if `src1 < src2`,
141`src1 == src2`, `src1 > src2` respectively. Checking at most `n`
142bytes of each utf8 string.
143
144```c
145void *utf8ncpy(void *dst, const void *src, size_t n);
146```
147Copy the utf8 string `src` onto the memory allocated in `dst`.
148Copies at most `n` bytes. If there is no terminating null byte in
149the first `n` bytes of `src`, the string placed into `dst` will not be
150null-terminated. If the size (in bytes) of `src` is less than `n`,
extra null terminating bytes are appended to `dst` such that a
total of `n` bytes are written. Can produce an invalid utf8
153string if `n` falls partway through a utf8 codepoint.
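
The padding and missing-terminator cases are the same ones `strncpy` from string.h has, so they can be observed without utf8.h at all. The sketch below uses plain `strncpy`; `sketch_nulls_after_strncpy` is a hypothetical helper of mine that reports how many of the first `n` destination bytes ended up as null bytes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies `src` with strncpy into a scratch buffer
   and counts the null bytes among the first n bytes written. */
static size_t sketch_nulls_after_strncpy(const char *src, size_t n) {
  char dst[16];
  size_t i, nulls = 0;
  if (n > sizeof dst) {
    n = sizeof dst; /* keep the demo inside the scratch buffer */
  }
  memset(dst, 'x', sizeof dst); /* sentinel fill: no stray null bytes */
  strncpy(dst, src, n);
  for (i = 0; i < n; ++i) {
    if (dst[i] == '\0') {
      ++nulls;
    }
  }
  return nulls;
}
```

Copying `"abcdef"` with `n` of 3 leaves no null byte at all (the result is not terminated), while copying `"ab"` with `n` of 5 writes three null bytes of padding.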
154
155```c
156void *utf8pbrk(const void *str, const void *accept);
157```
Locates the first occurrence in the utf8 string `str` of any byte in the
159utf8 string `accept`, or 0 if no match was found.
160
161```c
162void *utf8rchr(const void *src, int chr);
163```
164Find the last match of the utf8 codepoint `chr` in the utf8 string `src`.
165
166```c
167size_t utf8size(const void *str);
168```
169Number of bytes in the utf8 string `str`,
170including the null terminating byte.
171
172```c
173size_t utf8spn(const void *src, const void *accept);
174```
Number of utf8 codepoints at the start of the utf8 string `src` that consist
entirely of utf8 codepoints from the utf8 string `accept`.
177
178```c
179void *utf8str(const void *haystack, const void *needle);
180```
181The position of the utf8 string `needle` in the utf8 string `haystack`.
182
183```c
184void *utf8casestr(const void *haystack, const void *needle);
185```
186The position of the utf8 string `needle` in the utf8 string `haystack`,
187case insensitive.
188
189```c
190void *utf8valid(const void *str);
191```
192Return 0 on success, or the position of the invalid utf8 codepoint on failure.
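
For a feel of what validation involves, here is a simplified structural check in plain C. It is not utf8valid's implementation: `sketch_first_invalid` is a hypothetical helper that only checks lead bytes and continuation bytes, and it deliberately does not reject overlong encodings, surrogates, or codepoints above U+10FFFF.

```c
#include <stddef.h>

/* Hypothetical helper, not utf8.h's utf8valid: returns NULL if every
   sequence in `s` has a plausible lead byte followed by the right
   number of continuation bytes, else a pointer to the offending
   lead byte. */
static const char *sketch_first_invalid(const char *s) {
  const unsigned char *p = (const unsigned char *)s;
  while (*p != 0) {
    size_t extra, i;
    if (*p < 0x80) {
      extra = 0; /* 0xxxxxxx */
    } else if ((*p & 0xE0) == 0xC0) {
      extra = 1; /* 110xxxxx */
    } else if ((*p & 0xF0) == 0xE0) {
      extra = 2; /* 1110xxxx */
    } else if ((*p & 0xF8) == 0xF0) {
      extra = 3; /* 11110xxx */
    } else {
      return (const char *)p; /* stray continuation or illegal byte */
    }
    for (i = 1; i <= extra; ++i) {
      /* The null terminator fails this test as well, so the loop
         never reads past the end of the string. */
      if ((p[i] & 0xC0) != 0x80) {
        return (const char *)p;
      }
    }
    p += extra + 1;
  }
  return NULL;
}
```

A well-formed string such as `"ok \xE2\x82\xAC"` yields `NULL`; a lone `"\x80"` or a truncated `"a\xE2\x82"` yields a non-null pointer.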
193
194```c
195void *utf8codepoint(const void *str, long *out_codepoint);
196```
Sets `out_codepoint` to the next utf8 codepoint in `str`,
198and returns the address of the utf8 codepoint after the current one in `str`.
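
The decode-and-advance pattern can be sketched in plain C as follows. Both `sketch_decode` and `sketch_codepoint_at` are hypothetical helpers, not the utf8.h implementation, and they assume the input is valid utf8.

```c
#include <stddef.h>

/* Hypothetical helper: decodes the codepoint whose lead byte is at `s`
   into *out and returns a pointer to the byte after it. */
static const char *sketch_decode(const char *s, long *out) {
  const unsigned char *p = (const unsigned char *)s;
  if (p[0] < 0x80) { /* 1 byte: 0xxxxxxx */
    *out = p[0];
    return s + 1;
  }
  if ((p[0] & 0xE0) == 0xC0) { /* 2 bytes: 110xxxxx 10xxxxxx */
    *out = ((long)(p[0] & 0x1F) << 6) | (p[1] & 0x3F);
    return s + 2;
  }
  if ((p[0] & 0xF0) == 0xE0) { /* 3 bytes */
    *out = ((long)(p[0] & 0x0F) << 12) | ((long)(p[1] & 0x3F) << 6) |
           (p[2] & 0x3F);
    return s + 3;
  }
  /* 4 bytes */
  *out = ((long)(p[0] & 0x07) << 18) | ((long)(p[1] & 0x3F) << 12) |
         ((long)(p[2] & 0x3F) << 6) | (p[3] & 0x3F);
  return s + 4;
}

/* Usage: walk the string codepoint by codepoint and return the one at
   position `index` (counted in codepoints), or -1 if `s` is shorter. */
static long sketch_codepoint_at(const char *s, size_t index) {
  long cp = -1;
  while (*s != '\0') {
    s = sketch_decode(s, &cp);
    if (index-- == 0) {
      return cp;
    }
  }
  return -1;
}
```

The loop in `sketch_codepoint_at` is the shape I'd expect most callers to use: feed the returned pointer back in until the null terminator is reached.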
199
200```c
201utf8_weak size_t utf8codepointsize(utf8_int32_t chr);
202```
203Returns the size of the given codepoint in bytes.
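
The size follows directly from the utf8 encoding ranges. A minimal sketch, assuming `cp` is a valid codepoint between 0 and U+10FFFF (`sketch_codepoint_size` is a hypothetical name, not the utf8.h function):

```c
#include <stddef.h>

/* Hypothetical helper: number of bytes utf8 needs to encode `cp`.
   Assumes 0 <= cp <= 0x10FFFF. */
static size_t sketch_codepoint_size(long cp) {
  if (cp < 0x80) {
    return 1; /* U+0000..U+007F */
  }
  if (cp < 0x800) {
    return 2; /* U+0080..U+07FF */
  }
  if (cp < 0x10000) {
    return 3; /* U+0800..U+FFFF */
  }
  return 4; /* U+10000..U+10FFFF */
}
```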
204
205```c
206utf8_nonnull utf8_weak void *utf8catcodepoint(void *utf8_restrict str,
207                                              utf8_int32_t chr, size_t n);
208```
Write a codepoint to the given string, and return the address of the next
place after the written codepoint. Pass the number of bytes left in the buffer
as `n`. If there is not enough space for the codepoint, this function returns
null.
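
The bounds-checked write can be sketched in plain C like this. `sketch_append_codepoint` and `sketch_bytes_written` are hypothetical helpers of mine, not the utf8.h implementation, and they assume `cp` is a valid codepoint.

```c
#include <stddef.h>

/* Hypothetical helper: encodes `cp` at `dst` if it fits in `n` bytes and
   returns the address after it, or NULL if there is not enough room. */
static char *sketch_append_codepoint(char *dst, long cp, size_t n) {
  unsigned char *p = (unsigned char *)dst;
  if (cp < 0x80) {
    if (n < 1) return NULL;
    p[0] = (unsigned char)cp;
    return dst + 1;
  }
  if (cp < 0x800) {
    if (n < 2) return NULL;
    p[0] = (unsigned char)(0xC0 | (cp >> 6));
    p[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return dst + 2;
  }
  if (cp < 0x10000) {
    if (n < 3) return NULL;
    p[0] = (unsigned char)(0xE0 | (cp >> 12));
    p[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    p[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return dst + 3;
  }
  if (n < 4) return NULL;
  p[0] = (unsigned char)(0xF0 | (cp >> 18));
  p[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
  p[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
  p[3] = (unsigned char)(0x80 | (cp & 0x3F));
  return dst + 4;
}

/* Usage: bytes written for `cp` when `n` bytes remain, 0 if it does
   not fit. */
static size_t sketch_bytes_written(long cp, size_t n) {
  char buf[4];
  char *next = sketch_append_codepoint(buf, cp, n > sizeof buf ? sizeof buf : n);
  return next == NULL ? 0 : (size_t)(next - buf);
}
```

Checking the remaining space before writing anything means a failed call leaves the buffer untouched, so the caller can grow it and retry.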
213
```c
215utf8_weak int utf8islower(utf8_int32_t chr);
216```
217Returns 1 if the given character is lowercase, or 0 if it is not.
218
219```c
220utf8_weak int utf8isupper(utf8_int32_t chr);
221```
222Returns 1 if the given character is uppercase, or 0 if it is not.
223
224```c
225utf8_nonnull utf8_weak void utf8lwr(void *utf8_restrict str);
226```
227Transform the given string into all lowercase codepoints.
228
229```c
230utf8_nonnull utf8_weak void utf8upr(void *utf8_restrict str);
231```
232Transform the given string into all uppercase codepoints.
233
234```c
235utf8_weak utf8_int32_t utf8lwrcodepoint(utf8_int32_t cp);
236```
237Make a codepoint lower case if possible.
238
239```c
240utf8_weak utf8_int32_t utf8uprcodepoint(utf8_int32_t cp);
241```
242Make a codepoint upper case if possible.
243
## Codepoint Case ##
245
246Various functions provided will do case insensitive compares, or transform utf8
strings from one case to another. Given the vastness of unicode, and the author's
248lack of understanding beyond latin codepoints on whether case means anything,
249the following categories are the only ones that will be checked in case
250insensitive code:
251
252* [ASCII](https://en.wikipedia.org/wiki/Basic_Latin_(Unicode_block))
253* [Latin-1 Supplement](https://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block))
254* [Latin Extended-A](https://en.wikipedia.org/wiki/Latin_Extended-A)
255* [Latin Extended-B](https://en.wikipedia.org/wiki/Latin_Extended-B)
256* [Greek and Coptic](https://en.wikipedia.org/wiki/Greek_and_Coptic)
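
To illustrate what range-limited case mapping looks like, here is a sketch covering only the first two blocks in the list (ASCII and Latin-1 Supplement). `sketch_lower` is a hypothetical helper, not the utf8.h implementation, and the other three blocks are left out to keep it short.

```c
/* Hypothetical helper: lowercases a codepoint if it is an uppercase
   letter in ASCII or the Latin-1 Supplement, otherwise returns it
   unchanged. */
static long sketch_lower(long cp) {
  /* ASCII: 'A'..'Z' map to 'a'..'z', 32 codepoints higher. */
  if (cp >= 'A' && cp <= 'Z') {
    return cp + 32;
  }
  /* Latin-1 Supplement: U+00C0..U+00DE map to U+00E0..U+00FE, except
     U+00D7, which is the multiplication sign and has no case. */
  if (cp >= 0xC0 && cp <= 0xDE && cp != 0xD7) {
    return cp + 32;
  }
  return cp;
}
```

Codepoints outside the handled ranges pass through unchanged, which matches the idea above that only the listed categories are checked.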
257
258## Todo ##
259
260- Implement utf8coll (akin to strcoll).
261- Implement utf8fry (akin to strfry).
262- ~~Add NULL pointer support. Should I NULL check the arguments to the API?~~
263- Add Doxygen (or similar) to mimic the Unix man pages for string.h.
264- Investigate adding dst buffer sizes for utf8cpy and utf8cat to catch overwrites (as suggested by [@FlohOfWoe](https://twitter.com/FlohOfWoe) in https://twitter.com/FlohOfWoe/status/618669237771608064)
265- Investigate adding a utf8canon which would turn 'bad' utf8 sequences (like ASCII values encoded in 4-byte utf8 codepoints) into their 'good' equivalents (as suggested by [@KmBenzie](https://twitter.com/KmBenzie))
266- Investigate changing to [Creative Commons Zero License](http://creativecommons.org/publicdomain/zero/1.0/legalcode.txt) (as suggested by [@mcclure111](https://twitter.com/mcclure111))
267
268## License ##
269
270This is free and unencumbered software released into the public domain.
271
272Anyone is free to copy, modify, publish, use, compile, sell, or
273distribute this software, either in source code form or as a compiled
274binary, for any purpose, commercial or non-commercial, and by any
275means.
276
277In jurisdictions that recognize copyright laws, the author or authors
278of this software dedicate any and all copyright interest in the
279software to the public domain. We make this dedication for the benefit
280of the public at large and to the detriment of our heirs and
281successors. We intend this dedication to be an overt act of
282relinquishment in perpetuity of all present and future rights to this
283software under copyright law.
284
285THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
286EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
287MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
288IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
289OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
290ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
291OTHER DEALINGS IN THE SOFTWARE.
292
293For more information, please refer to <http://unlicense.org/>
294