# utf8.h #

[![Build status](https://ci.appveyor.com/api/projects/status/phfjjahhs9j4gxvs?svg=true)](https://ci.appveyor.com/project/sheredom/utf8-h)

[![Build status](https://api.travis-ci.org/repositories/sheredom/utf8.h.svg)](https://travis-ci.org/sheredom/utf8.h)

A simple single-header solution for supporting utf8 strings in C and C++.

Functions provided from the C header string.h but with a utf8* prefix instead of the str* prefix:

[API function docs](#api-function-docs)

string.h | utf8.h | complete
---------|--------|---------
strcat | utf8cat | ✔
strchr | utf8chr | ✔
strcmp | utf8cmp | ✔
strcoll | utf8coll |
strcpy | utf8cpy | ✔
strcspn | utf8cspn | ✔
strdup | utf8dup | ✔
strfry | utf8fry |
strlen | utf8len | ✔
strncat | utf8ncat | ✔
strncmp | utf8ncmp | ✔
strncpy | utf8ncpy | ✔
strndup | utf8ndup | ✔
strpbrk | utf8pbrk | ✔
strrchr | utf8rchr | ✔
strsep | utf8sep |
strspn | utf8spn | ✔
strstr | utf8str | ✔
strtok | utf8tok |
strxfrm | utf8xfrm |

Functions provided from the C header strings.h but with a utf8* prefix instead of the str* prefix:

strings.h | utf8.h | complete
----------|--------|---------
strcasecmp | utf8casecmp | ~~✔~~
strncasecmp | utf8ncasecmp | ~~✔~~
strcasestr | utf8casestr | ~~✔~~

Functions provided that are unique to utf8.h:

utf8.h | complete
-------|---------
utf8codepoint | ✔
utf8size | ✔
utf8valid | ✔
utf8codepointsize | ✔
utf8catcodepoint | ✔
utf8isupper | ~~✔~~
utf8islower | ~~✔~~
utf8lwr | ~~✔~~
utf8upr | ~~✔~~
utf8lwrcodepoint | ~~✔~~
utf8uprcodepoint | ~~✔~~

## Usage ##

Just include utf8.h in your code!

The currently supported compilers are gcc, clang and msvc.

The currently tested compiler versions are gcc 4.8.2, clang 3.5 and MSVC 18.0.21005.1.

## Design ##

The utf8.h API matches the string.h API as much as possible by design. There are a few major differences though.

I use void* instead of char* when passing around utf8 strings. My reasoning is that I really don't want people accidentally thinking they can use integer arithmetic on the pointer and always get a valid character like you would with an ASCII string. Having it as a void* forces a user to explicitly cast the utf8 string to char*, such that the onus is on them not to break the code anymore!

Anywhere in the string.h or strings.h documentation where it refers to 'bytes' I have changed that to utf8 codepoints. For instance, utf8len will return the number of utf8 codepoints in a utf8 string - which does not necessarily equate to the number of bytes.

## API function docs ##

```c
int utf8casecmp(const void *src1, const void *src2);
```
Return less than 0, 0, greater than 0 if `src1 < src2`, `src1 == src2`,
`src1 > src2` respectively, case insensitive.

```c
void *utf8cat(void *dst, const void *src);
```
Append the utf8 string `src` onto the utf8 string `dst`.

```c
void *utf8chr(const void *src, long chr);
```
Find the first match of the utf8 codepoint `chr` in the utf8 string `src`.

```c
int utf8cmp(const void *src1, const void *src2);
```
Return less than 0, 0, greater than 0 if `src1 < src2`,
`src1 == src2`, `src1 > src2` respectively.

```c
void *utf8cpy(void *dst, const void *src);
```
Copy the utf8 string `src` onto the memory allocated in `dst`.

```c
size_t utf8cspn(const void *src, const void *reject);
```
Number of utf8 codepoints in the initial segment of the utf8 string `src`
that consists entirely of utf8 codepoints not from the utf8 string `reject`.

```c
void *utf8dup(const void *src);
```
Duplicate the utf8 string `src` by getting its size, `malloc`ing a new buffer,
copying over the data, and returning that. Returns 0 if `malloc` failed.

```c
size_t utf8len(const void *str);
```
Number of utf8 codepoints in the utf8 string `str`,
**excluding** the null terminating byte.

```c
int utf8ncasecmp(const void *src1, const void *src2, size_t n);
```
Return less than 0, 0, greater than 0 if `src1 < src2`, `src1 == src2`,
`src1 > src2` respectively, case insensitive. Checking at most `n`
bytes of each utf8 string.

```c
void *utf8ncat(void *dst, const void *src, size_t n);
```
Append the utf8 string `src` onto the utf8 string `dst`,
writing at most `n+1` bytes. Can produce an invalid utf8
string if `n` falls partway through a utf8 codepoint.

```c
int utf8ncmp(const void *src1, const void *src2, size_t n);
```
Return less than 0, 0, greater than 0 if `src1 < src2`,
`src1 == src2`, `src1 > src2` respectively. Checking at most `n`
bytes of each utf8 string.

```c
void *utf8ncpy(void *dst, const void *src, size_t n);
```
Copy the utf8 string `src` onto the memory allocated in `dst`.
Copies at most `n` bytes. If there is no terminating null byte in
the first `n` bytes of `src`, the string placed into `dst` will not be
null-terminated. If the size (in bytes) of `src` is less than `n`,
extra null terminating bytes are appended to `dst` such that a
total of `n` bytes are written. Can produce an invalid utf8
string if `n` falls partway through a utf8 codepoint.

```c
void *utf8pbrk(const void *str, const void *accept);
```
Locates the first occurrence in the utf8 string `str` of any byte in the
utf8 string `accept`, or 0 if no match was found.

```c
void *utf8rchr(const void *src, int chr);
```
Find the last match of the utf8 codepoint `chr` in the utf8 string `src`.

```c
size_t utf8size(const void *str);
```
Number of bytes in the utf8 string `str`,
including the null terminating byte.
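
To make the difference between a codepoint count (as returned by `utf8len`) and a byte count (as returned by `utf8size`) concrete, here is a minimal standalone sketch that does not use utf8.h itself. The helper names `sketch_codepoints` and `sketch_bytes` are hypothetical, the input is assumed to be valid, null-terminated utf8, and this is not the library's implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of utf8.h): count the codepoints in a valid,
   null-terminated utf8 string by counting every byte that is not a
   continuation byte (continuation bytes have the bit pattern 10xxxxxx). */
static size_t sketch_codepoints(const void *str) {
  const unsigned char *s = (const unsigned char *)str;
  size_t count = 0;
  assert(s != NULL);
  for (; *s != '\0'; s++) {
    if ((*s & 0xC0) != 0x80) {
      count++;
    }
  }
  return count;
}

/* Hypothetical helper (not part of utf8.h): count the bytes in the string,
   including the null terminating byte. */
static size_t sketch_bytes(const void *str) {
  const unsigned char *s = (const unsigned char *)str;
  size_t bytes = 0;
  assert(s != NULL);
  while (s[bytes] != '\0') {
    bytes++;
  }
  return bytes + 1;
}
```

For the string `"h\xC3\xA9llo"` (where 'é' is encoded as two bytes), `sketch_codepoints` gives 5 while `sketch_bytes` gives 7: six bytes of text plus the null terminating byte.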

```c
size_t utf8spn(const void *src, const void *accept);
```
Number of utf8 codepoints in the initial segment of the utf8 string `src`
that consists entirely of utf8 codepoints from the utf8 string `accept`.

```c
void *utf8str(const void *haystack, const void *needle);
```
The position of the utf8 string `needle` in the utf8 string `haystack`.

```c
void *utf8casestr(const void *haystack, const void *needle);
```
The position of the utf8 string `needle` in the utf8 string `haystack`,
case insensitive.

```c
void *utf8valid(const void *str);
```
Return 0 on success, or the position of the invalid utf8 codepoint on failure.

```c
void *utf8codepoint(const void *str, long *out_codepoint);
```
Sets `out_codepoint` to the next utf8 codepoint in `str`,
and returns the address of the utf8 codepoint after the current one in `str`.

```c
utf8_weak size_t utf8codepointsize(utf8_int32_t chr);
```
Returns the size of the given codepoint in bytes.

```c
utf8_nonnull utf8_weak void *utf8catcodepoint(void *utf8_restrict str,
                                              utf8_int32_t chr, size_t n);
```
Write a codepoint to the given string, and return the address of the next
place after the written codepoint. Pass the number of bytes left in the
buffer as `n`. If there is not enough space for the codepoint, this function
returns null.

```c
utf8_weak int utf8islower(utf8_int32_t chr);
```
Returns 1 if the given character is lowercase, or 0 if it is not.

```c
utf8_weak int utf8isupper(utf8_int32_t chr);
```
Returns 1 if the given character is uppercase, or 0 if it is not.

```c
utf8_nonnull utf8_weak void utf8lwr(void *utf8_restrict str);
```
Transform the given string into all lowercase codepoints.

```c
utf8_nonnull utf8_weak void utf8upr(void *utf8_restrict str);
```
Transform the given string into all uppercase codepoints.
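
The byte counts that `utf8codepointsize` and `utf8catcodepoint` deal with follow from the standard utf8 encoding ranges. The following is a minimal standalone sketch of that idea, not the library's implementation: the helper names `sketch_codepoint_size` and `sketch_write_codepoint` are hypothetical, `long` stands in for `utf8_int32_t`, and the codepoint is assumed to lie between 0 and 0x10FFFF.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of utf8.h): number of bytes needed to encode
   a codepoint in utf8. Assumes 0 <= cp <= 0x10FFFF. */
static size_t sketch_codepoint_size(long cp) {
  if (cp < 0x80) return 1;    /* 0xxxxxxx */
  if (cp < 0x800) return 2;   /* 110xxxxx 10xxxxxx */
  if (cp < 0x10000) return 3; /* 1110xxxx 10xxxxxx 10xxxxxx */
  return 4;                   /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
}

/* Hypothetical helper (not part of utf8.h): write cp into buf, where n is the
   number of bytes left in buf. Returns the address just past the written
   bytes, or NULL if the codepoint does not fit. No null byte is written. */
static void *sketch_write_codepoint(void *buf, long cp, size_t n) {
  unsigned char *out = (unsigned char *)buf;
  size_t size = sketch_codepoint_size(cp);
  assert(buf != NULL);
  if (n < size) {
    return NULL;
  }
  switch (size) {
  case 1:
    out[0] = (unsigned char)cp;
    break;
  case 2:
    out[0] = (unsigned char)(0xC0 | (cp >> 6));
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));
    break;
  case 3:
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    break;
  default:
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    break;
  }
  return out + size;
}
```

With a 4-byte buffer, writing U+20AC (the euro sign) stores the bytes `0xE2 0x82 0xAC` and returns the address three bytes past the start of the buffer; with only 2 bytes left, the sketch returns `NULL` instead of writing a partial codepoint.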

```c
utf8_weak utf8_int32_t utf8lwrcodepoint(utf8_int32_t cp);
```
Make a codepoint lower case if possible.

```c
utf8_weak utf8_int32_t utf8uprcodepoint(utf8_int32_t cp);
```
Make a codepoint upper case if possible.

## Codepoint Case ##

Various functions provided will do case insensitive compares, or transform utf8
strings from one case to another. Given the vastness of unicode, and the author's
lack of understanding beyond latin codepoints on whether case means anything,
the following categories are the only ones that will be checked in case
insensitive code:

* [ASCII](https://en.wikipedia.org/wiki/Basic_Latin_(Unicode_block))
* [Latin-1 Supplement](https://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block))
* [Latin Extended-A](https://en.wikipedia.org/wiki/Latin_Extended-A)
* [Latin Extended-B](https://en.wikipedia.org/wiki/Latin_Extended-B)
* [Greek and Coptic](https://en.wikipedia.org/wiki/Greek_and_Coptic)

## Todo ##

- Implement utf8coll (akin to strcoll).
- Implement utf8fry (akin to strfry).
- ~~Add NULL pointer support. Should I NULL check the arguments to the API?~~
- Add Doxygen (or similar) to mimic the Unix man pages for string.h.
- Investigate adding dst buffer sizes for utf8cpy and utf8cat to catch overwrites (as suggested by [@FlohOfWoe](https://twitter.com/FlohOfWoe) in https://twitter.com/FlohOfWoe/status/618669237771608064)
- Investigate adding a utf8canon which would turn 'bad' utf8 sequences (like ASCII values encoded in 4-byte utf8 codepoints) into their 'good' equivalents (as suggested by [@KmBenzie](https://twitter.com/KmBenzie))
- Investigate changing to [Creative Commons Zero License](http://creativecommons.org/publicdomain/zero/1.0/legalcode.txt) (as suggested by [@mcclure111](https://twitter.com/mcclure111))

## License ##

This is free and unencumbered software released into the public domain.

Anyone is free to copy, modify, publish, use, compile, sell, or
distribute this software, either in source code form or as a compiled
binary, for any purpose, commercial or non-commercial, and by any
means.

In jurisdictions that recognize copyright laws, the author or authors
of this software dedicate any and all copyright interest in the
software to the public domain. We make this dedication for the benefit
of the public at large and to the detriment of our heirs and
successors. We intend this dedication to be an overt act of
relinquishment in perpetuity of all present and future rights to this
software under copyright law.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.

For more information, please refer to <http://unlicense.org/>