1# UTF8-CPP: UTF-8 with C++ in a Portable Way
2
3
4## Introduction
5
6C++ developers miss an easy and portable way of handling Unicode encoded strings. The original C++ Standard (known as C++98 or C++03) is Unicode agnostic. C++11 provides some support for Unicode on core language and library level: u8, u, and U character and string literals, char16_t and char32_t character types, u16string and u32string library classes, and codecvt support for conversions between Unicode encoding forms. In the meantime, developers use third party libraries like ICU, OS specific capabilities, or simply roll out their own solutions.
7
8In order to easily handle UTF-8 encoded Unicode strings, I came up with a small, C++98 compatible generic library. For anybody used to work with STL algorithms and iterators, it should be easy and natural to use. The code is freely available for any purpose - check out the [license](./LICENSE). The library has been used a lot in the past ten years both in commercial and open-source projects and is considered feature-complete now. If you run into bugs or performance issues, please let me know and I'll do my best to address them.
9
10The purpose of this article is not to offer an introduction to Unicode in general, and UTF-8 in particular. If you are not familiar with Unicode, be sure to check out [Unicode Home Page](http://www.unicode.org/) or some other source of information for Unicode. Also, it is not my aim to advocate the use of UTF-8 encoded strings in C++ programs; if you want to handle UTF-8 encoded strings from C++, I am sure you have good reasons for it.
11
12## Examples of use
13
14### Introductionary Sample
15
16To illustrate the use of the library, let's start with a small but complete program that opens a file containing UTF-8 encoded text, reads it line by line, checks each line for invalid UTF-8 byte sequences, and converts it to UTF-16 encoding and back to UTF-8:
17
18```cpp
19#include <fstream>
20#include <iostream>
21#include <string>
22#include <vector>
23#include "utf8.h"
24using namespace std;
25int main(int argc, char** argv)
26{
27    if (argc != 2) {
28        cout << "\nUsage: docsample filename\n";
29        return 0;
30    }
31    const char* test_file_path = argv[1];
32    // Open the test file (must be UTF-8 encoded)
33    ifstream fs8(test_file_path);
34    if (!fs8.is_open()) {
35        cout << "Could not open " << test_file_path << endl;
36        return 0;
37    }
38
39    unsigned line_count = 1;
40    string line;
41    // Play with all the lines in the file
42    while (getline(fs8, line)) {
43        // check for invalid utf-8 (for a simple yes/no check, there is also utf8::is_valid function)
44#if __cplusplus >= 201103L // C++ 11 or later
45        auto end_it = utf8::find_invalid(line.begin(), line.end());
46#else
47        string::iterator end_it = utf8::find_invalid(line.begin(), line.end());
48#endif // C++ 11
49        if (end_it != line.end()) {
50            cout << "Invalid UTF-8 encoding detected at line " << line_count << "\n";
51            cout << "This part is fine: " << string(line.begin(), end_it) << "\n";
52        }
53        // Get the line length (at least for the valid part)
54        int length = utf8::distance(line.begin(), end_it);
55        cout << "Length of line " << line_count << " is " << length <<  "\n";
56
57        // Convert it to utf-16
58#if __cplusplus >= 201103L // C++ 11 or later
59        u16string utf16line = utf8::utf8to16(line);
60#else
61        vector<unsigned short> utf16line;
62        utf8::utf8to16(line.begin(), end_it, back_inserter(utf16line));
63#endif // C++ 11
64        // And back to utf-8;
65#if __cplusplus >= 201103L // C++ 11 or later
66        string utf8line = utf8::utf16to8(utf16line);
67#else
68        string utf8line;
69        utf8::utf16to8(utf16line.begin(), utf16line.end(), back_inserter(utf8line));
70#endif // C++ 11
71        // Confirm that the conversion went OK:
72        if (utf8line != string(line.begin(), end_it))
73            cout << "Error in UTF-16 conversion at line: " << line_count << "\n";
74
75        line_count++;
76    }
77
78    return 0;
79}
80```
81
82In the previous code sample, for each line we performed a detection of invalid UTF-8 sequences with `find_invalid`; the number of characters (more precisely - the number of Unicode code points, including the end of line and even BOM if there is one) in each line was determined with a use of `utf8::distance`; finally, we have converted each line to UTF-16 encoding with `utf8to16` and back to UTF-8 with `utf16to8`.
83
84Note a different pattern of usage for old compilers. For instance, this is how we convert
85a UTF-8 encoded string to a UTF-16 encoded one with a pre - C++11 compiler:
86```cpp
87    vector<unsigned short> utf16line;
88    utf8::utf8to16(line.begin(), end_it, back_inserter(utf16line));
89```
90
91With a more modern compiler, the same operation would look like:
92```cpp
93    u16string utf16line = utf8::utf8to16(line);
94```
95If `__cplusplus` macro points to a C++ 11 or later, the library exposes API that takes into
96account C++ standard Unicode strings and move semantics. With an older compiler, it is still
97possible to use the same functionality, just in a little less convenient way
98
99In case you do not trust the `__cplusplus` macro or, for instance, do not want to include
100the C++ 11 helper functions even with a modern compiler, define `UTF_CPP_CPLUSPLUS` macro
101before including `utf8.h` and assign it a value for the standard you want to use - the values are the same as for the `__cplusplus` macro. This can be also useful with compilers that are conservative in setting the `__cplusplus` macro even if they have a good support for a recent standard edition - Microsoft's Visual C++ is one example.
102
103### Checking if a file contains valid UTF-8 text
104
105Here is a function that checks whether the content of a file is valid UTF-8 encoded text without reading the content into the memory:
106
107```cpp
108bool valid_utf8_file(const char* file_name)
109{
110    ifstream ifs(file_name);
111    if (!ifs)
112        return false; // even better, throw here
113
114    istreambuf_iterator<char> it(ifs.rdbuf());
115    istreambuf_iterator<char> eos;
116
117    return utf8::is_valid(it, eos);
118}
119```
120
121Because the function `utf8::is_valid()` works with input iterators, we were able to pass an `istreambuf_iterator` to `it` and read the content of the file directly without loading it to the memory first.
122
123Note that other functions that take input iterator arguments can be used in a similar way. For instance, to read the content of a UTF-8 encoded text file and convert the text to UTF-16, just do something like:
124
125```cpp
126    utf8::utf8to16(it, eos, back_inserter(u16string));
127```
128
129### Ensure that a string contains valid UTF-8 text
130
131If we have some text that "probably" contains UTF-8 encoded text and we want to replace any invalid UTF-8 sequence with a replacement character, something like the following function may be used:
132
133```cpp
134void fix_utf8_string(std::string& str)
135{
136    std::string temp;
137    utf8::replace_invalid(str.begin(), str.end(), back_inserter(temp));
138    str = temp;
139}
140```
141
142The function will replace any invalid UTF-8 sequence with a Unicode replacement character. There is an overloaded function that enables the caller to supply their own replacement character.
143
144
145## Points of interest
146
147#### Design goals and decisions
148
149The library was designed to be:
150
1511.  Generic: for better or worse, there are many C++ string classes out there, and the library should work with as many of them as possible.
1522.  Portable: the library should be portable both accross different platforms and compilers. The only non-portable code is a small section that declares unsigned integers of different sizes: three typedefs. They can be changed by the users of the library if they don't match their platform. The default setting should work for Windows (both 32 and 64 bit), and most 32 bit and 64 bit Unix derivatives. Support for post C++03 language features is included for modern compilers at API level only, so the library should work even with pretty old compilers.
1533.  Lightweight: follow the "pay only for what you use" guideline.
1544.  Unintrusive: avoid forcing any particular design or even programming style on the user. This is a library, not a framework.
155
156#### Alternatives
157
158In case you want to look into other means of working with UTF-8 strings from C++, here is the list of solutions I am aware of:
159
1601.  [ICU Library](http://icu.sourceforge.net/). It is very powerful, complete, feature-rich, mature, and widely used. Also big, intrusive, non-generic, and doesn't play well with the Standard Library. I definitelly recommend looking at ICU even if you don't plan to use it.
1612.  C++11 language and library features. Still far from complete, and not easy to use.
1623.  [Glib::ustring](http://www.gtkmm.org/gtkmm2/docs/tutorial/html/ch03s04.html). A class specifically made to work with UTF-8 strings, and also feel like `std::string`. If you prefer to have yet another string class in your code, it may be worth a look. Be aware of the licensing issues, though.
1634.  Platform dependent solutions: Windows and POSIX have functions to convert strings from one encoding to another. That is only a subset of what my library offers, but if that is all you need it may be good enough.
164
165
166## Reference
167
168### Functions From utf8 Namespace
169
170#### utf8::append
171
172Available in version 3.0 and later. Requires a C++ 11 compliant compiler.
173
174Encodes a 32 bit code point as a UTF-8 sequence of octets and appends the sequence to a UTF-8 string.
175
176```cpp
177void append(char32_t cp, std::string& s);
178```
179
180`cp`: a code point to append to the string.
181`s`: a utf-8 encoded string to append the code point to.
182
183Example of use:
184
185```cpp
186std::string u;
187append(0x0448, u);
188assert (u[0] == char(0xd1) && u[1] == char(0x88) && u.length() == 2);
189```
190
191In case of an invalid code point, a `utf8::invalid_code_point` exception is thrown.
192
193
194#### utf8::append
195
196Available in version 1.0 and later.
197
198Encodes a 32 bit code point as a UTF-8 sequence of octets and appends the sequence to a UTF-8 string.
199
200```cpp
201template <typename octet_iterator>
202octet_iterator append(uint32_t cp, octet_iterator result);
203```
204
205`octet_iterator`: an output iterator.
206`cp`: a 32 bit integer representing a code point to append to the sequence.
207`result`: an output iterator to the place in the sequence where to append the code point.
208Return value: an iterator pointing to the place after the newly appended sequence.
209
210Example of use:
211
212```cpp
213unsigned char u[5] = {0,0,0,0,0};
214unsigned char* end = append(0x0448, u);
215assert (u[0] == 0xd1 && u[1] == 0x88 && u[2] == 0 && u[3] == 0 && u[4] == 0);
216```
217
218Note that `append` does not allocate any memory - it is the burden of the caller to make sure there is enough memory allocated for the operation. To make things more interesting, `append` can add anywhere between 1 and 4 octets to the sequence. In practice, you would most often want to use `std::back_inserter` to ensure that the necessary memory is allocated.
219
220In case of an invalid code point, a `utf8::invalid_code_point` exception is thrown.
221
222#### utf8::next
223
224Available in version 1.0 and later.
225
226Given the iterator to the beginning of the UTF-8 sequence, it returns the code point and moves the iterator to the next position.
227
228```cpp
229template <typename octet_iterator>
230uint32_t next(octet_iterator& it, octet_iterator end);
231```
232
233`octet_iterator`: an input iterator.
234`it`: a reference to an iterator pointing to the beginning of an UTF-8 encoded code point. After the function returns, it is incremented to point to the beginning of the next code point.
235`end`: end of the UTF-8 sequence to be processed. If `it` gets equal to `end` during the extraction of a code point, an `utf8::not_enough_room` exception is thrown.
236Return value: the 32 bit representation of the processed UTF-8 code point.
237
238Example of use:
239
240```cpp
241char* twochars = "\xe6\x97\xa5\xd1\x88";
242char* w = twochars;
243int cp = next(w, twochars + 6);
244assert (cp == 0x65e5);
245assert (w == twochars + 3);
246```
247
248This function is typically used to iterate through a UTF-8 encoded string.
249
250In case of an invalid UTF-8 seqence, a `utf8::invalid_utf8` exception is thrown.
251
252#### utf8::peek_next
253
254Available in version 2.1 and later.
255
256Given the iterator to the beginning of the UTF-8 sequence, it returns the code point for the following sequence without changing the value of the iterator.
257
258```cpp
259template <typename octet_iterator>
260uint32_t peek_next(octet_iterator it, octet_iterator end);
261```
262
263
264`octet_iterator`: an input iterator.
265`it`: an iterator pointing to the beginning of an UTF-8 encoded code point.
266`end`: end of the UTF-8 sequence to be processed. If `it` gets equal to `end` during the extraction of a code point, an `utf8::not_enough_room` exception is thrown.
267Return value: the 32 bit representation of the processed UTF-8 code point.
268
269Example of use:
270
271```cpp
272char* twochars = "\xe6\x97\xa5\xd1\x88";
273char* w = twochars;
274int cp = peek_next(w, twochars + 6);
275assert (cp == 0x65e5);
276assert (w == twochars);
277```
278
279In case of an invalid UTF-8 seqence, a `utf8::invalid_utf8` exception is thrown.
280
281#### utf8::prior
282
283Available in version 1.02 and later.
284
285Given a reference to an iterator pointing to an octet in a UTF-8 sequence, it decreases the iterator until it hits the beginning of the previous UTF-8 encoded code point and returns the 32 bits representation of the code point.
286
287```cpp
288template <typename octet_iterator>
289uint32_t prior(octet_iterator& it, octet_iterator start);
290```
291
292`octet_iterator`: a bidirectional iterator.
293`it`: a reference pointing to an octet within a UTF-8 encoded string. After the function returns, it is decremented to point to the beginning of the previous code point.
294`start`: an iterator to the beginning of the sequence where the search for the beginning of a code point is performed. It is a safety measure to prevent passing the beginning of the string in the search for a UTF-8 lead octet.
295 Return value: the 32 bit representation of the previous code point.
296
297Example of use:
298
299```cpp
300char* twochars = "\xe6\x97\xa5\xd1\x88";
301unsigned char* w = twochars + 3;
302int cp = prior (w, twochars);
303assert (cp == 0x65e5);
304assert (w == twochars);
305```
306
307This function has two purposes: one is two iterate backwards through a UTF-8 encoded string. Note that it is usually a better idea to iterate forward instead, since `utf8::next` is faster. The second purpose is to find a beginning of a UTF-8 sequence if we have a random position within a string. Note that in that case `utf8::prior` may not detect an invalid UTF-8 sequence in some scenarios: for instance if there are superfluous trail octets, it will just skip them.
308
309`it` will typically point to the beginning of a code point, and `start` will point to the beginning of the string to ensure we don't go backwards too far. `it` is decreased until it points to a lead UTF-8 octet, and then the UTF-8 sequence beginning with that octet is decoded to a 32 bit representation and returned.
310
311In case `start` is reached before a UTF-8 lead octet is hit, or if an invalid UTF-8 sequence is started by the lead octet, an `invalid_utf8` exception is thrown.
312
313In case `start` equals `it`, a `not_enough_room` exception is thrown.
314
315#### utf8::advance
316Available in version 1.0 and later.
317
318Advances an iterator by the specified number of code points within an UTF-8 sequence.
319
320```cpp
321template <typename octet_iterator, typename distance_type>
322void advance (octet_iterator& it, distance_type n, octet_iterator end);
323```
324
325`octet_iterator`: an input iterator.
326`distance_type`: an integral type convertible to `octet_iterator`'s difference type.
327`it`: a reference to an iterator pointing to the beginning of an UTF-8 encoded code point. After the function returns, it is incremented to point to the nth following code point.
328`n`: number of code points `it` should be advanced. A negative value means decrement.
329`end`: limit of the UTF-8 sequence to be processed. If `n` is positive and `it` gets equal to `end` during the extraction of a code point, an `utf8::not_enough_room` exception is thrown. If `n` is negative and `it` reaches `end` while `it` points t a trail byte of a UTF-8 sequence, a `utf8::invalid_code_point` exception is thrown.
330
331Example of use:
332
333```cpp
334char* twochars = "\xe6\x97\xa5\xd1\x88";
335unsigned char* w = twochars;
336advance (w, 2, twochars + 6);
337assert (w == twochars + 5);
338advance (w, -2, twochars);
339assert (w == twochars);
340```
341
342In case of an invalid code point, a `utf8::invalid_code_point` exception is thrown.
343
344#### utf8::distance
345
346Available in version 1.0 and later.
347
348Given the iterators to two UTF-8 encoded code points in a seqence, returns the number of code points between them.
349
350```cpp
351template <typename octet_iterator>
352typename std::iterator_traits<octet_iterator>::difference_type distance (octet_iterator first, octet_iterator last);
353```
354
355`octet_iterator`: an input iterator.
356`first`: an iterator to a beginning of a UTF-8 encoded code point.
357`last`: an iterator to a "post-end" of the last UTF-8 encoded code point in the sequence we are trying to determine the length. It can be the beginning of a new code point, or not.
358 Return value the distance between the iterators, in code points.
359
360Example of use:
361
362```cpp
363char* twochars = "\xe6\x97\xa5\xd1\x88";
364size_t dist = utf8::distance(twochars, twochars + 5);
365assert (dist == 2);
366```
367
368This function is used to find the length (in code points) of a UTF-8 encoded string. The reason it is called _distance_, rather than, say, _length_ is mainly because developers are used that _length_ is an O(1) function. Computing the length of an UTF-8 string is a linear operation, and it looked better to model it after `std::distance` algorithm.
369
370In case of an invalid UTF-8 seqence, a `utf8::invalid_utf8` exception is thrown. If `last` does not point to the past-of-end of a UTF-8 seqence, a `utf8::not_enough_room` exception is thrown.
371
372#### utf8::utf16to8
373
374Available in version 3.0 and later. Requires a C++ 11 compliant compiler.
375
376Converts a UTF-16 encoded string to UTF-8.
377
378```cpp
379std::string utf16to8(const std::u16string& s);
380```
381
382`s`: a UTF-16 encoded string.
383Return value: A UTF-8 encoded string.
384
385Example of use:
386
387```cpp
388    u16string utf16string = {0x41, 0x0448, 0x65e5, 0xd834, 0xdd1e};
389    string u = utf16to8(utf16string);
390    assert (u.size() == 10);
391```
392
393In case of invalid UTF-16 sequence, a `utf8::invalid_utf16` exception is thrown.
394
395
396#### utf8::utf16to8
397
398Available in version 1.0 and later.
399
400Converts a UTF-16 encoded string to UTF-8.
401
402```cpp
403template <typename u16bit_iterator, typename octet_iterator>
404octet_iterator utf16to8 (u16bit_iterator start, u16bit_iterator end, octet_iterator result);
405```
406
407`u16bit_iterator`: an input iterator.
408`octet_iterator`: an output iterator.
409`start`: an iterator pointing to the beginning of the UTF-16 encoded string to convert.
410`end`: an iterator pointing to pass-the-end of the UTF-16 encoded string to convert.
411`result`: an output iterator to the place in the UTF-8 string where to append the result of conversion.
412Return value: An iterator pointing to the place after the appended UTF-8 string.
413
414Example of use:
415
416```cpp
417unsigned short utf16string[] = {0x41, 0x0448, 0x65e5, 0xd834, 0xdd1e};
418vector<unsigned char> utf8result;
419utf16to8(utf16string, utf16string + 5, back_inserter(utf8result));
420assert (utf8result.size() == 10);
421```
422
423In case of invalid UTF-16 sequence, a `utf8::invalid_utf16` exception is thrown.
424
425#### utf8::utf8to16
426
427Available in version 3.0 and later. Requires a C++ 11 compliant compiler.
428
429Converts an UTF-8 encoded string to UTF-16.
430
431```cpp
432std::u16string utf8to16(const std::string& s);
433```
434
435`s`: an UTF-8 encoded string to convert.
436Return value: A UTF-16 encoded string
437
438Example of use:
439
440```cpp
441string utf8_with_surrogates = "\xe6\x97\xa5\xd1\x88\xf0\x9d\x84\x9e";
442u16string utf16result = utf8to16(utf8_with_surrogates);
443assert (utf16result.length() == 4);
444assert (utf16result[2] == 0xd834);
445assert (utf16result[3] == 0xdd1e);
446```
447
448In case of an invalid UTF-8 seqence, a `utf8::invalid_utf8` exception is thrown.
449
450#### utf8::utf8to16
451
452Available in version 1.0 and later.
453
454Converts an UTF-8 encoded string to UTF-16
455
456```cpp
457template <typename u16bit_iterator, typename octet_iterator>
458u16bit_iterator utf8to16 (octet_iterator start, octet_iterator end, u16bit_iterator result);
459```
460
461`octet_iterator`: an input iterator.
462`u16bit_iterator`: an output iterator.
463`start`: an iterator pointing to the beginning of the UTF-8 encoded string to convert. < br /> `end`: an iterator pointing to pass-the-end of the UTF-8 encoded string to convert.
464`result`: an output iterator to the place in the UTF-16 string where to append the result of conversion.
465Return value: An iterator pointing to the place after the appended UTF-16 string.
466
467Example of use:
468
469```cpp
470char utf8_with_surrogates[] = "\xe6\x97\xa5\xd1\x88\xf0\x9d\x84\x9e";
471vector <unsigned short> utf16result;
472utf8to16(utf8_with_surrogates, utf8_with_surrogates + 9, back_inserter(utf16result));
473assert (utf16result.size() == 4);
474assert (utf16result[2] == 0xd834);
475assert (utf16result[3] == 0xdd1e);
476```
477
478In case of an invalid UTF-8 seqence, a `utf8::invalid_utf8` exception is thrown. If `end` does not point to the past-of-end of a UTF-8 seqence, a `utf8::not_enough_room` exception is thrown.
479
480#### utf8::utf32to8
481
482Available in version 3.0 and later. Requires a C++ 11 compliant compiler.
483
484Converts a UTF-32 encoded string to UTF-8.
485
486```cpp
487std::string utf32to8(const std::u32string& s);
488```
489
490`s`: a UTF-32 encoded string.
491Return value: a UTF-8 encoded string.
492
493Example of use:
494
495```cpp
496u32string utf32string = {0x448, 0x65E5, 0x10346};
497string utf8result = utf32to8(utf32string);
498assert (utf8result.size() == 9);
499```
500
501In case of invalid UTF-32 string, a `utf8::invalid_code_point` exception is thrown.
502
503#### utf8::utf32to8
504
505Available in version 1.0 and later.
506
507Converts a UTF-32 encoded string to UTF-8.
508
509```cpp
510template <typename octet_iterator, typename u32bit_iterator>
511octet_iterator utf32to8 (u32bit_iterator start, u32bit_iterator end, octet_iterator result);
512```
513
514`octet_iterator`: an output iterator.
515`u32bit_iterator`: an input iterator.
516`start`: an iterator pointing to the beginning of the UTF-32 encoded string to convert.
517`end`: an iterator pointing to pass-the-end of the UTF-32 encoded string to convert.
518`result`: an output iterator to the place in the UTF-8 string where to append the result of conversion.
519Return value: An iterator pointing to the place after the appended UTF-8 string.
520
521Example of use:
522
523```cpp
524int utf32string[] = {0x448, 0x65E5, 0x10346, 0};
525vector<unsigned char> utf8result;
526utf32to8(utf32string, utf32string + 3, back_inserter(utf8result));
527assert (utf8result.size() == 9);
528```
529
530In case of invalid UTF-32 string, a `utf8::invalid_code_point` exception is thrown.
531
532#### utf8::utf8to32
533
534Available in version 3.0 and later. Requires a C++ 11 compliant compiler.
535
536Converts a UTF-8 encoded string to UTF-32.
537
538```cpp
539std::u32string utf8to32(const std::string& s);
540```
541
542`s`: a UTF-8 encoded string.
543Return value: a UTF-32 encoded string.
544
545Example of use:
546
547```cpp
548const char* twochars = "\xe6\x97\xa5\xd1\x88";
549u32string utf32result = utf8to32(twochars);
550assert (utf32result.size() == 2);
551```
552
553In case of an invalid UTF-8 seqence, a `utf8::invalid_utf8` exception is thrown.
554
555
556#### utf8::utf8to32
557
558Available in version 1.0 and later.
559
560Converts a UTF-8 encoded string to UTF-32.
561
562```cpp
563template <typename octet_iterator, typename u32bit_iterator>
564u32bit_iterator utf8to32 (octet_iterator start, octet_iterator end, u32bit_iterator result);
565```
566
567`octet_iterator`: an input iterator.
568`u32bit_iterator`: an output iterator.
569`start`: an iterator pointing to the beginning of the UTF-8 encoded string to convert.
570`end`: an iterator pointing to pass-the-end of the UTF-8 encoded string to convert.
571`result`: an output iterator to the place in the UTF-32 string where to append the result of conversion.
572Return value: An iterator pointing to the place after the appended UTF-32 string.
573
574Example of use:
575
576```cpp
577char* twochars = "\xe6\x97\xa5\xd1\x88";
578vector<int> utf32result;
579utf8to32(twochars, twochars + 5, back_inserter(utf32result));
580assert (utf32result.size() == 2);
581```
582
583In case of an invalid UTF-8 seqence, a `utf8::invalid_utf8` exception is thrown. If `end` does not point to the past-of-end of a UTF-8 seqence, a `utf8::not_enough_room` exception is thrown.
584
585#### utf8::find_invalid
586
587Available in version 3.0 and later. Requires a C++ 11 compliant compiler.
588
589Detects an invalid sequence within a UTF-8 string.
590
591```cpp
592std::size_t find_invalid(const std::string& s);
593```
594
595`s`: a UTF-8 encoded string.
596Return value: the index of the first invalid octet in the UTF-8 string. In case none were found, equals `std::string::npos`.
597
598Example of use:
599
600```cpp
601string utf_invalid = "\xe6\x97\xa5\xd1\x88\xfa";
602auto invalid = find_invalid(utf_invalid);
603assert (invalid == 5);
604```
605
606This function is typically used to make sure a UTF-8 string is valid before processing it with other functions. It is especially important to call it if before doing any of the _unchecked_ operations on it.
607
608#### utf8::find_invalid
609
610Available in version 1.0 and later.
611
612Detects an invalid sequence within a UTF-8 string.
613
614```cpp
615template <typename octet_iterator>
616octet_iterator find_invalid(octet_iterator start, octet_iterator end);
617```
618
619`octet_iterator`: an input iterator.
620`start`: an iterator pointing to the beginning of the UTF-8 string to test for validity.
621`end`: an iterator pointing to pass-the-end of the UTF-8 string to test for validity.
622Return value: an iterator pointing to the first invalid octet in the UTF-8 string. In case none were found, equals `end`.
623
624Example of use:
625
626```cpp
627char utf_invalid[] = "\xe6\x97\xa5\xd1\x88\xfa";
628char* invalid = find_invalid(utf_invalid, utf_invalid + 6);
629assert (invalid == utf_invalid + 5);
630```
631
632This function is typically used to make sure a UTF-8 string is valid before processing it with other functions. It is especially important to call it if before doing any of the _unchecked_ operations on it.
633
634#### utf8::is_valid
635
636Available in version 3.0 and later. Requires a C++ 11 compliant compiler.
637
638Checks whether a string object contains valid UTF-8 encoded text.
639
640```cpp
641bool is_valid(const std::string& s);
642```
643
644`s`: a UTF-8 encoded string.
645Return value: `true` if the string contains valid UTF-8 encoded text; `false` if not.
646
647Example of use:
648
649```cpp
650char utf_invalid[] = "\xe6\x97\xa5\xd1\x88\xfa";
651bool bvalid = is_valid(utf_invalid);
652assert (bvalid == false);
653```
654
655You may want to use `is_valid` to make sure that a string contains valid UTF-8 text without the need to know where it fails if it is not valid.
656
657#### utf8::is_valid
658
659Available in version 1.0 and later.
660
661Checks whether a sequence of octets is a valid UTF-8 string.
662
663```cpp
664template <typename octet_iterator>
665bool is_valid(octet_iterator start, octet_iterator end);
666```
667
668`octet_iterator`: an input iterator.
669`start`: an iterator pointing to the beginning of the UTF-8 string to test for validity.
670`end`: an iterator pointing to pass-the-end of the UTF-8 string to test for validity.
671Return value: `true` if the sequence is a valid UTF-8 string; `false` if not.
672
673Example of use:
674
675```cpp
676char utf_invalid[] = "\xe6\x97\xa5\xd1\x88\xfa";
677bool bvalid = is_valid(utf_invalid, utf_invalid + 6);
678assert (bvalid == false);
679```
680
681`is_valid` is a shorthand for `find_invalid(start, end) == end;`. You may want to use it to make sure that a byte seqence is a valid UTF-8 string without the need to know where it fails if it is not valid.
682
683#### utf8::replace_invalid
684
685Available in version 3.0 and later. Requires a C++ 11 compliant compiler.
686
687Replaces all invalid UTF-8 sequences within a string with a replacement marker.
688
689```cpp
690std::string replace_invalid(const std::string& s, char32_t replacement);
691std::string replace_invalid(const std::string& s);
692```
693
694`s`: a UTF-8 encoded string.
695`replacement`: A Unicode code point for the replacement marker. The version without this parameter assumes the value `0xfffd`
696Return value: A UTF-8 encoded string with replaced invalid sequences.
697
698Example of use:
699
700```cpp
701string invalid_sequence = "a\x80\xe0\xa0\xc0\xaf\xed\xa0\x80z";
702string replace_invalid_result = replace_invalid(invalid_sequence, '?');
703bvalid = is_valid(replace_invalid_result);
704assert (bvalid);
705const string fixed_invalid_sequence = "a????z";
706assert (fixed_invalid_sequence == replace_invalid_result);
707```
708
709#### utf8::replace_invalid
710
711Available in version 2.0 and later.
712
713Replaces all invalid UTF-8 sequences within a string with a replacement marker.
714
715```cpp
716template <typename octet_iterator, typename output_iterator>
717output_iterator replace_invalid(octet_iterator start, octet_iterator end, output_iterator out, uint32_t replacement);
718template <typename octet_iterator, typename output_iterator>
719output_iterator replace_invalid(octet_iterator start, octet_iterator end, output_iterator out);
720```
721
722`octet_iterator`: an input iterator.
723`output_iterator`: an output iterator.
724`start`: an iterator pointing to the beginning of the UTF-8 string to look for invalid UTF-8 sequences.
725`end`: an iterator pointing to pass-the-end of the UTF-8 string to look for invalid UTF-8 sequences.
726`out`: An output iterator to the range where the result of replacement is stored.
727`replacement`: A Unicode code point for the replacement marker. The version without this parameter assumes the value `0xfffd`
728Return value: An iterator pointing to the place after the UTF-8 string with replaced invalid sequences.
729
730Example of use:
731
732```cpp
733char invalid_sequence[] = "a\x80\xe0\xa0\xc0\xaf\xed\xa0\x80z";
734vector<char> replace_invalid_result;
735replace_invalid (invalid_sequence, invalid_sequence + sizeof(invalid_sequence), back_inserter(replace_invalid_result), '?');
736bvalid = is_valid(replace_invalid_result.begin(), replace_invalid_result.end());
737assert (bvalid);
738char* fixed_invalid_sequence = "a????z";
739assert (std::equal(replace_invalid_result.begin(), replace_invalid_result.end(), fixed_invalid_sequence));
740```
741
742`replace_invalid` does not perform in-place replacement of invalid sequences. Rather, it produces a copy of the original string with the invalid sequences replaced with a replacement marker. Therefore, `out` must not be in the `[start, end]` range.
743
744#### utf8::starts_with_bom
745
746Available in version 3.0 and later. Requires a C++ 11 compliant compiler.
747
748Checks whether a string starts with a UTF-8 byte order mark (BOM)
749
750```cpp
751bool starts_with_bom(const std::string& s);
752```
753
754`s`: a UTF-8 encoded string.
755Return value: `true` if the string starts with a UTF-8 byte order mark; `false` if not.
756
757Example of use:
758
759```cpp
760string byte_order_mark = {char(0xef), char(0xbb), char(0xbf)};
761bool bbom = starts_with_bom(byte_order_mark);
762assert (bbom == true);
763string threechars = "\xf0\x90\x8d\x86\xe6\x97\xa5\xd1\x88";
764bool no_bbom = starts_with_bom(threechars);
765assert (no_bbom == false);
766 ```
767
768The typical use of this function is to check the first three bytes of a file. If they form the UTF-8 BOM, we want to skip them before processing the actual UTF-8 encoded text.
769
770
771#### utf8::starts_with_bom
772
773Available in version 2.3 and later.
774
775Checks whether an octet sequence starts with a UTF-8 byte order mark (BOM)
776
777```cpp
778template <typename octet_iterator>
779bool starts_with_bom (octet_iterator it, octet_iterator end);
780```
781
782`octet_iterator`: an input iterator.
783`it`: beginning of the octet sequence to check
784`end`: pass-end of the sequence to check
785Return value: `true` if the sequence starts with a UTF-8 byte order mark; `false` if not.
786
787Example of use:
788
789```cpp
790unsigned char byte_order_mark[] = {0xef, 0xbb, 0xbf};
791bool bbom = starts_with_bom(byte_order_mark, byte_order_mark + sizeof(byte_order_mark));
792assert (bbom == true);
793```
794
795The typical use of this function is to check the first three bytes of a file. If they form the UTF-8 BOM, we want to skip them before processing the actual UTF-8 encoded text.
796
797### Types From utf8 Namespace
798
799#### utf8::exception
800
801Available in version 2.3 and later.
802
803Base class for the exceptions thrown by UTF CPP library functions.
804
805```cpp
806class exception : public std::exception {};
807```
808
809Example of use:
810
811```cpp
812try {
813  code_that_uses_utf_cpp_library();
814}
815catch(const utf8::exception& utfcpp_ex) {
816  cerr << utfcpp_ex.what();
817}
818```
819
820#### utf8::invalid_code_point
821
822Available in version 1.0 and later.
823
824Thrown by UTF8 CPP functions such as `advance` and `next` if an UTF-8 sequence represents and invalid code point.
825
826```cpp
827class invalid_code_point : public exception {
828public:
829    uint32_t code_point() const;
830};
831```
832
833Member function `code_point()` can be used to determine the invalid code point that caused the exception to be thrown.
834
835#### utf8::invalid_utf8
836
837Available in version 1.0 and later.
838
839Thrown by UTF8 CPP functions such as `next` and `prior` if an invalid UTF-8 sequence is detected during decoding.
840
841```cpp
842class invalid_utf8 : public exception {
843public:
844    uint8_t utf8_octet() const;
845};
846```
847
848Member function `utf8_octet()` can be used to determine the beginning of the byte sequence that caused the exception to be thrown.
849
850#### utf8::invalid_utf16
851
852Available in version 1.0 and later.
853
854Thrown by UTF8 CPP function `utf16to8` if an invalid UTF-16 sequence is detected during decoding.
855
856```cpp
857class invalid_utf16 : public exception {
858public:
859    uint16_t utf16_word() const;
860};
861```
862
863Member function `utf16_word()` can be used to determine the UTF-16 code unit that caused the exception to be thrown.
864
865#### utf8::not_enough_room
866
867Available in version 1.0 and later.
868
869Thrown by UTF8 CPP functions such as `next` if the end of the decoded UTF-8 sequence was reached before the code point was decoded.
870
871```cpp
872class not_enough_room : public exception {};
873```
874
875#### utf8::iterator
876
877Available in version 2.0 and later.
878
879Adapts the underlying octet iterator to iterate over the sequence of code points, rather than raw octets.
880
881```cpp
882template <typename octet_iterator>
883class iterator;
884```
885
886##### Member functions
887
888`iterator();` the deafult constructor; the underlying octet_iterator is constructed with its default constructor.
889
890`explicit iterator (const octet_iterator& octet_it, const octet_iterator& range_start, const octet_iterator& range_end);` a constructor that initializes the underlying octet_iterator with octet_it and sets the range in which the iterator is considered valid.
891
892`octet_iterator base () const;` returns the underlying octet_iterator.
893
894`uint32_t operator * () const;` decodes the utf-8 sequence the underlying octet_iterator is pointing to and returns the code point.
895
896`bool operator == (const iterator& rhs) const;` returns `true` if the two underlaying iterators are equal.
897
898`bool operator != (const iterator& rhs) const;` returns `true` if the two underlaying iterators are not equal.
899
900`iterator& operator ++ ();` the prefix increment - moves the iterator to the next UTF-8 encoded code point.
901
902`iterator operator ++ (int);` the postfix increment - moves the iterator to the next UTF-8 encoded code point and returns the current one.
903
904`iterator& operator -- ();` the prefix decrement - moves the iterator to the previous UTF-8 encoded code point.
905
906`iterator operator -- (int);` the postfix decrement - moves the iterator to the previous UTF-8 encoded code point and returns the current one.
907
908Example of use:
909
910```cpp
911char* threechars = "\xf0\x90\x8d\x86\xe6\x97\xa5\xd1\x88";
912utf8::iterator<char*> it(threechars, threechars, threechars + 9);
913utf8::iterator<char*> it2 = it;
914assert (it2 == it);
915assert (*it == 0x10346);
916assert (*(++it) == 0x65e5);
917assert ((*it++) == 0x65e5);
918assert (*it == 0x0448);
919assert (it != it2);
920utf8::iterator<char*> endit (threechars + 9, threechars, threechars + 9);
921assert (++it == endit);
922assert (*(--it) == 0x0448);
923assert ((*it--) == 0x0448);
924assert (*it == 0x65e5);
925assert (--it == utf8::iterator<char*>(threechars, threechars, threechars + 9));
926assert (*it == 0x10346);
927```
928
929The purpose of `utf8::iterator` adapter is to enable easy iteration as well as the use of STL algorithms with UTF-8 encoded strings. Increment and decrement operators are implemented in terms of `utf8::next()` and `utf8::prior()` functions.
930
931Note that `utf8::iterator` adapter is a checked iterator. It operates on the range specified in the constructor; any attempt to go out of that range will result in an exception. Even the comparison operators require both iterator object to be constructed against the same range - otherwise an exception is thrown. Typically, the range will be determined by sequence container functions `begin` and `end`, i.e.:
932
933```cpp
934std::string s = "example";
935utf8::iterator i (s.begin(), s.begin(), s.end());
936```
937
938### Functions From utf8::unchecked Namespace
939
940#### utf8::unchecked::append
941
942Available in version 1.0 and later.
943
944Encodes a 32 bit code point as a UTF-8 sequence of octets and appends the sequence to a UTF-8 string.
945
946```cpp
947template <typename octet_iterator>
948octet_iterator append(uint32_t cp, octet_iterator result);
949```
950
951`cp`: A 32 bit integer representing a code point to append to the sequence.
952`result`: An output iterator to the place in the sequence where to append the code point.
953Return value: An iterator pointing to the place after the newly appended sequence.
954
955Example of use:
956
957```cpp
958unsigned char u[5] = {0,0,0,0,0};
959unsigned char* end = unchecked::append(0x0448, u);
960assert (u[0] == 0xd1 && u[1] == 0x88 && u[2] == 0 && u[3] == 0 && u[4] == 0);
961```
962
963This is a faster but less safe version of `utf8::append`. It does not check for validity of the supplied code point, and may produce an invalid UTF-8 sequence.
964
965#### utf8::unchecked::next
966
967Available in version 1.0 and later.
968
969Given the iterator to the beginning of a UTF-8 sequence, it returns the code point and moves the iterator to the next position.
970
971```cpp
972template <typename octet_iterator>
973uint32_t next(octet_iterator& it);
974```
975
976`it`: a reference to an iterator pointing to the beginning of an UTF-8 encoded code point. After the function returns, it is incremented to point to the beginning of the next code point.
977 Return value: the 32 bit representation of the processed UTF-8 code point.
978
979Example of use:
980
981```cpp
982char* twochars = "\xe6\x97\xa5\xd1\x88";
983char* w = twochars;
984int cp = unchecked::next(w);
985assert (cp == 0x65e5);
986assert (w == twochars + 3);
987```
988
989This is a faster but less safe version of `utf8::next`. It does not check for validity of the supplied UTF-8 sequence.
990
991#### utf8::unchecked::peek_next
992
993Available in version 2.1 and later.
994
995Given the iterator to the beginning of a UTF-8 sequence, it returns the code point.
996
997```cpp
998template <typename octet_iterator>
999uint32_t peek_next(octet_iterator it);
1000```
1001
1002`it`: an iterator pointing to the beginning of an UTF-8 encoded code point.
1003Return value: the 32 bit representation of the processed UTF-8 code point.
1004
1005Example of use:
1006
1007```cpp
1008char* twochars = "\xe6\x97\xa5\xd1\x88";
1009char* w = twochars;
1010int cp = unchecked::peek_next(w);
1011assert (cp == 0x65e5);
1012assert (w == twochars);
1013```
1014
1015This is a faster but less safe version of `utf8::peek_next`. It does not check for validity of the supplied UTF-8 sequence.
1016
1017#### utf8::unchecked::prior
1018
1019Available in version 1.02 and later.
1020
1021Given a reference to an iterator pointing to an octet in a UTF-8 seqence, it decreases the iterator until it hits the beginning of the previous UTF-8 encoded code point and returns the 32 bits representation of the code point.
1022
1023```cpp
1024template <typename octet_iterator>
1025uint32_t prior(octet_iterator& it);
1026```
1027
1028`it`: a reference pointing to an octet within a UTF-8 encoded string. After the function returns, it is decremented to point to the beginning of the previous code point.
1029 Return value: the 32 bit representation of the previous code point.
1030
1031Example of use:
1032
1033```cpp
1034char* twochars = "\xe6\x97\xa5\xd1\x88";
1035char* w = twochars + 3;
1036int cp = unchecked::prior (w);
1037assert (cp == 0x65e5);
1038assert (w == twochars);
1039```
1040
1041This is a faster but less safe version of `utf8::prior`. It does not check for validity of the supplied UTF-8 sequence and offers no boundary checking.
1042
1043#### utf8::unchecked::advance
1044
1045Available in version 1.0 and later.
1046
1047Advances an iterator by the specified number of code points within an UTF-8 sequence.
1048
1049```cpp
1050template <typename octet_iterator, typename distance_type>
1051void advance (octet_iterator& it, distance_type n);
1052```
1053
1054`it`: a reference to an iterator pointing to the beginning of an UTF-8 encoded code point. After the function returns, it is incremented to point to the nth following code point.
1055`n`: number of code points `it` should be advanced. A negative value means decrement.
1056
1057Example of use:
1058
1059```cpp
1060char* twochars = "\xe6\x97\xa5\xd1\x88";
1061char* w = twochars;
1062unchecked::advance (w, 2);
1063assert (w == twochars + 5);
1064```
1065
1066This is a faster but less safe version of `utf8::advance`. It does not check for validity of the supplied UTF-8 sequence and offers no boundary checking.
1067
1068#### utf8::unchecked::distance
1069
1070Available in version 1.0 and later.
1071
1072Given the iterators to two UTF-8 encoded code points in a seqence, returns the number of code points between them.
1073
1074```cpp
1075template <typename octet_iterator>
1076typename std::iterator_traits<octet_iterator>::difference_type distance (octet_iterator first, octet_iterator last);
1077```
1078
1079`first`: an iterator to a beginning of a UTF-8 encoded code point.
1080`last`: an iterator to a "post-end" of the last UTF-8 encoded code point in the sequence we are trying to determine the length. It can be the beginning of a new code point, or not.
1081Return value: the distance between the iterators, in code points.
1082
1083Example of use:
1084
1085```cpp
1086char* twochars = "\xe6\x97\xa5\xd1\x88";
1087size_t dist = utf8::unchecked::distance(twochars, twochars + 5);
1088assert (dist == 2);
1089```
1090
1091This is a faster but less safe version of `utf8::distance`. It does not check for validity of the supplied UTF-8 sequence.
1092
1093#### utf8::unchecked::utf16to8
1094
1095Available in version 1.0 and later.
1096
1097Converts a UTF-16 encoded string to UTF-8.
1098
1099```cpp
1100template <typename u16bit_iterator, typename octet_iterator>
1101octet_iterator utf16to8 (u16bit_iterator start, u16bit_iterator end, octet_iterator result);
1102```
1103
1104`start`: an iterator pointing to the beginning of the UTF-16 encoded string to convert.
1105`end`: an iterator pointing to pass-the-end of the UTF-16 encoded string to convert.
1106`result`: an output iterator to the place in the UTF-8 string where to append the result of conversion.
1107Return value: An iterator pointing to the place after the appended UTF-8 string.
1108
1109Example of use:
1110
1111```cpp
1112unsigned short utf16string[] = {0x41, 0x0448, 0x65e5, 0xd834, 0xdd1e};
1113vector<unsigned char> utf8result;
1114unchecked::utf16to8(utf16string, utf16string + 5, back_inserter(utf8result));
1115assert (utf8result.size() == 10);
1116```
1117
1118This is a faster but less safe version of `utf8::utf16to8`. It does not check for validity of the supplied UTF-16 sequence.
1119
1120#### utf8::unchecked::utf8to16
1121
1122Available in version 1.0 and later.
1123
1124Converts an UTF-8 encoded string to UTF-16
1125
1126```cpp
1127template <typename u16bit_iterator, typename octet_iterator>
1128u16bit_iterator utf8to16 (octet_iterator start, octet_iterator end, u16bit_iterator result);
1129```
1130
1131`start`: an iterator pointing to the beginning of the UTF-8 encoded string to convert. < br /> `end`: an iterator pointing to pass-the-end of the UTF-8 encoded string to convert.
1132`result`: an output iterator to the place in the UTF-16 string where to append the result of conversion.
1133Return value: An iterator pointing to the place after the appended UTF-16 string.
1134
1135Example of use:
1136
1137```cpp
1138char utf8_with_surrogates[] = "\xe6\x97\xa5\xd1\x88\xf0\x9d\x84\x9e";
1139vector <unsigned short> utf16result;
1140unchecked::utf8to16(utf8_with_surrogates, utf8_with_surrogates + 9, back_inserter(utf16result));
1141assert (utf16result.size() == 4);
1142assert (utf16result[2] == 0xd834);
1143assert (utf16result[3] == 0xdd1e);
1144```
1145
1146This is a faster but less safe version of `utf8::utf8to16`. It does not check for validity of the supplied UTF-8 sequence.
1147
1148#### utf8::unchecked::utf32to8
1149
1150Available in version 1.0 and later.
1151
1152Converts a UTF-32 encoded string to UTF-8.
1153
1154```cpp
1155template <typename octet_iterator, typename u32bit_iterator>
1156octet_iterator utf32to8 (u32bit_iterator start, u32bit_iterator end, octet_iterator result);
1157```
1158
1159`start`: an iterator pointing to the beginning of the UTF-32 encoded string to convert.
1160`end`: an iterator pointing to pass-the-end of the UTF-32 encoded string to convert.
1161`result`: an output iterator to the place in the UTF-8 string where to append the result of conversion.
1162Return value: An iterator pointing to the place after the appended UTF-8 string.
1163
1164Example of use:
1165
1166```cpp
1167int utf32string[] = {0x448, 0x65e5, 0x10346, 0};
1168vector<unsigned char> utf8result;
1169utf32to8(utf32string, utf32string + 3, back_inserter(utf8result));
1170assert (utf8result.size() == 9);
1171```
1172
1173This is a faster but less safe version of `utf8::utf32to8`. It does not check for validity of the supplied UTF-32 sequence.
1174
1175#### utf8::unchecked::utf8to32
1176
1177Available in version 1.0 and later.
1178
1179Converts a UTF-8 encoded string to UTF-32.
1180
1181```cpp
1182template <typename octet_iterator, typename u32bit_iterator>
1183u32bit_iterator utf8to32 (octet_iterator start, octet_iterator end, u32bit_iterator result);
1184```
1185
1186`start`: an iterator pointing to the beginning of the UTF-8 encoded string to convert.
1187`end`: an iterator pointing to pass-the-end of the UTF-8 encoded string to convert.
1188`result`: an output iterator to the place in the UTF-32 string where to append the result of conversion.
1189Return value: An iterator pointing to the place after the appended UTF-32 string.
1190
1191Example of use:
1192
1193```cpp
1194char* twochars = "\xe6\x97\xa5\xd1\x88";
1195vector<int> utf32result;
1196unchecked::utf8to32(twochars, twochars + 5, back_inserter(utf32result));
1197assert (utf32result.size() == 2);
1198```
1199
1200This is a faster but less safe version of `utf8::utf8to32`. It does not check for validity of the supplied UTF-8 sequence.
1201
1202#### utf8::unchecked::replace_invalid
1203
1204Available in version 3.1 and later.
1205
1206Replaces all invalid UTF-8 sequences within a string with a replacement marker.
1207
1208```cpp
1209template <typename octet_iterator, typename output_iterator>
1210output_iterator replace_invalid(octet_iterator start, octet_iterator end, output_iterator out, uint32_t replacement);
1211template <typename octet_iterator, typename output_iterator>
1212output_iterator replace_invalid(octet_iterator start, octet_iterator end, output_iterator out);
1213```
1214
1215`octet_iterator`: an input iterator.
1216`output_iterator`: an output iterator.
1217`start`: an iterator pointing to the beginning of the UTF-8 string to look for invalid UTF-8 sequences.
1218`end`: an iterator pointing to pass-the-end of the UTF-8 string to look for invalid UTF-8 sequences.
1219`out`: An output iterator to the range where the result of replacement is stored.
1220`replacement`: A Unicode code point for the replacement marker. The version without this parameter assumes the value `0xfffd`
1221Return value: An iterator pointing to the place after the UTF-8 string with replaced invalid sequences.
1222
1223Example of use:
1224
1225```cpp
1226char invalid_sequence[] = "a\x80\xe0\xa0\xc0\xaf\xed\xa0\x80z";
1227vector<char> replace_invalid_result;
1228unchecked::replace_invalid (invalid_sequence, invalid_sequence + sizeof(invalid_sequence), back_inserter(replace_invalid_result), '?');
1229bvalid = utf8::is_valid(replace_invalid_result.begin(), replace_invalid_result.end());
1230assert (bvalid);
1231char* fixed_invalid_sequence = "a????z";
1232assert (std::equal(replace_invalid_result.begin(), replace_invalid_result.end(), fixed_invalid_sequence));
1233```
1234
1235`replace_invalid` does not perform in-place replacement of invalid sequences. Rather, it produces a copy of the original string with the invalid sequences replaced with a replacement marker. Therefore, `out` must not be in the `[start, end]` range.
1236
1237Unlike `utf8::replace_invalid`, this function does not verify validity of the replacement marker.
1238
1239### Types From utf8::unchecked Namespace
1240
1241#### utf8::iterator
1242
1243Available in version 2.0 and later.
1244
1245Adapts the underlying octet iterator to iterate over the sequence of code points, rather than raw octets.
1246
1247```cpp
1248template <typename octet_iterator>
1249class iterator;
1250```
1251
1252##### Member functions
1253
1254`iterator();` the deafult constructor; the underlying octet_iterator is constructed with its default constructor.
1255
1256`explicit iterator (const octet_iterator& octet_it);` a constructor that initializes the underlying octet_iterator with `octet_it`.
1257
1258`octet_iterator base () const;` returns the underlying octet_iterator.
1259
1260`uint32_t operator * () const;` decodes the utf-8 sequence the underlying octet_iterator is pointing to and returns the code point.
1261
1262`bool operator == (const iterator& rhs) const;` returns `true` if the two underlaying iterators are equal.
1263
1264`bool operator != (const iterator& rhs) const;` returns `true` if the two underlaying iterators are not equal.
1265
1266`iterator& operator ++ ();` the prefix increment - moves the iterator to the next UTF-8 encoded code point.
1267
1268`iterator operator ++ (int);` the postfix increment - moves the iterator to the next UTF-8 encoded code point and returns the current one.
1269
1270`iterator& operator -- ();` the prefix decrement - moves the iterator to the previous UTF-8 encoded code point.
1271
1272`iterator operator -- (int);` the postfix decrement - moves the iterator to the previous UTF-8 encoded code point and returns the current one.
1273
1274Example of use:
1275
1276```cpp
1277char* threechars = "\xf0\x90\x8d\x86\xe6\x97\xa5\xd1\x88";
1278utf8::unchecked::iterator<char*> un_it(threechars);
1279utf8::unchecked::iterator<char*> un_it2 = un_it;
1280assert (un_it2 == un_it);
1281assert (*un_it == 0x10346);
1282assert (*(++un_it) == 0x65e5);
1283assert ((*un_it++) == 0x65e5);
1284assert (*un_it == 0x0448);
1285assert (un_it != un_it2);
1286utf8::::unchecked::iterator<char*> un_endit (threechars + 9);
1287assert (++un_it == un_endit);
1288assert (*(--un_it) == 0x0448);
1289assert ((*un_it--) == 0x0448);
1290assert (*un_it == 0x65e5);
1291assert (--un_it == utf8::unchecked::iterator<char*>(threechars));
1292assert (*un_it == 0x10346);
1293```
1294
1295This is an unchecked version of `utf8::iterator`. It is faster in many cases, but offers no validity or range checks.
1296
1297## Links
1298
12991.  [The Unicode Consortium](http://www.unicode.org/).
13002.  [ICU Library](http://icu.sourceforge.net/).
13013.  [UTF-8 at Wikipedia](http://en.wikipedia.org/wiki/UTF-8)
13024.  [UTF-8 and Unicode FAQ for Unix/Linux](http://www.cl.cam.ac.uk/~mgk25/unicode.html)
1303