.\"	$OpenBSD: utf8.7,v 1.9 2022/02/18 10:24:32 jsg Exp $
.\"
.\" Copyright (c) 2017 Ted Unangst <tedu@openbsd.org>
.\"
.\" Permission to use, copy, modify, and distribute this software for any
.\" purpose with or without fee is hereby granted, provided that the above
.\" copyright notice and this permission notice appear in all copies.
.\"
.\" THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
.\" WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
.\" MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
.\" ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
.\" WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
.\" ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
.\" OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
.\"
.Dd $Mdocdate: February 18 2022 $
.Dt UTF8 7
.Os
.Sh NAME
.Nm utf8
.Nd UTF-8 text encoding
.Sh DESCRIPTION
UTF-8 is a multibyte character encoding for Unicode text.
It is the preferred format for non-ASCII text.
.Pp
Unicode codepoints are encoded as follows:
.Bl -tag -width Ds
.It U+0000 \(en U+007F:
One byte: 0....... (compatible with ASCII)
.It U+0080 \(en U+07FF:
Two bytes: 110..... 10......
.It U+0800 \(en U+D7FF and U+E000 \(en U+FFFF:
Three bytes: 1110.... 10...... 10......
.It U+10000 \(en U+10FFFF:
Four bytes: 11110... 10...... 10...... 10......
.El
.Pp
The bits shown as dots contain the codepoint represented as a binary
integer.
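.Pp
As an illustration, the table above can be turned into a small C
encoder.
The following is a minimal sketch, not part of any system library;
the name
.Fn utf8_encode
and its return convention are assumptions made for this example.
.Bd -literal -offset indent
```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one codepoint into buf, which must hold at least 4 bytes.
 * Returns the number of bytes written, or 0 if cp is a surrogate
 * or lies beyond U+10FFFF.
 * utf8_encode is a hypothetical name, not a libc function.
 */
size_t
utf8_encode(uint32_t cp, unsigned char *buf)
{
	if (cp <= 0x7F) {
		buf[0] = cp;				/* 0....... */
		return 1;
	}
	if (cp <= 0x7FF) {
		buf[0] = 0xC0 | (cp >> 6);		/* 110..... */
		buf[1] = 0x80 | (cp & 0x3F);		/* 10...... */
		return 2;
	}
	if (cp <= 0xFFFF) {
		if (cp >= 0xD800 && cp <= 0xDFFF)
			return 0;			/* excluded range */
		buf[0] = 0xE0 | (cp >> 12);		/* 1110.... */
		buf[1] = 0x80 | ((cp >> 6) & 0x3F);
		buf[2] = 0x80 | (cp & 0x3F);
		return 3;
	}
	if (cp <= 0x10FFFF) {
		buf[0] = 0xF0 | (cp >> 18);		/* 11110... */
		buf[1] = 0x80 | ((cp >> 12) & 0x3F);
		buf[2] = 0x80 | ((cp >> 6) & 0x3F);
		buf[3] = 0x80 | (cp & 0x3F);
		return 4;
	}
	return 0;
}
```
.Ed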
.Pp
Bytes starting with the bit pattern 11...... are called UTF-8 start
bytes, and those starting with 10...... UTF-8 continuation bytes.
The number of leading 1 bits in a start byte indicates the total
number of bytes used to encode the codepoint, including the start
byte.
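.Pp
Counting the leading 1 bits can be sketched in C as follows.
The name
.Fn utf8_seqlen
and the convention of returning 0 for a continuation byte are
assumptions made for this example.
The function only counts bits and does not reject start bytes that
are invalid for other reasons.
.Bd -literal -offset indent
```c
/*
 * Return the sequence length implied by the leading 1 bits of b:
 * 1 for an ASCII byte, 0 for a continuation byte, and the raw
 * bit count (2 to 8) for a start byte.
 * utf8_seqlen is a hypothetical name.
 */
int
utf8_seqlen(unsigned char b)
{
	int n = 0;

	/* Shift left until the top bit is clear, counting the 1 bits. */
	while (b & 0x80) {
		n++;
		b <<= 1;
	}
	if (n == 0)
		return 1;	/* 0.......: single ASCII byte */
	if (n == 1)
		return 0;	/* 10......: continuation byte */
	return n;		/* 11......: start byte */
}
```
.Ed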
.Pp
Encodings using more bytes than required are invalid.
In particular, 11000000 and 11000001 are not valid start bytes,
the byte after 11100000 must be at least 10100000,
and the byte after 11110000 must be at least 10010000.
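.Pp
One way to see where these byte bounds come from is to decode the
payload bits and compare the result with the smallest codepoint that
needs that many bytes.
The sketch below assumes that the input already matches the start and
continuation bit patterns; the names
.Fn utf8_is_overlong
and
.Va utf8_min
are hypothetical.
.Bd -literal -offset indent
```c
#include <stddef.h>
#include <stdint.h>

/* Smallest codepoint that needs 1, 2, 3 and 4 bytes (index 0 unused). */
static const uint32_t utf8_min[5] = { 0, 0x0, 0x80, 0x800, 0x10000 };

/*
 * Return 1 if the len-byte sequence at s encodes a codepoint that
 * would fit in fewer bytes, 0 otherwise.  Only lengths 2 to 4 are
 * examined; the bit patterns of the bytes are assumed correct.
 */
int
utf8_is_overlong(const unsigned char *s, size_t len)
{
	uint32_t cp;
	size_t i;

	if (len < 2 || len > 4)
		return 0;
	/* Payload bits of the start byte: 5, 4 or 3 of them. */
	cp = s[0] & (0xFF >> (len + 1));
	/* Each continuation byte contributes 6 more bits. */
	for (i = 1; i < len; i++)
		cp = (cp << 6) | (s[i] & 0x3F);
	return cp < utf8_min[len];
}
```
.Ed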
.Pp
The ranges U+D800 to U+DFFF and U+110000 to U+1FFFFF
do not contain valid Unicode codepoints.
Consequently, the corresponding three- and four-byte UTF-8 sequences
are invalid.
The highest valid byte after 11101101 is 10011111,
the highest valid byte of the form 1111.... is 11110100,
and the highest valid byte after 11110100 is 10001111.
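.Pp
At the codepoint level, the same restriction is a simple range test.
The following sketch uses the hypothetical name
.Fn unicode_cp_valid ;
the comments relate the range limits to the byte bounds given above.
.Bd -literal -offset indent
```c
#include <stdint.h>

/*
 * Return 1 if cp may be encoded as UTF-8, 0 otherwise.
 * unicode_cp_valid is a hypothetical name.
 *
 * U+D7FF encodes as ed 9f bf, so 10011111 (9f) is the highest
 * valid byte after 11101101 (ed).
 * U+10FFFF encodes as f4 8f bf bf, so 10001111 (8f) is the
 * highest valid byte after 11110100 (f4).
 */
int
unicode_cp_valid(uint32_t cp)
{
	if (cp >= 0xD800 && cp <= 0xDFFF)
		return 0;		/* U+D800 to U+DFFF */
	if (cp > 0x10FFFF)
		return 0;		/* U+110000 and above */
	return 1;
}
```
.Ed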
.Pp
To summarize, the following is a complete list of bytes
that are invalid in all contexts:
.Pp
.Bl -tag -width 5n -offset 4n -compact
.It c0\(enc1
two-byte sequence that has to be encoded as a single byte
.It f5\(enf7
four-byte sequence beyond the Unicode range
.It f8\(enff
invalid sequence of five or more bytes
.El
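.Pp
Because the list above is short, it reduces to a one-line test in C.
The name
.Fn utf8_byte_never_valid
is an assumption made for this sketch.
.Bd -literal -offset indent
```c
/*
 * Return 1 if b can never appear in valid UTF-8, 0 otherwise.
 * utf8_byte_never_valid is a hypothetical name.
 */
int
utf8_byte_never_valid(unsigned char b)
{
	/* c0-c1: overlong start bytes; f5-ff: out of range or too long */
	return b == 0xC0 || b == 0xC1 || b >= 0xF5;
}
```
.Ed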
.Pp
The following is a complete list of invalid two-byte combinations
of the form 11...... 10...... that consist of two valid bytes:
.Pp
.Bl -tag -width 9n -offset 4n -compact
.It e080\(ene09f
three-byte sequence that has to be encoded as two bytes
.It eda0\(enedbf
start of a UTF-16 surrogate, which is not valid UTF-8
.It f080\(enf08f
four-byte sequence that has to be encoded as three bytes
.It f490\(enf4bf
four-byte sequence beyond the Unicode range
.El
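.Pp
The list of invalid bytes and the list of invalid two-byte
combinations are together sufficient to drive a validator: apart from
the bit patterns, only the byte following the start byte needs a
narrowed range.
The following C function is a minimal sketch under that reading, not
the implementation used by any particular library; the name
.Fn utf8_valid
is an assumption made for this example.
.Bd -literal -offset indent
```c
#include <stddef.h>

/*
 * Return 1 if the len bytes at s form valid UTF-8, 0 otherwise.
 * utf8_valid is a hypothetical name, not a libc function.
 */
int
utf8_valid(const unsigned char *s, size_t len)
{
	size_t i = 0, k, n;
	unsigned char lo, hi;

	while (i < len) {
		unsigned char b = s[i];

		/* Allowed range for the byte after the start byte. */
		lo = 0x80;
		hi = 0xBF;
		if (b <= 0x7F) {
			i++;			/* ASCII */
			continue;
		} else if (b >= 0xC2 && b <= 0xDF) {
			n = 2;
		} else if (b >= 0xE0 && b <= 0xEF) {
			n = 3;
			if (b == 0xE0)
				lo = 0xA0;	/* e080-e09f: overlong */
			if (b == 0xED)
				hi = 0x9F;	/* eda0-edbf: surrogate */
		} else if (b >= 0xF0 && b <= 0xF4) {
			n = 4;
			if (b == 0xF0)
				lo = 0x90;	/* f080-f08f: overlong */
			if (b == 0xF4)
				hi = 0x8F;	/* f490-f4bf: out of range */
		} else {
			/* stray continuation byte, c0-c1, or f5-ff */
			return 0;
		}
		if (len - i < n)
			return 0;		/* truncated sequence */
		if (s[i + 1] < lo || s[i + 1] > hi)
			return 0;
		/* Remaining bytes must match 10...... */
		for (k = 2; k < n; k++)
			if ((s[i + k] & 0xC0) != 0x80)
				return 0;
		i += n;
	}
	return 1;
}
```
.Ed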
.Sh SEE ALSO
.Xr locale 1 ,
.Xr ascii 7
.Sh STANDARDS
.Rs
.%A F. Yergeau
.%D November 2003
.%R RFC 3629
.%T UTF-8, a transformation format of ISO 10646
.Re
.Pp
.Lk https://www.unicode.org/versions/latest/ "The Unicode Standard"
.Pp
.Lk https://www.unicode.org/reports/tr44/ "The Unicode Character Database"