.\" $OpenBSD: utf8.7,v 1.7 2018/05/17 16:44:23 schwarze Exp $
.\"
.\" Copyright (c) 2017 Ted Unangst <tedu@openbsd.org>
.\"
.\" Permission to use, copy, modify, and distribute this software for any
.\" purpose with or without fee is hereby granted, provided that the above
.\" copyright notice and this permission notice appear in all copies.
.\"
.\" THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
.\" WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
.\" MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
.\" ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
.\" WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
.\" ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
.\" OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
.\"
.Dd $Mdocdate: May 17 2018 $
.Dt UTF8 7
.Os
.Sh NAME
.Nm utf8
.Nd UTF-8 text encoding
.Sh DESCRIPTION
UTF-8 is a multibyte character encoding for Unicode text.
It is the preferred format for non-ASCII text.
.Pp
Unicode codepoints are encoded as follows:
.Bl -tag -width Ds
.It U+0000 \(en U+007F:
One byte: 0....... (compatible with ASCII)
.It U+0080 \(en U+07FF:
Two bytes: 110..... 10......
.It U+0800 \(en U+D7FF and U+E000 \(en U+FFFF:
Three bytes: 1110.... 10...... 10......
.It U+10000 \(en U+10FFFF:
Four bytes: 11110... 10...... 10...... 10......
.El
.Pp
The bits shown as dots contain the codepoint represented as a binary
integer.
.Pp
Bytes starting with the bit pattern 11...... are called UTF-8 start
bytes, and those starting with 10...... are called UTF-8 continuation
bytes.
The number of leading 1 bits in a start byte indicates the total
number of bytes used to encode the codepoint, including the start
byte.
.Pp
Encodings using more bytes than required are invalid.
In particular, 11000000 and 11000001 are not valid start bytes,
the byte after 11100000 must be at least 10100000,
and the byte after 11110000 must be at least 10010000.
.Sh SEE ALSO
.Xr locale 1 ,
.Xr ascii 7
.Sh STANDARDS
.Rs
.%A F. Yergeau
.%D November 2003
.%R RFC 3629
.%T UTF-8, a transformation format of ISO 10646
.Re
.Pp
.Lk http://www.unicode.org/versions/latest/ "The Unicode Standard"
.Pp
.Lk http://www.unicode.org/reports/tr44/ "The Unicode Character Database"
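.Sh EXAMPLES
The following C sketch illustrates the encoding rules given in
.Sx DESCRIPTION .
It is a minimal illustration under stated assumptions, not a reference
implementation: the function names
.Fn utf8_encode
and
.Fn utf8_decode
are hypothetical and do not refer to any existing library interface.
The decoder rejects overlong forms by comparing the decoded value
with the smallest codepoint that requires the given length, which
should be equivalent to the byte-level restrictions on start bytes
and first continuation bytes.
It also rejects surrogates and values above U+10FFFF, since both fall
outside the listed ranges.
.Bd -literal
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of any system library.
 * Encode codepoint cp into out, which must hold at least 4 bytes.
 * Return the number of bytes written, or 0 if cp is a surrogate
 * or lies beyond U+10FFFF.
 */
size_t
utf8_encode(uint32_t cp, unsigned char *out)
{
	assert(out != NULL);
	if (cp <= 0x7F) {
		out[0] = cp;				/* 0....... */
		return 1;
	}
	if (cp <= 0x7FF) {
		out[0] = 0xC0 | (cp >> 6);		/* 110..... */
		out[1] = 0x80 | (cp & 0x3F);		/* 10...... */
		return 2;
	}
	if (cp <= 0xFFFF) {
		if (cp >= 0xD800 && cp <= 0xDFFF)
			return 0;			/* surrogate */
		out[0] = 0xE0 | (cp >> 12);		/* 1110.... */
		out[1] = 0x80 | ((cp >> 6) & 0x3F);
		out[2] = 0x80 | (cp & 0x3F);
		return 3;
	}
	if (cp <= 0x10FFFF) {
		out[0] = 0xF0 | (cp >> 18);		/* 11110... */
		out[1] = 0x80 | ((cp >> 12) & 0x3F);
		out[2] = 0x80 | ((cp >> 6) & 0x3F);
		out[3] = 0x80 | (cp & 0x3F);
		return 4;
	}
	return 0;
}

/*
 * Hypothetical helper, not part of any system library.
 * Decode one codepoint from the n bytes at s and store it in *cp.
 * Return the number of bytes consumed, or 0 if the input is
 * truncated, overlong, or otherwise invalid.
 */
size_t
utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
	/* Smallest codepoint that needs 1, 2, 3, or 4 bytes. */
	static const uint32_t min[] = { 0, 0, 0x80, 0x800, 0x10000 };
	size_t i, len;
	uint32_t c;

	assert(s != NULL && cp != NULL);
	if (n == 0)
		return 0;
	if (s[0] < 0x80) {
		len = 1;
		c = s[0];
	} else if ((s[0] & 0xE0) == 0xC0) {
		len = 2;
		c = s[0] & 0x1F;
	} else if ((s[0] & 0xF0) == 0xE0) {
		len = 3;
		c = s[0] & 0x0F;
	} else if ((s[0] & 0xF8) == 0xF0) {
		len = 4;
		c = s[0] & 0x07;
	} else
		return 0;	/* continuation byte or 11111... */
	if (n < len)
		return 0;	/* truncated sequence */
	for (i = 1; i < len; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return 0;	/* not 10...... */
		c = (c << 6) | (s[i] & 0x3F);
	}
	if (c < min[len] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
		return 0;	/* overlong, out of range, or surrogate */
	*cp = c;
	return len;
}
```
.Ed
.Pp
With these definitions, encoding U+20AC stores the three bytes
0xE2 0x82 0xAC, and decoding the two bytes 0xC0 0x80 fails because
they are an overlong form of U+0000.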