1\input texinfo          @c -*-texinfo-*-
2@comment %**start of header
3@setfilename libunistring.info
4@documentencoding UTF-8
5@settitle GNU libunistring
6@finalout
7@c Indices:
8@c   am = autoconf macro  @amindex
9@c   cp = concept         @cindex
10@c   fn = function        @findex
11@c   tp = type            @tindex
12@c Unused predefined indices:
13@c   ky = keystroke       @kindex
14@c   pg = program         @pindex
15@c   vr = variable        @vindex
16@defcodeindex am
17@syncodeindex am cp
18@syncodeindex fn cp
19@syncodeindex tp cp
20@ifclear texi2html
21@firstparagraphindent insert
22@end ifclear
23@c texi2html-1.76 does not support @arrow{}.
24@ifset texi2html
25@macro arrow{}
2627@end macro
28@end ifset
29@comment %**end of header
30
31@include version.texi
32
33@c Location of the POSIX specification on the web.
34@set POSIXURL http://pubs.opengroup.org/onlinepubs/9699919799
35
36@c Macro for referencing a POSIX header.
37@ifinfo
38@macro posixheader{header}
39@code{<\header\>}
40@end macro
41@end ifinfo
42@ifnotinfo
43@macro posixheader{header}
44@uref{@value{POSIXURL}/basedefs/\header\.html,,@code{<\header\>}}
45@end macro
46@end ifnotinfo
47
48@c Macro for referencing a POSIX function.
49@c We don't write it as func(), see section "GNU Manuals" of the
50@c GNU coding standards.
51@ifinfo
52@macro posixfunc{func}
53@code{\func\}
54@end macro
55@end ifinfo
56@ifnotinfo
57@macro posixfunc{func}
58@uref{@value{POSIXURL}/functions/\func\.html,,@code{\func\}}
59@end macro
60@end ifnotinfo
61
62@c Macro for referencing a normal function.
63@c We don't write it as func(), see section "GNU Manuals" of the
64@c GNU coding standards.
65@macro func{func}
66@code{\func\}
67@end macro
68
69@c Macro for an advisory ragged line break in TeX mode.
70@c Needed because there are long unbreakable pieces of text (such as URLs or
71@c formulas), TeX is too shy to move them to a new line. TeX considers only
72@c two choices: a line break in aligned mode (which it rejects due to aesthetic
73@c reasons) and writing into the margin. What we want in many cases is a line
74@c break without filling the first line. Like what @* delivers. But we want it
75@c only when needed, so that it disappears when unrelated changes in the same
76@c paragraph cause a line break in a nearby position. And we need it only in
77@c TeX mode. info and HTML modes are fine.
78@c This trick is from Karl Berry.
79@iftex
80@macro texnl
81@hfil@penalty9000@hfilneg
82@end macro
83@end iftex
84@ifnottex
85@macro texnl
86@end macro
87@end ifnottex
88
89@ifinfo
90@dircategory Software development
91@direntry
92* GNU libunistring: (libunistring).     Unicode string library.
93@end direntry
94@end ifinfo
95
96@ifinfo
97This manual is for GNU libunistring.
98
99@ignore
100@c This was: @copying but it triggers a makeinfo 4.13 bug
101Copyright (C) 2001-2018 Free Software Foundation, Inc.
102
103This manual is free documentation.  It is dually licensed under the
104GNU FDL and the GNU GPL.  This means that you can redistribute this
105manual under either of these two licenses, at your choice.
106
107This manual is covered by the GNU FDL.  Permission is granted to copy,
108distribute and/or modify this document under the terms of the
109GNU Free Documentation License (FDL), either version 1.2 of the
110License, or (at your option) any later version published by the
111Free Software Foundation (FSF); with no Invariant Sections, with no
112Front-Cover Text, and with no Back-Cover Texts.
113A copy of the license is included in @ref{GNU FDL}.
114
115This manual is covered by the GNU GPL.  You can redistribute it and/or
116modify it under the terms of the GNU General Public License (GPL), either
117version 3 of the License, or (at your option) any later version published
118by the Free Software Foundation (FSF).
119A copy of the license is included in @ref{GNU GPL}.
120@end ignore
121@end ifinfo
122
123@titlepage
124@title GNU libunistring, version @value{VERSION}
125@subtitle updated @value{UPDATED}
126@subtitle Edition @value{EDITION}, @value{UPDATED}
127@author Bruno Haible
128
129@ifnothtml
130@page
131@vskip 0pt plus 1filll
132@c @insertcopying
133Copyright (C) 2001-2018 Free Software Foundation, Inc.
134
135This manual is free documentation.  It is dually licensed under the
136GNU FDL and the GNU GPL.  This means that you can redistribute this
137manual under either of these two licenses, at your choice.
138
139This manual is covered by the GNU FDL.  Permission is granted to copy,
140distribute and/or modify this document under the terms of the
141GNU Free Documentation License (FDL), either version 1.2 of the
142License, or (at your option) any later version published by the
143Free Software Foundation (FSF); with no Invariant Sections, with no
144Front-Cover Text, and with no Back-Cover Texts.
145A copy of the license is included in @ref{GNU FDL}.
146
147This manual is covered by the GNU GPL.  You can redistribute it and/or
148modify it under the terms of the GNU General Public License (GPL), either
149version 3 of the License, or (at your option) any later version published
150by the Free Software Foundation (FSF).
151A copy of the license is included in @ref{GNU GPL}.
152@end ifnothtml
153@end titlepage
154
155@c Table of Contents
156@contents
157
158@ifnottex
159@node Top
160@top GNU libunistring
161@end ifnottex
162
163@menu
164* Introduction::                Who may need Unicode strings?
165* Conventions::                 Conventions used in this manual
166* unitypes.h::                  Elementary types
167* unistr.h::                    Elementary Unicode string functions
168* uniconv.h::                   Conversions between Unicode and encodings
169* unistdio.h::                  Output with Unicode strings
170* uniname.h::                   Names of Unicode characters
171* unictype.h::                  Unicode character classification and properties
172* uniwidth.h::                  Display width
173* unigbrk.h::                   Grapheme cluster breaking
174* uniwbrk.h::                   Word breaks in strings
175* unilbrk.h::                   Line breaking
176* uninorm.h::                   Normalization forms
177* unicase.h::                   Case mappings
178* uniregex.h::                  Regular expressions
179* Using the library::           How to link with the library and use it?
180* More functionality::          More advanced functionality
181* The wchar_t mess::            Why @code{wchar_t *} strings are useless
182* Licenses::                    Licenses
183
184* Index::                       General Index
185
186@detailmenu
187 --- The Detailed Node Listing ---
188
189Introduction
190
191* Unicode::                     What is Unicode?
192* Unicode and i18n::            Unicode and internationalization
193* Locale encodings::            What is a locale encoding?
194* In-memory representation::    How to represent strings in memory?
195* char * strings::              What to keep in mind with @code{char *} strings
196* Unicode strings::             How are Unicode strings represented?
197
198unistr.h
199
200* Elementary string checks::
201* Elementary string conversions::
202* Elementary string functions::
203* Elementary string functions with memory allocation::
204* Elementary string functions on NUL terminated strings::
205
206Elementary string functions
207
208* Iterating::
209* Creating Unicode strings::
210* Copying Unicode strings::
211* Comparing Unicode strings::
212* Searching for a character::
213* Counting characters::
214
215Elementary string functions on NUL terminated strings
216
217* Iterating over a NUL terminated Unicode string::
218* Length::
219* Copying a NUL terminated Unicode string::
220* Comparing NUL terminated Unicode strings::
221* Duplicating a NUL terminated Unicode string::
222* Searching for a character in a NUL terminated Unicode string::
223* Searching for a substring::
224* Tokenizing::
225
226unictype.h
227
228* General category::
229* Canonical combining class::
230* Bidi class::
231* Decimal digit value::
232* Digit value::
233* Numeric value::
234* Mirrored character::
235* Arabic shaping::
236* Properties::
237* Scripts::
238* Blocks::
239* ISO C and Java syntax::
240* Classifications like in ISO C::
241
242General category
243
244* Object oriented API::
245* Bit mask API::
246
247Properties
248
249* Properties as objects::
250* Properties as functions::
251
252unigbrk.h
253
254* Grapheme cluster breaks in a string::
255* Grapheme cluster break property::
256
257uniwbrk.h
258
259* Word breaks in a string::
260* Word break property::
261
262uninorm.h
263
264* Decomposition of characters::
265* Composition of characters::
266* Normalization of strings::
267* Normalizing comparisons::
268* Normalization of streams::
269
unicase.h
271
272* Case mappings of characters::
273* Case mappings of strings::
274* Case mappings of substrings::
275* Case insensitive comparison::
276* Case detection::
277
278Using the library
279
280* Installation::
281* Compiler options::
282* Include files::
283* Autoconf macro::
284* Reporting problems::
285
286Licenses
287
288* GNU GPL::                     GNU General Public License
289* GNU LGPL::                    GNU Lesser General Public License
290* GNU FDL::                     GNU Free Documentation License
291
292@end detailmenu
293@end menu
294
295@node Introduction
296@chapter Introduction
297
298This library provides functions for manipulating Unicode strings and
299for manipulating C strings according to the Unicode standard.
300
301It consists of the following parts:
302
303@table @code
304@item <unistr.h>
305elementary string functions
306@item <uniconv.h>
307conversion from/to legacy encodings
308@item <unistdio.h>
309formatted output to strings
310@item <uniname.h>
311character names
312@item <unictype.h>
313character classification and properties
314@item <uniwidth.h>
315string width when using nonproportional fonts
316@item <unigbrk.h>
317grapheme cluster breaks
318@item <uniwbrk.h>
319word breaks
320@item <unilbrk.h>
321line breaking algorithm
322@item <uninorm.h>
323normalization (composition and decomposition)
324@item <unicase.h>
325case folding
326@item <uniregex.h>
327regular expressions (not yet implemented)
328@end table
329
330@cindex use cases
331@cindex value, of libunistring
332libunistring is for you if your application involves non-trivial text
333processing, such as upper/lower case conversions, line breaking, operations
334on words, or more advanced analysis of text.  Text provided by the user can,
335in general, contain characters of all kinds of scripts.  The text processing
336functions provided by this library handle all scripts and all languages.
337
338libunistring is for you if your application already uses the ISO C / POSIX
339@posixheader{ctype.h}, @posixheader{wctype.h} functions and the text it
340operates on is provided by the user and can be in any language.
341
342libunistring is also for you if your application uses Unicode strings as
343internal in-memory representation.
344
345@menu
346* Unicode::                     What is Unicode?
347* Unicode and i18n::            Unicode and internationalization
348* Locale encodings::            What is a locale encoding?
349* In-memory representation::    How to represent strings in memory?
350* char * strings::              What to keep in mind with @code{char *} strings
351* Unicode strings::             How are Unicode strings represented?
352@end menu
353
354@node Unicode
355@section Unicode
356
357@cindex Unicode
358Unicode is a standardized repertoire of characters that contains characters
359from all scripts of the world, from Latin letters to Chinese ideographs
360and Babylonian cuneiform glyphs.  It also specifies how these characters
361are to be rendered on a screen or on paper, and how common text processing
362(word selection, line breaking, uppercasing of page titles etc.) is supposed
363to behave on Unicode text.
364
365Unicode also specifies three ways of storing sequences of Unicode
366characters in a computer whose basic unit of data is an 8-bit byte:
367@cindex UTF-8
368@cindex UTF-16
369@cindex UTF-32
370@cindex UCS-4
371@table @asis
372@item UTF-8
373Every character is represented as 1 to 4 bytes.
374@item UTF-16
375Every character is represented as 1 to 2 units of 16 bits.
376@item UTF-32, a.k.a@. UCS-4
377Every character is represented as 1 unit of 32 bits.
378@end table
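
As a rough illustration of the unit counts above, the following C sketch
computes how many units a single code point occupies in each encoding form.
The function names @code{utf8_units}, @code{utf16_units} and
@code{utf32_units} are hypothetical helpers invented for this example; they
are not part of libunistring.

```c
#include <assert.h>
#include <stdint.h>

/* Number of 8-bit units needed to store the code point uc in UTF-8.
   Assumes uc is in the Unicode range 0..0x10FFFF.  */
int utf8_units(uint32_t uc)
{
    assert(uc <= 0x10FFFF);
    if (uc < 0x80)
        return 1;
    if (uc < 0x800)
        return 2;
    if (uc < 0x10000)
        return 3;
    return 4;
}

/* Number of 16-bit units needed in UTF-16: code points beyond the
   Basic Multilingual Plane take a surrogate pair.  */
int utf16_units(uint32_t uc)
{
    assert(uc <= 0x10FFFF);
    return uc < 0x10000 ? 1 : 2;
}

/* Number of 32-bit units needed in UTF-32: always exactly one.  */
int utf32_units(uint32_t uc)
{
    assert(uc <= 0x10FFFF);
    return 1;
}
```

For instance, U+20AC (EURO SIGN) needs 3 units in UTF-8 but 1 unit in
UTF-16, while U+1F600 needs 4 units in UTF-8 and 2 units in UTF-16.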
379
380For encoding Unicode text in a file, UTF-8 is usually used.  For encoding
Unicode strings in memory for a program, any of the three encoding forms
can reasonably be used.
383
384Unicode is widely used on the web.  Prior to the use of Unicode, web pages
385were in many different encodings (ISO-8859-1 for English, French, Spanish,
386ISO-8859-2 for Polish, ISO-8859-7 for Greek, KOI8-R for Russian, GB2312 or
387BIG5 for Chinese, ISO-2022-JP-2 or EUC-JP or Shift_JIS for Japanese, and many
388many others).  It was next to impossible to create a document that contained
389Chinese and Polish text in the same document.  Due to the many encodings for
390Japanese, even the processing of pure Japanese text was error prone.
391
392References:
393@itemize @bullet
394@item
395The Unicode standard:@texnl{} @url{http://www.unicode.org/}
396@item
397Definition of UTF-8:@texnl{} @url{http://www.rfc-editor.org/rfc/rfc3629.txt}
398@item
399Definition of UTF-16:@texnl{} @url{http://www.rfc-editor.org/rfc/rfc2781.txt}
400@item
401Markus Kuhn's UTF-8 and Unicode FAQ:@texnl{}
402@url{http://www.cl.cam.ac.uk/~mgk25/unicode.html}
403@end itemize
404
405@node Unicode and i18n
406@section Unicode and Internationalization
407
408@cindex internationalization
409Internationalization is the process of changing the source code of a program
410so that it can meet the expectations of users in any culture, if culture
411specific data (translations, images etc.) are provided.
412
413Use of Unicode is not strictly required for internationalization, but it
414makes internationalization much easier, because operations that need to
415look at specific characters (like hyphenation, spell checking, or the
416automatic conversion of double-quotes to opening and closing double-quote
417characters) don't need to consider multiple possible encodings of the text.
418
419Use of Unicode also enables multilingualization: the ability of having text
420in multiple languages present in the same document or even in the same line
421of text.
422
423But use of Unicode is not everything.  Internationalization usually consists
424of four features:
425@itemize @bullet
426@item
427Use of Unicode where needed for text processing.  This is what this library
428is for.
429@item
Use of message catalogs for messages shown to the user.  This is what
431GNU gettext is about.
432@item
433Use of locale specific conventions for date and time formats, for numeric
434formatting, or for sorting of text.  This can be done adequately with the
435POSIX APIs and the implementation of locales in the GNU C library.
436@item
437In graphical user interfaces, adapting the GUI to the default text direction
438of the current locale (see
439@url{https://en.wikipedia.org/wiki/Right-to-left,right-to-left languages}).
440@end itemize
441
442@node Locale encodings
443@section Locale encodings
444
445@cindex locale
446A locale is a set of cultural conventions.  According to POSIX, for a program,
447at any moment, there is one locale being designated as the ``current locale''.
(Actually, POSIX also supports one locale per thread, but this feature is not
449yet universally implemented and not widely used.)
450@cindex locale categories
451The locale is partitioned into several aspects, called the ``categories''
of the locale.  The main aspects are:
453@itemize @bullet
454@item
455The character encoding and the character properties.  This is the
456@code{LC_CTYPE} category.
457@item
458The sorting rules for text.  This is the @code{LC_COLLATE} category.
459@item
460The language specific translations of messages.  This is the
461@code{LC_MESSAGES} category.
462@item
463The formatting rules for numbers, such as the decimal separator.  This is
464the @code{LC_NUMERIC} category.
465@item
466The formatting rules for amounts of money.  This is the @code{LC_MONETARY}
467category.
468@item
469The formatting of date and time.  This is the @code{LC_TIME} category.
470@end itemize
471
472@cindex locale encoding
473In particular, the @code{LC_CTYPE} category of the current locale determines
474the character encoding.  This is the encoding of @samp{char *} strings.
475We also call it the ``locale encoding''.  GNU libunistring has a function,
476@func{locale_charset}, that returns a standardized (platform independent)
477name for this encoding.
478
479All locale encodings used on glibc systems are essentially ASCII compatible:
480Most graphic ASCII characters have the same representation, as a single byte,
481in that encoding as in ASCII.
482
Among the possible locale encodings are UTF-8 and GB18030.  Both can
represent any Unicode character as a sequence of bytes.  UTF-8 is used in
485most of the world, whereas GB18030 is used in the People's Republic of China,
486because it is backward compatible with the GB2312 encoding that was used in
487this country earlier.
488
489The legacy locale encodings, ISO-8859-15 (which supplanted ISO-8859-1 in
490most of Europe), ISO-8859-2, KOI8-R, EUC-JP, etc., are still in use in
491some places, though.
492
493UTF-16 and UTF-32 are not used as locale encodings, because they are not
494ASCII compatible.
495
496@node In-memory representation
497@section Choice of in-memory representation of strings
498
499There are three ways of representing strings in memory of a running
500program.
501@itemize @bullet
502@item
503As @samp{char *} strings.  Such strings are represented in locale encoding.
504This approach is employed when not much text processing is done by the
505program.  When some Unicode aware processing is to be done, a string is
506converted to Unicode on the fly and back to locale encoding afterwards.
507@item
508As UTF-8 or UTF-16 or UTF-32 strings.  This implies that conversion from
509locale encoding to Unicode is performed on input, and in the opposite
510direction on output.  This approach is employed when the program does
511a significant amount of text processing, or when the program has multiple
512threads operating on the same data but in different locales.
513@item
514As @samp{wchar_t *}, a.k.a@. ``wide strings''.  This approach is misguided,
515see @ref{The wchar_t mess}.
516@end itemize
517
518Of course, a @samp{char *} string can, in some cases, be encoded in UTF-8.
519You will use the data type depending on what you can guarantee about how
520it's encoded: If a string is encoded in the locale encoding, or if you
521don't know how it's encoded, use @samp{char *}.  If, on the other hand,
522you can @emph{guarantee} that it is UTF-8 encoded, then you can use the
523UTF-8 string type, @code{uint8_t *}, for it.
524
525The five types @code{char *}, @code{uint8_t *}, @code{uint16_t *},
526@code{uint32_t *}, and @code{wchar_t *} are incompatible types at the C
527level.  Therefore, @samp{gcc -Wall} will produce a warning if, by mistake,
528your code contains a mismatch between these types.  In the context of
529using GNU libunistring, even a warning about a mismatch between
530@code{char *} and @code{uint8_t *} is a sign of a bug in your code
531that you should not try to silence through a cast.
532
533@node char * strings
534@section @samp{char *} strings
535
536@cindex C string functions
The classical C strings, with their C library support standardized by
538ISO C and POSIX, can be used in internationalized programs with some
539precautions.  The problem with this API is that many of the C library
540functions for strings don't work correctly on strings in locale
541encodings, leading to bugs that only people in some cultures of the
542world will experience.
543
544@cindex locale, multibyte
545The first problem with the C library API is the support of multibyte
546locales.  According to the locale encoding, in general, every character
547is represented by one or more bytes (up to 4 bytes in practice --- but
548use @code{MB_LEN_MAX} instead of the number 4 in the code).
549When every character is represented by only 1 byte, we speak of an
550``unibyte locale'', otherwise of a ``multibyte locale''.  It is important
551to realize that the majority of Unix installations nowadays use UTF-8
552or GB18030 as locale encoding; therefore, the majority of users are
553using multibyte locales.
554
555@cindex char, type
556The important fact to remember is:
557@cartouche
558@emph{A @samp{char} is a byte, not a character.}
559@end cartouche
560
561As a consequence:
562@itemize @bullet
563@item
564The @posixheader{ctype.h} API is useless in this context; it does not work in
565multibyte locales.
566@item
567The @posixfunc{strlen} function does not return the number of characters
568in a string.  Nor does it return the number of screen columns occupied
569by a string after it is output.  It merely returns the number of
570@emph{bytes} occupied by a string.
571@item
572Truncating a string, for example, with @posixfunc{strncpy}, can have the
573effect of truncating it in the middle of a multibyte character.  Such
574a string will, when output, have a garbled character at its end, often
575represented by a hollow box.
576@item
577@posixfunc{strchr} and @posixfunc{strrchr} do not work with multibyte strings
578if the locale encoding is GB18030 and the character to be searched is
579a digit.
580@item
581@posixfunc{strstr} does not work with multibyte strings if the locale encoding
582is different from UTF-8.
583@item
584@posixfunc{strcspn}, @posixfunc{strpbrk}, @posixfunc{strspn} cannot work
585correctly in multibyte locales: they assume the second argument is a list of
586single-byte characters.  Even in this simple case, they do not work with
587multibyte strings if the locale encoding is GB18030 and one of the
588characters to be searched is a digit.
589@item
590@posixfunc{strsep} and @posixfunc{strtok_r} do not work with multibyte strings
591unless all of the delimiter characters are ASCII characters < 0x30.
592@item
593The @posixfunc{strcasecmp}, @posixfunc{strncasecmp}, and @posixfunc{strcasestr}
594functions do not work with multibyte strings.
595@end itemize
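
The following C sketch makes the first few points concrete.  It assumes that
the string is valid UTF-8, which is only one of the possible locale
encodings, and the helper names @code{utf8_count_chars} and
@code{utf8_truncate_len} are hypothetical, invented for this illustration.
It is not how gnulib or libunistring implement these operations.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count the characters of a valid UTF-8 string by counting the bytes
   that are not continuation bytes (bit pattern 10xxxxxx).  */
size_t utf8_count_chars(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++)
        if (((unsigned char) *s & 0xC0) != 0x80)
            count++;
    return count;
}

/* Return the largest length <= max_bytes at which a valid UTF-8 string
   can be cut without splitting a multibyte character.  */
size_t utf8_truncate_len(const char *s, size_t max_bytes)
{
    size_t len;

    assert(s != NULL);
    len = strlen(s);
    if (len <= max_bytes)
        return len;
    len = max_bytes;
    /* Back up while the byte at the cut point is a continuation byte.  */
    while (len > 0 && ((unsigned char) s[len] & 0xC0) == 0x80)
        len--;
    return len;
}
```

For the 6-byte string @code{"na\xC3\xAF" "ve"}, @code{strlen} reports 6
while the string has only 5 characters, and cutting after 3 bytes would split
the two-byte character, so the sketch backs up to 2 bytes.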
596
597The workarounds can be found in GNU gnulib
598@url{http://www.gnu.org/software/gnulib/}.
599@itemize @bullet
600@item
601gnulib has modules @samp{mbchar}, @samp{mbiter}, @samp{mbuiter} that
represent multibyte characters and allow iterating across a multibyte
603string with the same ease as through a unibyte string.
604@item
605gnulib has functions @func{mbslen} and @func{mbswidth} that can be
606used instead of @posixfunc{strlen} when the number of characters or the
607number of screen columns of a string is requested.
608@item
609gnulib has functions @func{mbschr} and @func{mbsrrchr} that are
610like @posixfunc{strchr} and @posixfunc{strrchr}, but work in multibyte locales.
611@item
gnulib has a function @func{mbsstr} that is like @posixfunc{strstr}, but works
613in multibyte locales.
614@item
615gnulib has functions @func{mbscspn}, @func{mbspbrk}, @func{mbsspn}
616that are like @posixfunc{strcspn}, @posixfunc{strpbrk}, @posixfunc{strspn}, but
617work in multibyte locales.
618@item
619gnulib has functions @func{mbssep} and @func{mbstok_r} that are
620like @posixfunc{strsep} and @posixfunc{strtok_r} but work in multibyte locales.
621@item
622gnulib has functions @func{mbscasecmp}, @func{mbsncasecmp},
623@func{mbspcasecmp}, and @func{mbscasestr} that are like @posixfunc{strcasecmp},
624@posixfunc{strncasecmp}, and @posixfunc{strcasestr}, but
625work in multibyte locales.  Still, the function @code{ulc_casecmp} is
626preferable to these functions; see below.
627@end itemize
628
The second problem with the C library API is that it has some built-in
assumptions that are not valid in some languages:
630@itemize @bullet
631@item
632It assumes that there are only two forms of every character: uppercase
633and lowercase.  This is not true for Croatian, where the character
634@sc{LETTER DZ WITH CARON} comes in three forms:
635@sc{LATIN CAPITAL LETTER DZ WITH CARON} (DZ),
636@sc{LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON} (Dz),
637@sc{LATIN SMALL LETTER DZ WITH CARON} (dz).
638@item
639It assumes that uppercasing of 1 character leads to 1 character.  This
640is not true for German, where the @sc{LATIN SMALL LETTER SHARP S}, when
641uppercased, becomes @samp{SS}.
642@item
643It assumes that there is 1:1 mapping between uppercase and lowercase forms.
644This is not true for the Greek sigma: @sc{GREEK CAPITAL LETTER SIGMA} is
645the uppercase of both @sc{GREEK SMALL LETTER SIGMA} and
646@sc{GREEK SMALL LETTER FINAL SIGMA}.
647@item
648It assumes that the upper/lowercase mappings are position independent.
649This is not true for the Greek sigma and the Lithuanian i.
650@end itemize
651
652The correct way to deal with this problem is
653@enumerate
654@item
655to provide functions for titlecasing, as well as for upper- and
656lowercasing,
657@item
to view case transformations as functions that operate on strings,
659rather than on characters.
660@end enumerate
661
This is implemented in this library, through the functions declared in
@code{<unicase.h>}, see @ref{unicase.h}.
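
To see why case transformations have to operate on strings, consider the
following toy C sketch.  It uppercases ASCII letters and the German sharp s
in a UTF-8 string, and nothing else.  The function @code{toy_upper} is a
hypothetical helper written for this illustration only; it covers none of
the other cases listed above and is not a substitute for the functions in
@code{<unicase.h>}.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy string-level uppercasing for a valid UTF-8 string.
   Handles only ASCII letters and U+00DF LATIN SMALL LETTER SHARP S
   (bytes 0xC3 0x9F in UTF-8), which becomes the two characters "SS".
   Returns a freshly allocated NUL terminated string, or NULL if
   memory allocation fails.  */
char *toy_upper(const char *s)
{
    size_t n, i, j = 0;
    char *out;

    assert(s != NULL);
    n = strlen(s);
    /* The sharp s occupies 2 bytes and so does "SS", so the byte
       length is unchanged although the character count grows.  */
    out = malloc(n + 1);
    if (out == NULL)
        return NULL;
    for (i = 0; i < n; i++) {
        unsigned char c = (unsigned char) s[i];

        if (c == 0xC3 && (unsigned char) s[i + 1] == 0x9F) {
            out[j++] = 'S';
            out[j++] = 'S';
            i++;                /* skip the second byte of the sharp s */
        } else if (c >= 'a' && c <= 'z') {
            out[j++] = (char) (c - 'a' + 'A');
        } else {
            out[j++] = (char) c;
        }
    }
    out[j] = '\0';
    return out;
}
```

Applied to the 6-character word @code{"stra\xC3\x9F" "e"}, the sketch yields
the 7-character word @code{"STRASSE"}: one input character has turned into
two output characters, which a character-to-character mapping cannot express.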
663
664@node Unicode strings
665@section Unicode strings
666
667libunistring supports Unicode strings in three representations:
668@cindex UTF-8, strings
669@cindex UTF-16, strings
670@cindex UTF-32, strings
671@itemize @bullet
672@item
673UTF-8 strings, through the type @samp{uint8_t *}.  The units are bytes
674(@code{uint8_t}).
675@item
UTF-16 strings, through the type @samp{uint16_t *}.  The units are 16-bit
677memory words (@code{uint16_t}).
678@item
679UTF-32 strings, through the type @samp{uint32_t *}.  The units are 32-bit
680memory words (@code{uint32_t}).
681@end itemize
682
683As with C strings, there are two variants:
684@itemize @bullet
685@item
686Unicode strings with a terminating NUL character are represented as
687a pointer to the first unit of the string.  There is a unit containing
688a 0 value at the end.  It is considered part of the string for all
689memory allocation purposes, but is not considered part of the string
690for all other logical purposes.
691@item
692Unicode strings where embedded NUL characters are allowed.  These
693are represented by a pointer to the first unit and the number of units
694(not bytes!) of the string.  In this setting, there is no trailing
695zero-valued unit used as ``end marker''.
696@end itemize
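
The distinction between units and bytes can be sketched in C as follows.
The helpers @code{units16_strlen} and @code{units16_alloc_bytes} are
hypothetical names chosen for this example; the library's own function for
the length of a NUL terminated UTF-16 string is @code{u16_strlen}.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of 16-bit units before the terminating zero unit.
   The terminator itself is not counted.  */
size_t units16_strlen(const uint16_t *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[n] != 0)
        n++;
    return n;
}

/* Number of bytes to allocate for a copy of a NUL terminated UTF-16
   string.  For allocation purposes the terminating unit is counted.  */
size_t units16_alloc_bytes(const uint16_t *s)
{
    return (units16_strlen(s) + 1) * sizeof (uint16_t);
}
```

For the units @code{0x0041 0xD83D 0xDE00 0}, which encode two characters,
the length is 3 units, and a copy needs 8 bytes including the terminator.
The same 3 units passed as a pair @code{(@var{s}, 3)} need no terminator.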
697
698@node Conventions
699@chapter Conventions
700
701This chapter explains conventions valid throughout the libunistring library.
702
703@cindex argument conventions
704Variables of type @code{char *} denote C strings in locale encoding.
705See @ref{Locale encodings}.
706
707Variables of type @code{uint8_t *} denote UTF-8 strings.  Their units
708are bytes.
709
710Variables of type @code{uint16_t *} denote UTF-16 strings, without byte
711order mark.  Their units are 2-byte words.
712
713Variables of type @code{uint32_t *} denote UTF-32 strings, without byte
714order mark.  Their units are 4-byte words.
715
716Argument pairs @code{(@var{s}, @var{n})} denote a string
717@code{@var{s}[0..@var{n}-1]} with exactly @var{n} units.
718
719All functions with prefix @samp{ulc_} operate on C strings in locale
720encoding.
721
722All functions with prefix @samp{u8_} operate on UTF-8 strings.
723
724All functions with prefix @samp{u16_} operate on UTF-16 strings.
725
726All functions with prefix @samp{u32_} operate on UTF-32 strings.
727
728For every function with prefix @samp{u8_}, operating on UTF-8 strings,
729there is also a corresponding function with prefix @samp{u16_},
730operating on UTF-16 strings, and a corresponding function with prefix
731@samp{u32_}, operating on UTF-32 strings.  Their description is
732analogous; in this documentation we describe only the function that
733operates on UTF-8 strings, for brevity.
734
735A declaration with a variable @var{n} denotes the three concrete
736declarations with @var{n} = 8, @var{n} = 16, @var{n} = 32.
737
738All parameters starting with @samp{str} and the parameters of
739functions starting with @code{u8_str}/@code{u16_str}/@code{u32_str}
740denote a NUL terminated string.
741
742@cindex return value conventions
743Error values are always returned through the @code{errno} variable,
744usually with a return value that indicates the presence of an error
(NULL for functions that return a pointer, or -1 for functions that
746return an @code{int}).
747
748Functions returning a string result take a
749@code{(@var{resultbuf}, @var{lengthp})}
750argument pair.  If @var{resultbuf} is not NULL and the result fits
751into @code{*@var{lengthp}} units, it is put in @var{resultbuf}, and
752@var{resultbuf} is returned.  Otherwise, a freshly allocated string
753is returned.  In both cases, @code{*@var{lengthp}} is set to the
754length (number of units) of the returned string.  In case of error,
755NULL is returned and @code{errno} is set.
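
The @code{(@var{resultbuf}, @var{lengthp})} convention can be sketched with
a small C function.  The function @code{sketch_twice}, which returns its
input repeated twice, is a hypothetical example written only to show the
calling pattern; it is not a libunistring function, and it assumes that
@var{resultbuf} does not overlap the input.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Return the units s[0..n-1] repeated twice, following the
   (resultbuf, lengthp) convention: use resultbuf if it is non-NULL and
   large enough, otherwise return freshly allocated memory.  */
uint8_t *sketch_twice(const uint8_t *s, size_t n,
                      uint8_t *resultbuf, size_t *lengthp)
{
    size_t needed;
    uint8_t *result;

    assert(s != NULL || n == 0);
    assert(lengthp != NULL);
    assert(n <= SIZE_MAX / 2);
    needed = 2 * n;

    if (resultbuf != NULL && needed <= *lengthp) {
        result = resultbuf;
    } else {
        result = malloc(needed > 0 ? needed : 1);
        if (result == NULL) {
            errno = ENOMEM;
            return NULL;
        }
    }
    if (n > 0) {
        memcpy(result, s, n);
        memcpy(result + n, s, n);
    }
    *lengthp = needed;          /* length in units, in both cases */
    return result;
}
```

A caller sets @code{*@var{lengthp}} to the capacity of its buffer before the
call, and afterwards frees the result only if it differs from
@var{resultbuf}.  This lets short results live in a stack buffer while long
results still work.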
756
757@include unitypes.texi
758@include unistr.texi
759@include uniconv.texi
760@include unistdio.texi
761@include uniname.texi
762@include unictype.texi
763@include uniwidth.texi
764@include unigbrk.texi
765@include uniwbrk.texi
766@include unilbrk.texi
767@include uninorm.texi
768@include unicase.texi
769@include uniregex.texi
770
771@node Using the library
772@chapter Using the library
773
774This chapter explains some practical considerations, regarding the
775installation and compiler options that are needed in order to use this
776library.
777
778@menu
779* Installation::
780* Compiler options::
781* Include files::
782* Autoconf macro::
783* Reporting problems::
784@end menu
785
786@node Installation
787@section Installation
788
789@cindex dependencies
790Before you can use the library, it must be installed.  First, you have to
791make sure all dependencies are installed.  They are listed in the file
792@file{DEPENDENCIES}.
793
794@cindex installation
795Then you can proceed to build and install the library, as described in the
796file @file{INSTALL}.  For installation on Windows systems, please refer to
797the file @file{INSTALL.windows}.
798
799@node Compiler options
800@section Compiler options
801
802Let's denote as @code{LIBUNISTRING_PREFIX} the value of the @samp{--prefix}
803option that you passed to @code{configure} while installing this package.
804If you didn't pass any @samp{--prefix} option, then the package is installed
805in @file{/usr/local}.
806
807Let's denote as @code{LIBUNISTRING_INCLUDEDIR} the directory where the
include files were installed.  This is usually the same as
@code{$@{LIBUNISTRING_PREFIX@}/include}, unless you passed an
@samp{--includedir} option to @code{configure}, in which case it is the
value of that option.
812
813Let's further denote as @code{LIBUNISTRING_LIBDIR} the directory where
814the library itself was installed.  This is the value that you passed
815with the @samp{--libdir} option to @code{configure}, or otherwise the
816same as @code{$@{LIBUNISTRING_PREFIX@}/lib}.  Recall that when building
817in 64-bit mode on a 64-bit GNU/Linux system that supports executables
818in either 64-bit mode or 32-bit mode, you should have used the option
819@code{--libdir=$@{LIBUNISTRING_PREFIX@}/lib64}.
820
821@cindex compiler options
822So that the compiler finds the include files, you have to pass it the
823option @code{-I$@{LIBUNISTRING_INCLUDEDIR@}}.
824
825So that the compiler finds the library during its linking pass, you have
826to pass it the options @code{-L$@{LIBUNISTRING_LIBDIR@} -lunistring}.
827On some systems, in some configurations, you also have to pass options
828needed for linking with @code{libiconv}.  The autoconf macro
829@code{gl_LIBUNISTRING} (see @ref{Autoconf macro}) deals with this
830particularity.
831
832@node Include files
833@section Include files
834
Most of the include files have been presented in the introduction
(@pxref{Introduction}) and in the subsequent detailed chapters.
837
838Another include file is @code{<unistring/version.h>}. It contains the
839version number of the libunistring library.
840
841@deftypevr Macro int _LIBUNISTRING_VERSION
842This constant contains the version of libunistring that is being used
843at compile time.  It encodes the major and minor parts of the version
844number only.  These parts are encoded in the form @code{(major<<8) + minor}.
845@end deftypevr
846
847@deftypevr Constant int _libunistring_version
848This constant contains the version of libunistring that is being used
849at run time.  It encodes the major and minor parts of the version
850number only.  These parts are encoded in the form @code{(major<<8) + minor}.
851@end deftypevr
852
853It is possible that @code{_libunistring_version} is greater than
854@code{_LIBUNISTRING_VERSION}.  This can happen when you use
855@code{libunistring} as a shared library, and a newer, binary
856backward-compatible version has been installed after your program
857that uses @code{libunistring} was installed.
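
As an illustration of the encoding @code{(major<<8) + minor}, here is a
minimal sketch of how such a value could be decoded and how the two
version values could be compared.  The helper names
@code{version_major}, @code{version_minor} and @code{run_time_is_newer}
are hypothetical and not part of libunistring, and the sketch assumes
that the minor part fits in 8 bits.

@example
/* Hypothetical helper: major part of a value encoded as
   (major<<8) + minor.  */
static int
version_major (int version)
@{
  return version >> 8;
@}

/* Hypothetical helper: minor part, assuming it fits in 8 bits.  */
static int
version_minor (int version)
@{
  return version & 0xff;
@}

/* Hypothetical helper: nonzero if the run time version is greater
   than the compile time version.  */
static int
run_time_is_newer (int compile_time_version, int run_time_version)
@{
  return run_time_version > compile_time_version;
@}
@end example

A program that includes @code{<unistring/version.h>} could pass
@code{_LIBUNISTRING_VERSION} and @code{_libunistring_version} as the
two arguments of @code{run_time_is_newer}.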
858
859@node Autoconf macro
860@section Autoconf macro
861
862@cindex autoconf macro
863GNU Gnulib provides an autoconf macro that tests for the availability
864of @code{libunistring}.  It is contained in the Gnulib module
865@samp{libunistring}, see@texnl{}
866@url{http://www.gnu.org/software/gnulib/MODULES.html#module=libunistring}.
867
868@amindex gl_LIBUNISTRING
The macro is called @code{gl_LIBUNISTRING}.  It searches for an installed
libunistring.  If one is found, it sets and @code{AC_SUBST}s
@code{HAVE_LIBUNISTRING=yes} and the @code{LIBUNISTRING} and
@code{LTLIBUNISTRING} variables, augments the @code{CPPFLAGS} variable,
and defines the C macro @code{HAVE_LIBUNISTRING} to 1.  Otherwise, it sets
and @code{AC_SUBST}s @code{HAVE_LIBUNISTRING=no} and sets
@code{LIBUNISTRING} and @code{LTLIBUNISTRING} to empty.
876
877The complexities that @code{gl_LIBUNISTRING} deals with are the following:
878
879@itemize @bullet
880@item
881On some operating systems, in some configurations, libunistring depends
882on @code{libiconv}, and the options for linking with libiconv must be
883mentioned explicitly on the link command line.
884
885@item
886GNU @code{libunistring}, if installed, is not necessarily already in the
887search path (@code{CPPFLAGS} for the include file search path,
888@code{LDFLAGS} for the library search path).
889
890@item
891GNU @code{libunistring}, if installed, is not necessarily already in the
892run time library search path.  To avoid the need for setting an environment
893variable like @code{LD_LIBRARY_PATH}, the macro adds the appropriate
894run time search path options to the @code{LIBUNISTRING} variable.  This works
895on most systems.
896@end itemize
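
On the C side, a program can test the @code{HAVE_LIBUNISTRING} macro
that @code{gl_LIBUNISTRING} defines.  The following is only a sketch:
it assumes that the macro reaches the compiler through a
@file{config.h} file, and the helper name @code{utf8_unit_count} is
hypothetical.  When the library is not available, the sketch falls back
to @code{strlen}, which also counts the bytes before the terminating
NUL.

@example
#ifdef HAVE_CONFIG_H
# include <config.h>  /* assumed to define HAVE_LIBUNISTRING */
#endif

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if HAVE_LIBUNISTRING
# include <unistr.h>
#endif

/* Hypothetical helper: number of UTF-8 code units in S, not
   counting the terminating NUL.  */
static size_t
utf8_unit_count (const char *s)
@{
#if HAVE_LIBUNISTRING
  return u8_strlen ((const uint8_t *) s);
#else
  return strlen (s);
#endif
@}
@end example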
897
898@node Reporting problems
899@section Reporting problems
900
901@cindex bug reports
902@cindex bug tracker
903@cindex mailing list
If you encounter any problem, please don't hesitate to send a detailed
bug report to the @code{bug-libunistring@@gnu.org} mailing list.
Alternatively, you can use the bug tracker on the project page
@url{https://savannah.gnu.org/projects/libunistring}.
908
909Please always include the version number of this library, and a short
910description of your operating system and compilation environment with
911corresponding version numbers.
912
For problems that appear while building or installing @code{libunistring}
and for which you cannot find a remedy in the @file{INSTALL} file, please
also include the options that you passed to the @samp{configure} script.
916
917@node More functionality
918@chapter More advanced functionality
919
920@cindex bidirectional reordering
921For bidirectional reordering of strings, we recommend the GNU FriBidi library:
922@url{http://www.fribidi.org/}.
923
924@cindex rendering
925For the rendering of Unicode strings outside of the context of a given toolkit
926(KDE/Qt or GNOME/Gtk), we recommend the Pango library:
927@url{http://www.pango.org/}.
928
929@include wchar_t.texi
930
931@node Licenses
932@appendix Licenses
933@cindex Licenses
934
935The files of this package are covered by the licenses indicated in each
936particular file or directory.  Here is a summary:
937
938@itemize @bullet
939@item
The @code{libunistring} library and its header files are dual-licensed under
"the GNU LGPLv3+ or the GNU GPLv2".  This means that you can use it under any of
942@itemize @bullet
943@item @minus{}
944the terms of the GNU Lesser General Public License (LGPL) version 3 or
945(at your option) any later version, or
946@item @minus{}
947the terms of the GNU General Public License (GPL) version 2, or
948@item @minus{}
949the same dual license "the GNU LGPLv3+ or the GNU GPLv2".
950@end itemize
You can find the GNU LGPL version 3 in @ref{GNU LGPL}.  This license is
based on the GNU GPL version 3, see @ref{GNU GPL}.
953@*
954You can find the GNU GPL version 2 at
955@url{https://www.gnu.org/licenses/old-licenses/gpl-2.0.html}.
956@*
957Note: This dual license makes it possible for the @code{libunistring} library
958to be used by packages under GPLv2 or GPLv2+ licenses, in particular. See
959the table in @url{https://www.gnu.org/licenses/gpl-faq.html#AllCompatibility}.
960
961
962@item
963This manual is free documentation.  It is dually licensed under the
964GNU FDL and the GNU GPL.  This means that you can redistribute this
965manual under either of these two licenses, at your choice.
966@*
967This manual is covered by the GNU FDL.  Permission is granted to copy,
968distribute and/or modify this document under the terms of the
969GNU Free Documentation License (FDL), either version 1.2 of the
970License, or (at your option) any later version published by the
971Free Software Foundation (FSF); with no Invariant Sections, with no
972Front-Cover Text, and with no Back-Cover Texts.
973A copy of the license is included in @ref{GNU FDL}.
974@*
975This manual is covered by the GNU GPL.  You can redistribute it and/or
976modify it under the terms of the GNU General Public License (GPL), either
977version 3 of the License, or (at your option) any later version published
978by the Free Software Foundation (FSF).
979A copy of the license is included in @ref{GNU GPL}.
980@end itemize
981
982@menu
983* GNU GPL::                     GNU General Public License
984* GNU LGPL::                    GNU Lesser General Public License
985* GNU FDL::                     GNU Free Documentation License
986@end menu
987
988@page
989@node GNU GPL
990@appendixsec GNU GENERAL PUBLIC LICENSE
991@cindex GPL, GNU General Public License
992@cindex License, GNU GPL
993@include gpl.texi
994@page
995@node GNU LGPL
996@appendixsec GNU LESSER GENERAL PUBLIC LICENSE
997@cindex LGPL, GNU Lesser General Public License
998@cindex License, GNU LGPL
999@include lgpl.texi
1000@page
1001@node GNU FDL
1002@appendixsec GNU Free Documentation License
1003@cindex FDL, GNU Free Documentation License
1004@cindex License, GNU FDL
1005@include fdl.texi
1006
1007@node Index
1008@unnumbered Index
1009
1010@printindex cp
1011
1012@bye
1013
1014@c Local Variables:
1015@c indent-tabs-mode: nil
1016@c whitespace-check-buffer-indent: nil
1017@c End:
1018