\input texinfo @c -*-texinfo-*-
@comment %**start of header
@setfilename libunistring.info
@documentencoding UTF-8
@settitle GNU libunistring
@finalout
@c Indices:
@c am = autoconf macro @amindex
@c cp = concept @cindex
@c fn = function @findex
@c tp = type @tindex
@c Unused predefined indices:
@c ky = keystroke @kindex
@c pg = program @pindex
@c vr = variable @vindex
@defcodeindex am
@syncodeindex am cp
@syncodeindex fn cp
@syncodeindex tp cp
@ifclear texi2html
@firstparagraphindent insert
@end ifclear
@c texi2html-1.76 does not support @arrow{}.
@ifset texi2html
@macro arrow{}
→
@end macro
@end ifset
@comment %**end of header

@include version.texi

@c Location of the POSIX specification on the web.
@set POSIXURL http://pubs.opengroup.org/onlinepubs/9699919799

@c Macro for referencing a POSIX header.
@ifinfo
@macro posixheader{header}
@code{<\header\>}
@end macro
@end ifinfo
@ifnotinfo
@macro posixheader{header}
@uref{@value{POSIXURL}/basedefs/\header\.html,,@code{<\header\>}}
@end macro
@end ifnotinfo

@c Macro for referencing a POSIX function.
@c We don't write it as func(), see section "GNU Manuals" of the
@c GNU coding standards.
@ifinfo
@macro posixfunc{func}
@code{\func\}
@end macro
@end ifinfo
@ifnotinfo
@macro posixfunc{func}
@uref{@value{POSIXURL}/functions/\func\.html,,@code{\func\}}
@end macro
@end ifnotinfo

@c Macro for referencing a normal function.
@c We don't write it as func(), see section "GNU Manuals" of the
@c GNU coding standards.
@macro func{func}
@code{\func\}
@end macro

@c Macro for an advisory ragged line break in TeX mode.
@c Needed because there are long unbreakable pieces of text (such as URLs or
@c formulas), TeX is too shy to move them to a new line. TeX considers only
@c two choices: a line break in aligned mode (which it rejects due to aesthetic
@c reasons) and writing into the margin.
@c What we want in many cases is a line
@c break without filling the first line. Like what @* delivers. But we want it
@c only when needed, so that it disappears when unrelated changes in the same
@c paragraph cause a line break in a nearby position. And we need it only in
@c TeX mode. info and HTML modes are fine.
@c This trick is from Karl Berry.
@iftex
@macro texnl
@hfil@penalty9000@hfilneg
@end macro
@end iftex
@ifnottex
@macro texnl
@end macro
@end ifnottex

@ifinfo
@dircategory Software development
@direntry
* GNU libunistring: (libunistring). Unicode string library.
@end direntry
@end ifinfo

@ifinfo
This manual is for GNU libunistring.

@ignore
@c This was: @copying but it triggers a makeinfo 4.13 bug
Copyright (C) 2001-2018 Free Software Foundation, Inc.

This manual is free documentation. It is dually licensed under the
GNU FDL and the GNU GPL. This means that you can redistribute this
manual under either of these two licenses, at your choice.

This manual is covered by the GNU FDL. Permission is granted to copy,
distribute and/or modify this document under the terms of the
GNU Free Documentation License (FDL), either version 1.2 of the
License, or (at your option) any later version published by the
Free Software Foundation (FSF); with no Invariant Sections, with no
Front-Cover Text, and with no Back-Cover Texts.
A copy of the license is included in @ref{GNU FDL}.

This manual is covered by the GNU GPL. You can redistribute it and/or
modify it under the terms of the GNU General Public License (GPL), either
version 3 of the License, or (at your option) any later version published
by the Free Software Foundation (FSF).
A copy of the license is included in @ref{GNU GPL}.
@end ignore
@end ifinfo

@titlepage
@title GNU libunistring, version @value{VERSION}
@subtitle updated @value{UPDATED}
@subtitle Edition @value{EDITION}, @value{UPDATED}
@author Bruno Haible

@ifnothtml
@page
@vskip 0pt plus 1filll
@c @insertcopying
Copyright (C) 2001-2018 Free Software Foundation, Inc.

This manual is free documentation. It is dually licensed under the
GNU FDL and the GNU GPL. This means that you can redistribute this
manual under either of these two licenses, at your choice.

This manual is covered by the GNU FDL. Permission is granted to copy,
distribute and/or modify this document under the terms of the
GNU Free Documentation License (FDL), either version 1.2 of the
License, or (at your option) any later version published by the
Free Software Foundation (FSF); with no Invariant Sections, with no
Front-Cover Text, and with no Back-Cover Texts.
A copy of the license is included in @ref{GNU FDL}.

This manual is covered by the GNU GPL. You can redistribute it and/or
modify it under the terms of the GNU General Public License (GPL), either
version 3 of the License, or (at your option) any later version published
by the Free Software Foundation (FSF).
A copy of the license is included in @ref{GNU GPL}.
@end ifnothtml
@end titlepage

@c Table of Contents
@contents

@ifnottex
@node Top
@top GNU libunistring
@end ifnottex

@menu
* Introduction::                Who may need Unicode strings?
* Conventions::                 Conventions used in this manual
* unitypes.h::                  Elementary types
* unistr.h::                    Elementary Unicode string functions
* uniconv.h::                   Conversions between Unicode and encodings
* unistdio.h::                  Output with Unicode strings
* uniname.h::                   Names of Unicode characters
* unictype.h::                  Unicode character classification and properties
* uniwidth.h::                  Display width
* unigbrk.h::                   Grapheme cluster breaking
* uniwbrk.h::                   Word breaks in strings
* unilbrk.h::                   Line breaking
* uninorm.h::                   Normalization forms
* unicase.h::                   Case mappings
* uniregex.h::                  Regular expressions
* Using the library::           How to link with the library and use it?
* More functionality::          More advanced functionality
* The wchar_t mess::            Why @code{wchar_t *} strings are useless
* Licenses::                    Licenses

* Index::                       General Index

@detailmenu
 --- The Detailed Node Listing ---

Introduction

* Unicode::                     What is Unicode?
* Unicode and i18n::            Unicode and internationalization
* Locale encodings::            What is a locale encoding?
* In-memory representation::    How to represent strings in memory?
* char * strings::              What to keep in mind with @code{char *} strings
* Unicode strings::             How are Unicode strings represented?

unistr.h

* Elementary string checks::
* Elementary string conversions::
* Elementary string functions::
* Elementary string functions with memory allocation::
* Elementary string functions on NUL terminated strings::

Elementary string functions

* Iterating::
* Creating Unicode strings::
* Copying Unicode strings::
* Comparing Unicode strings::
* Searching for a character::
* Counting characters::

Elementary string functions on NUL terminated strings

* Iterating over a NUL terminated Unicode string::
* Length::
* Copying a NUL terminated Unicode string::
* Comparing NUL terminated Unicode strings::
* Duplicating a NUL terminated Unicode string::
* Searching for a character in a NUL terminated Unicode string::
* Searching for a substring::
* Tokenizing::

unictype.h

* General category::
* Canonical combining class::
* Bidi class::
* Decimal digit value::
* Digit value::
* Numeric value::
* Mirrored character::
* Arabic shaping::
* Properties::
* Scripts::
* Blocks::
* ISO C and Java syntax::
* Classifications like in ISO C::

General category

* Object oriented API::
* Bit mask API::

Properties

* Properties as objects::
* Properties as functions::

unigbrk.h

* Grapheme cluster breaks in a string::
* Grapheme cluster break property::

uniwbrk.h

* Word breaks in a string::
* Word break property::

uninorm.h

* Decomposition of characters::
* Composition of characters::
* Normalization of strings::
* Normalizing comparisons::
* Normalization of streams::

unicase.h

* Case mappings of characters::
* Case mappings of strings::
* Case mappings of substrings::
* Case insensitive comparison::
* Case detection::

Using the library

* Installation::
* Compiler options::
* Include files::
* Autoconf macro::
* Reporting problems::

Licenses

* GNU GPL::                     GNU General Public License
* GNU LGPL::                    GNU Lesser General Public License
* GNU FDL::                     GNU Free Documentation License

@end detailmenu
@end menu

@node Introduction
@chapter Introduction

This library provides functions for manipulating Unicode strings and
for manipulating C strings according to the Unicode standard.

It consists of the following parts:

@table @code
@item <unistr.h>
elementary string functions
@item <uniconv.h>
conversion from/to legacy encodings
@item <unistdio.h>
formatted output to strings
@item <uniname.h>
character names
@item <unictype.h>
character classification and properties
@item <uniwidth.h>
string width when using nonproportional fonts
@item <unigbrk.h>
grapheme cluster breaks
@item <uniwbrk.h>
word breaks
@item <unilbrk.h>
line breaking algorithm
@item <uninorm.h>
normalization (composition and decomposition)
@item <unicase.h>
case folding
@item <uniregex.h>
regular expressions (not yet implemented)
@end table

@cindex use cases
@cindex value, of libunistring
libunistring is for you if your application involves non-trivial text
processing, such as upper/lower case conversions, line breaking, operations
on words, or more advanced analysis of text. Text provided by the user can,
in general, contain characters of all kinds of scripts. The text processing
functions provided by this library handle all scripts and all languages.

libunistring is for you if your application already uses the ISO C / POSIX
@posixheader{ctype.h}, @posixheader{wctype.h} functions and the text it
operates on is provided by the user and can be in any language.

libunistring is also for you if your application uses Unicode strings as
internal in-memory representation.

@menu
* Unicode::                     What is Unicode?
* Unicode and i18n::            Unicode and internationalization
* Locale encodings::            What is a locale encoding?
* In-memory representation::    How to represent strings in memory?
* char * strings::              What to keep in mind with @code{char *} strings
* Unicode strings::             How are Unicode strings represented?
@end menu

@node Unicode
@section Unicode

@cindex Unicode
Unicode is a standardized repertoire of characters that contains characters
from all scripts of the world, from Latin letters to Chinese ideographs
and Babylonian cuneiform glyphs. It also specifies how these characters
are to be rendered on a screen or on paper, and how common text processing
(word selection, line breaking, uppercasing of page titles etc.) is supposed
to behave on Unicode text.

Unicode also specifies three ways of storing sequences of Unicode
characters in a computer whose basic unit of data is an 8-bit byte:
@cindex UTF-8
@cindex UTF-16
@cindex UTF-32
@cindex UCS-4
@table @asis
@item UTF-8
Every character is represented as 1 to 4 bytes.
@item UTF-16
Every character is represented as 1 to 2 units of 16 bits.
@item UTF-32, a.k.a@. UCS-4
Every character is represented as 1 unit of 32 bits.
@end table

For encoding Unicode text in a file, UTF-8 is usually used. For encoding
Unicode strings in memory for a program, any of the three encoding forms
can reasonably be used.

Unicode is widely used on the web. Prior to the use of Unicode, web pages
were in many different encodings (ISO-8859-1 for English, French, Spanish,
ISO-8859-2 for Polish, ISO-8859-7 for Greek, KOI8-R for Russian, GB2312 or
BIG5 for Chinese, ISO-2022-JP-2 or EUC-JP or Shift_JIS for Japanese, and many
many others). It was next to impossible to create a single document that
contained both Chinese and Polish text.
Due to the many encodings for
Japanese, even the processing of pure Japanese text was error prone.

References:
@itemize @bullet
@item
The Unicode standard:@texnl{} @url{http://www.unicode.org/}
@item
Definition of UTF-8:@texnl{} @url{http://www.rfc-editor.org/rfc/rfc3629.txt}
@item
Definition of UTF-16:@texnl{} @url{http://www.rfc-editor.org/rfc/rfc2781.txt}
@item
Markus Kuhn's UTF-8 and Unicode FAQ:@texnl{}
@url{http://www.cl.cam.ac.uk/~mgk25/unicode.html}
@end itemize

@node Unicode and i18n
@section Unicode and Internationalization

@cindex internationalization
Internationalization is the process of changing the source code of a program
so that it can meet the expectations of users in any culture, if culture
specific data (translations, images etc.) are provided.

Use of Unicode is not strictly required for internationalization, but it
makes internationalization much easier, because operations that need to
look at specific characters (like hyphenation, spell checking, or the
automatic conversion of double-quotes to opening and closing double-quote
characters) don't need to consider multiple possible encodings of the text.

Use of Unicode also enables multilingualization: the ability of having text
in multiple languages present in the same document or even in the same line
of text.

But use of Unicode is not everything. Internationalization usually consists
of four features:
@itemize @bullet
@item
Use of Unicode where needed for text processing. This is what this library
is for.
@item
Use of message catalogs for messages shown to the user. This is what
GNU gettext is about.
@item
Use of locale specific conventions for date and time formats, for numeric
formatting, or for sorting of text. This can be done adequately with the
POSIX APIs and the implementation of locales in the GNU C library.
@item
In graphical user interfaces, adapting the GUI to the default text direction
of the current locale (see
@url{https://en.wikipedia.org/wiki/Right-to-left,right-to-left languages}).
@end itemize

@node Locale encodings
@section Locale encodings

@cindex locale
A locale is a set of cultural conventions. According to POSIX, for a program,
at any moment, there is one locale being designated as the ``current locale''.
(Actually, POSIX also supports one locale per thread, but this feature is not
yet universally implemented and not widely used.)
@cindex locale categories
The locale is partitioned into several aspects, called the ``categories''
of the locale. The main aspects are:
@itemize @bullet
@item
The character encoding and the character properties. This is the
@code{LC_CTYPE} category.
@item
The sorting rules for text. This is the @code{LC_COLLATE} category.
@item
The language specific translations of messages. This is the
@code{LC_MESSAGES} category.
@item
The formatting rules for numbers, such as the decimal separator. This is
the @code{LC_NUMERIC} category.
@item
The formatting rules for amounts of money. This is the @code{LC_MONETARY}
category.
@item
The formatting of date and time. This is the @code{LC_TIME} category.
@end itemize

@cindex locale encoding
In particular, the @code{LC_CTYPE} category of the current locale determines
the character encoding. This is the encoding of @samp{char *} strings.
We also call it the ``locale encoding''. GNU libunistring has a function,
@func{locale_charset}, that returns a standardized (platform independent)
name for this encoding.

All locale encodings used on glibc systems are essentially ASCII compatible:
most graphic ASCII characters have the same representation, as a single byte,
in that encoding as in ASCII.
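
As an illustration of how a program can ask for its locale encoding, here
is a minimal sketch. It deliberately uses the POSIX function
@code{nl_langinfo} rather than @func{locale_charset}, so that it depends
only on the C library; this is a substitute technique, not the
libunistring API. The helper name @code{current_codeset} is hypothetical.

@verbatim
```c
#include <langinfo.h>
#include <locale.h>

/* Hypothetical helper, for illustration only: return the name of the
   encoding of the current LC_CTYPE locale, as reported by POSIX
   nl_langinfo.  The spelling of this name is platform dependent.  */
static const char *
current_codeset (void)
{
  /* Adopt the LC_CTYPE locale selected by the environment
     (LC_ALL, LC_CTYPE, LANG).  If this fails, the program simply
     stays in the "C" locale.  */
  setlocale (LC_CTYPE, "");
  return nl_langinfo (CODESET);
}
```
@end verbatim

When the environment selects a UTF-8 locale, the returned name is
typically @samp{UTF-8}; in other locales, and on other platforms, the
spelling may differ, which is the problem a standardized name avoids.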

Among the possible locale encodings are UTF-8 and GB18030. Both can
represent any Unicode character as a sequence of bytes. UTF-8 is used in
most of the world, whereas GB18030 is used in the People's Republic of China,
because it is backward compatible with the GB2312 encoding that was used
there earlier.

The legacy locale encodings, ISO-8859-15 (which supplanted ISO-8859-1 in
most of Europe), ISO-8859-2, KOI8-R, EUC-JP, etc., are still in use in
some places, though.

UTF-16 and UTF-32 are not used as locale encodings, because they are not
ASCII compatible.

@node In-memory representation
@section Choice of in-memory representation of strings

There are three ways of representing strings in the memory of a running
program.
@itemize @bullet
@item
As @samp{char *} strings. Such strings are represented in locale encoding.
This approach is employed when not much text processing is done by the
program. When some Unicode aware processing is to be done, a string is
converted to Unicode on the fly and back to locale encoding afterwards.
@item
As UTF-8 or UTF-16 or UTF-32 strings. This implies that conversion from
locale encoding to Unicode is performed on input, and in the opposite
direction on output. This approach is employed when the program does
a significant amount of text processing, or when the program has multiple
threads operating on the same data but in different locales.
@item
As @samp{wchar_t *}, a.k.a@. ``wide strings''. This approach is misguided;
see @ref{The wchar_t mess}.
@end itemize

Of course, a @samp{char *} string can, in some cases, be encoded in UTF-8.
Choose the data type according to what you can guarantee about the
encoding: if a string is encoded in the locale encoding, or if you
don't know how it's encoded, use @samp{char *}.
If, on the other hand,
you can @emph{guarantee} that it is UTF-8 encoded, then you can use the
UTF-8 string type, @code{uint8_t *}, for it.

The five types @code{char *}, @code{uint8_t *}, @code{uint16_t *},
@code{uint32_t *}, and @code{wchar_t *} are incompatible types at the C
level. Therefore, @samp{gcc -Wall} will produce a warning if, by mistake,
your code contains a mismatch between these types. In the context of
using GNU libunistring, even a warning about a mismatch between
@code{char *} and @code{uint8_t *} is a sign of a bug in your code
that you should not try to silence through a cast.

@node char * strings
@section @samp{char *} strings

@cindex C string functions
The classical C strings, with their C library support standardized by
ISO C and POSIX, can be used in internationalized programs with some
precautions. The problem with this API is that many of the C library
functions for strings don't work correctly on strings in locale
encodings, leading to bugs that only people in some cultures of the
world will experience.

@cindex locale, multibyte
The first problem with the C library API is the support of multibyte
locales. Depending on the locale encoding, in general, every character
is represented by one or more bytes (up to 4 bytes in practice, but
use @code{MB_LEN_MAX} instead of the number 4 in the code).
When every character is represented by only 1 byte, we speak of a
``unibyte locale'', otherwise of a ``multibyte locale''. It is important
to realize that the majority of Unix installations nowadays use UTF-8
or GB18030 as locale encoding; therefore, the majority of users are
using multibyte locales.
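
The difference between bytes and characters in a multibyte encoding can
be sketched in a few lines of plain C. The sketch assumes that the input
is valid UTF-8, which is only one possible locale encoding, and the helper
name @code{count_utf8_chars} is hypothetical: it belongs neither to the
C library nor to libunistring.

@verbatim
```c
#include <stddef.h>

/* Hypothetical helper, for illustration only: count the characters in
   a NUL terminated string that is assumed to be valid UTF-8.  Every byte
   that is not a continuation byte (bit pattern 10xxxxxx) starts a new
   character.  */
static size_t
count_utf8_chars (const char *s)
{
  size_t count = 0;
  for (; *s != '\0'; s++)
    if (((unsigned char) *s & 0xC0) != 0x80)
      count++;
  return count;
}
```
@end verbatim

For the string @code{"na\xc3\xafve"}, in which the two bytes
@code{\xc3\xaf} encode a single accented letter, @posixfunc{strlen}
returns 6 whereas @code{count_utf8_chars} returns 5. The sketch does not
validate its input and says nothing about screen columns.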

@cindex char, type
The important fact to remember is:
@cartouche
@emph{A @samp{char} is a byte, not a character.}
@end cartouche

As a consequence:
@itemize @bullet
@item
The @posixheader{ctype.h} API is useless in this context; it does not work in
multibyte locales.
@item
The @posixfunc{strlen} function does not return the number of characters
in a string. Nor does it return the number of screen columns occupied
by a string after it is output. It merely returns the number of
@emph{bytes} occupied by a string.
@item
Truncating a string, for example, with @posixfunc{strncpy}, can have the
effect of truncating it in the middle of a multibyte character. Such
a string will, when output, have a garbled character at its end, often
represented by a hollow box.
@item
@posixfunc{strchr} and @posixfunc{strrchr} do not work with multibyte strings
if the locale encoding is GB18030 and the character to be searched is
a digit.
@item
@posixfunc{strstr} does not work with multibyte strings if the locale encoding
is different from UTF-8.
@item
@posixfunc{strcspn}, @posixfunc{strpbrk}, @posixfunc{strspn} cannot work
correctly in multibyte locales: they assume the second argument is a list of
single-byte characters. Even in this simple case, they do not work with
multibyte strings if the locale encoding is GB18030 and one of the
characters to be searched is a digit.
@item
@posixfunc{strsep} and @posixfunc{strtok_r} do not work with multibyte strings
unless all of the delimiter characters are ASCII characters < 0x30.
@item
The @posixfunc{strcasecmp}, @posixfunc{strncasecmp}, and @posixfunc{strcasestr}
functions do not work with multibyte strings.
@end itemize

The workarounds can be found in GNU gnulib
@url{http://www.gnu.org/software/gnulib/}.
@itemize @bullet
@item
gnulib has modules @samp{mbchar}, @samp{mbiter}, @samp{mbuiter} that
represent multibyte characters and allow iterating across a multibyte
string with the same ease as through a unibyte string.
@item
gnulib has functions @func{mbslen} and @func{mbswidth} that can be
used instead of @posixfunc{strlen} when the number of characters or the
number of screen columns of a string is requested.
@item
gnulib has functions @func{mbschr} and @func{mbsrrchr} that are
like @posixfunc{strchr} and @posixfunc{strrchr}, but work in multibyte locales.
@item
gnulib has a function @func{mbsstr} that is like @posixfunc{strstr}, but works
in multibyte locales.
@item
gnulib has functions @func{mbscspn}, @func{mbspbrk}, @func{mbsspn}
that are like @posixfunc{strcspn}, @posixfunc{strpbrk}, @posixfunc{strspn}, but
work in multibyte locales.
@item
gnulib has functions @func{mbssep} and @func{mbstok_r} that are
like @posixfunc{strsep} and @posixfunc{strtok_r} but work in multibyte locales.
@item
gnulib has functions @func{mbscasecmp}, @func{mbsncasecmp},
@func{mbspcasecmp}, and @func{mbscasestr} that are like @posixfunc{strcasecmp},
@posixfunc{strncasecmp}, and @posixfunc{strcasestr}, but
work in multibyte locales. Still, the function @code{ulc_casecmp} is
preferable to these functions; see below.
@end itemize

The second problem with the C library API is that it has some built-in
assumptions that are not valid in some languages:
@itemize @bullet
@item
It assumes that there are only two forms of every character: uppercase
and lowercase. This is not true for Croatian, where the character
@sc{LETTER DZ WITH CARON} comes in three forms:
@sc{LATIN CAPITAL LETTER DZ WITH CARON} (DZ),
@sc{LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON} (Dz),
@sc{LATIN SMALL LETTER DZ WITH CARON} (dz).
@item
It assumes that uppercasing of 1 character leads to 1 character.
This is not true for German, where the
@sc{LATIN SMALL LETTER SHARP S}, when uppercased, becomes @samp{SS}.
@item
It assumes that there is a 1:1 mapping between uppercase and lowercase forms.
This is not true for the Greek sigma: @sc{GREEK CAPITAL LETTER SIGMA} is
the uppercase of both @sc{GREEK SMALL LETTER SIGMA} and
@sc{GREEK SMALL LETTER FINAL SIGMA}.
@item
It assumes that the upper/lowercase mappings are position independent.
This is not true for the Greek sigma and the Lithuanian i.
@end itemize

The correct way to deal with this problem is
@enumerate
@item
to provide functions for titlecasing, as well as for upper- and
lowercasing,
@item
to view case transformations as functions that operate on strings,
rather than on characters.
@end enumerate

This is implemented in this library, through the functions declared in
@code{<unicase.h>}, see @ref{unicase.h}.

@node Unicode strings
@section Unicode strings

libunistring supports Unicode strings in three representations:
@cindex UTF-8, strings
@cindex UTF-16, strings
@cindex UTF-32, strings
@itemize @bullet
@item
UTF-8 strings, through the type @samp{uint8_t *}. The units are bytes
(@code{uint8_t}).
@item
UTF-16 strings, through the type @samp{uint16_t *}. The units are 16-bit
memory words (@code{uint16_t}).
@item
UTF-32 strings, through the type @samp{uint32_t *}. The units are 32-bit
memory words (@code{uint32_t}).
@end itemize

As with C strings, there are two variants:
@itemize @bullet
@item
Unicode strings with a terminating NUL character are represented as
a pointer to the first unit of the string. There is a unit containing
a 0 value at the end. It is considered part of the string for all
memory allocation purposes, but is not considered part of the string
for all other logical purposes.
@item
Unicode strings where embedded NUL characters are allowed.
These are represented by a pointer to the first unit and the number of units
(not bytes!) of the string. In this setting, there is no trailing
zero-valued unit used as ``end marker''.
@end itemize

@node Conventions
@chapter Conventions

This chapter explains conventions valid throughout the libunistring library.

@cindex argument conventions
Variables of type @code{char *} denote C strings in locale encoding.
See @ref{Locale encodings}.

Variables of type @code{uint8_t *} denote UTF-8 strings. Their units
are bytes.

Variables of type @code{uint16_t *} denote UTF-16 strings, without byte
order mark. Their units are 2-byte words.

Variables of type @code{uint32_t *} denote UTF-32 strings, without byte
order mark. Their units are 4-byte words.

Argument pairs @code{(@var{s}, @var{n})} denote a string
@code{@var{s}[0..@var{n}-1]} with exactly @var{n} units.

All functions with prefix @samp{ulc_} operate on C strings in locale
encoding.

All functions with prefix @samp{u8_} operate on UTF-8 strings.

All functions with prefix @samp{u16_} operate on UTF-16 strings.

All functions with prefix @samp{u32_} operate on UTF-32 strings.

For every function with prefix @samp{u8_}, operating on UTF-8 strings,
there is also a corresponding function with prefix @samp{u16_},
operating on UTF-16 strings, and a corresponding function with prefix
@samp{u32_}, operating on UTF-32 strings. Their description is
analogous; in this documentation we describe only the function that
operates on UTF-8 strings, for brevity.

A declaration with a variable @var{n} denotes the three concrete
declarations with @var{n} = 8, @var{n} = 16, @var{n} = 32.

All parameters starting with @samp{str} and the parameters of
functions starting with @code{u8_str}/@code{u16_str}/@code{u32_str}
denote a NUL terminated string.
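
The distinction between units, bytes, and characters in these conventions
can be made concrete with a small sketch in plain C. It does not call
libunistring; the helper @code{utf16_units} and the array @code{sample}
are hypothetical names introduced only for this illustration.

@verbatim
```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only (not a libunistring
   function): number of 16-bit units that precede the terminating
   zero-valued unit of a NUL terminated UTF-16 string.  */
static size_t
utf16_units (const uint16_t *str)
{
  size_t n = 0;
  while (str[n] != 0)
    n++;
  return n;
}

/* U+0048 (the letter H) followed by U+1F600, which needs a surrogate
   pair in UTF-16, followed by the terminating NUL unit.  */
static const uint16_t sample[] = { 0x0048, 0xD83D, 0xDE00, 0 };
```
@end verbatim

Here @code{utf16_units (sample)} is 3, so the argument pair for this
string is @code{(sample, 3)}: two characters stored in three units. The
array itself occupies four units, that is 8 bytes, because the
terminating unit counts for memory allocation purposes.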

@cindex return value conventions
Error values are always returned through the @code{errno} variable,
usually with a return value that indicates the presence of an error
(NULL for functions that return a pointer, or -1 for functions that
return an @code{int}).

Functions returning a string result take a
@code{(@var{resultbuf}, @var{lengthp})}
argument pair. If @var{resultbuf} is not NULL and the result fits
into @code{*@var{lengthp}} units, it is put in @var{resultbuf}, and
@var{resultbuf} is returned. Otherwise, a freshly allocated string
is returned. In both cases, @code{*@var{lengthp}} is set to the
length (number of units) of the returned string. In case of error,
NULL is returned and @code{errno} is set.

@include unitypes.texi
@include unistr.texi
@include uniconv.texi
@include unistdio.texi
@include uniname.texi
@include unictype.texi
@include uniwidth.texi
@include unigbrk.texi
@include uniwbrk.texi
@include unilbrk.texi
@include uninorm.texi
@include unicase.texi
@include uniregex.texi

@node Using the library
@chapter Using the library

This chapter explains some practical considerations, regarding the
installation and compiler options that are needed in order to use this
library.

@menu
* Installation::
* Compiler options::
* Include files::
* Autoconf macro::
* Reporting problems::
@end menu

@node Installation
@section Installation

@cindex dependencies
Before you can use the library, it must be installed. First, you have to
make sure all dependencies are installed. They are listed in the file
@file{DEPENDENCIES}.

@cindex installation
Then you can proceed to build and install the library, as described in the
file @file{INSTALL}. For installation on Windows systems, please refer to
the file @file{INSTALL.windows}.

@node Compiler options
@section Compiler options

Let's denote as @code{LIBUNISTRING_PREFIX} the value of the @samp{--prefix}
option that you passed to @code{configure} while installing this package.
If you didn't pass any @samp{--prefix} option, then the package is installed
in @file{/usr/local}.

Let's denote as @code{LIBUNISTRING_INCLUDEDIR} the directory where the
include files were installed. This is usually the same as
@code{$@{LIBUNISTRING_PREFIX@}/include}, unless you passed an
@samp{--includedir} option to @code{configure}, in which case it is the
value of that option.

Let's further denote as @code{LIBUNISTRING_LIBDIR} the directory where
the library itself was installed. This is the value that you passed
with the @samp{--libdir} option to @code{configure}, or otherwise the
same as @code{$@{LIBUNISTRING_PREFIX@}/lib}. Recall that when building
in 64-bit mode on a 64-bit GNU/Linux system that supports executables
in either 64-bit mode or 32-bit mode, you should have used the option
@code{--libdir=$@{LIBUNISTRING_PREFIX@}/lib64}.

@cindex compiler options
So that the compiler finds the include files, you have to pass it the
option @code{-I$@{LIBUNISTRING_INCLUDEDIR@}}.

So that the compiler finds the library during its linking pass, you have
to pass it the options @code{-L$@{LIBUNISTRING_LIBDIR@} -lunistring}.
On some systems, in some configurations, you also have to pass options
needed for linking with @code{libiconv}. The autoconf macro
@code{gl_LIBUNISTRING} (see @ref{Autoconf macro}) deals with this
particularity.

@node Include files
@section Include files

Most of the include files have been presented in the introduction, see
@ref{Introduction}, and subsequent detailed chapters.

Another include file is @code{<unistring/version.h>}. It contains the
version number of the libunistring library.

@deftypevr Macro int _LIBUNISTRING_VERSION
This constant contains the version of libunistring that is being used
at compile time. It encodes the major and minor parts of the version
number only. These parts are encoded in the form @code{(major<<8) + minor}.
@end deftypevr

@deftypevr Constant int _libunistring_version
This constant contains the version of libunistring that is being used
at run time. It encodes the major and minor parts of the version
number only. These parts are encoded in the form @code{(major<<8) + minor}.
@end deftypevr

It is possible that @code{_libunistring_version} is greater than
@code{_LIBUNISTRING_VERSION}. This can happen when you use
@code{libunistring} as a shared library, and a newer, binary
backward-compatible version has been installed after your program
that uses @code{libunistring} was installed.

@node Autoconf macro
@section Autoconf macro

@cindex autoconf macro
GNU Gnulib provides an autoconf macro that tests for the availability
of @code{libunistring}. It is contained in the Gnulib module
@samp{libunistring}, see@texnl{}
@url{http://www.gnu.org/software/gnulib/MODULES.html#module=libunistring}.

@amindex gl_LIBUNISTRING
The macro is called @code{gl_LIBUNISTRING}. It searches for an installed
libunistring. If found, it sets and AC_SUBSTs @code{HAVE_LIBUNISTRING=yes}
and the @code{LIBUNISTRING} and @code{LTLIBUNISTRING} variables and augments
the @code{CPPFLAGS} variable, and defines the C macro
@code{HAVE_LIBUNISTRING} to 1. Otherwise, it sets and AC_SUBSTs
@code{HAVE_LIBUNISTRING=no} and @code{LIBUNISTRING} and @code{LTLIBUNISTRING}
to empty.

The complexities that @code{gl_LIBUNISTRING} deals with are the following:

@itemize @bullet
@item
On some operating systems, in some configurations, libunistring depends
on @code{libiconv}, and the options for linking with libiconv must be
mentioned explicitly on the link command line.

@item
GNU @code{libunistring}, if installed, is not necessarily already in the
search path (@code{CPPFLAGS} for the include file search path,
@code{LDFLAGS} for the library search path).

@item
GNU @code{libunistring}, if installed, is not necessarily already in the
run time library search path. To avoid the need for setting an environment
variable like @code{LD_LIBRARY_PATH}, the macro adds the appropriate
run time search path options to the @code{LIBUNISTRING} variable. This works
on most systems.
@end itemize

@node Reporting problems
@section Reporting problems

@cindex bug reports
@cindex bug tracker
@cindex mailing list
If you encounter any problem, please don't hesitate to send a detailed
bug report to the @code{bug-libunistring@@gnu.org} mailing list.
Alternatively, you can use the bug tracker at the project page
@url{https://savannah.gnu.org/projects/libunistring}.

Please always include the version number of this library, and a short
description of your operating system and compilation environment with
corresponding version numbers.

For problems that appear while building and installing @code{libunistring},
and for which you don't find the remedy in the @file{INSTALL} file, please
include a description of the options that you passed to the
@samp{configure} script.

@node More functionality
@chapter More advanced functionality

@cindex bidirectional reordering
For bidirectional reordering of strings, we recommend the GNU FriBidi library:
@url{http://www.fribidi.org/}.

@cindex rendering
For the rendering of Unicode strings outside of the context of a given toolkit
(KDE/Qt or GNOME/Gtk), we recommend the Pango library:
@url{http://www.pango.org/}.

@include wchar_t.texi

@node Licenses
@appendix Licenses
@cindex Licenses

The files of this package are covered by the licenses indicated in each
particular file or directory. Here is a summary:

@itemize @bullet
@item
The @code{libunistring} library and its header files are dual-licensed under
"the GNU LGPLv3+ or the GNU GPLv2". This means, you can use it under either
@itemize @bullet
@item @minus{}
the terms of the GNU Lesser General Public License (LGPL) version 3 or
(at your option) any later version, or
@item @minus{}
the terms of the GNU General Public License (GPL) version 2, or
@item @minus{}
the same dual license "the GNU LGPLv3+ or the GNU GPLv2".
@end itemize
You find the GNU LGPL version 3 in @ref{GNU LGPL}. This license is
based on the GNU GPL version 3, see @ref{GNU GPL}.
@*
You can find the GNU GPL version 2 at
@url{https://www.gnu.org/licenses/old-licenses/gpl-2.0.html}.
@*
Note: This dual license makes it possible for the @code{libunistring} library
to be used by packages under GPLv2 or GPLv2+ licenses, in particular. See
the table in @url{https://www.gnu.org/licenses/gpl-faq.html#AllCompatibility}.


@item
This manual is free documentation. It is dually licensed under the
GNU FDL and the GNU GPL. This means that you can redistribute this
manual under either of these two licenses, at your choice.
@*
This manual is covered by the GNU FDL.
Permission is granted to copy,
distribute and/or modify this document under the terms of the
GNU Free Documentation License (FDL), either version 1.2 of the
License, or (at your option) any later version published by the
Free Software Foundation (FSF); with no Invariant Sections, with no
Front-Cover Text, and with no Back-Cover Texts.
A copy of the license is included in @ref{GNU FDL}.
@*
This manual is covered by the GNU GPL. You can redistribute it and/or
modify it under the terms of the GNU General Public License (GPL), either
version 3 of the License, or (at your option) any later version published
by the Free Software Foundation (FSF).
A copy of the license is included in @ref{GNU GPL}.
@end itemize

@menu
* GNU GPL::                     GNU General Public License
* GNU LGPL::                    GNU Lesser General Public License
* GNU FDL::                     GNU Free Documentation License
@end menu

@page
@node GNU GPL
@appendixsec GNU GENERAL PUBLIC LICENSE
@cindex GPL, GNU General Public License
@cindex License, GNU GPL
@include gpl.texi
@page
@node GNU LGPL
@appendixsec GNU LESSER GENERAL PUBLIC LICENSE
@cindex LGPL, GNU Lesser General Public License
@cindex License, GNU LGPL
@include lgpl.texi
@page
@node GNU FDL
@appendixsec GNU Free Documentation License
@cindex FDL, GNU Free Documentation License
@cindex License, GNU FDL
@include fdl.texi

@node Index
@unnumbered Index

@printindex cp

@bye

@c Local Variables:
@c indent-tabs-mode: nil
@c whitespace-check-buffer-indent: nil
@c End: