@node uninorm.h
@chapter Normalization forms (composition and decomposition) @code{<uninorm.h>}

@cindex normal forms
@cindex normalizing
This include file defines functions for transforming Unicode strings to one
of the four normal forms, known as NFC, NFD, NFKC, NFKD. These
transformations involve decomposition and, for NFC and NFKC, composition
of Unicode characters.

@menu
* Decomposition of characters::
* Composition of characters::
* Normalization of strings::
* Normalizing comparisons::
* Normalization of streams::
@end menu

@node Decomposition of characters
@section Decomposition of Unicode characters

@cindex decomposing
The following enumerated values are the possible types of decomposition of a
Unicode character.

@deftypevr Constant int UC_DECOMP_CANONICAL
Denotes canonical decomposition.
@end deftypevr

@deftypevr Constant int UC_DECOMP_FONT
UCD marker: @code{<font>}.
Denotes a font variant (e.g@. a blackletter form).
@end deftypevr

@deftypevr Constant int UC_DECOMP_NOBREAK
UCD marker: @code{<noBreak>}.
Denotes a no-break version of a space or hyphen.
@end deftypevr

@deftypevr Constant int UC_DECOMP_INITIAL
UCD marker: @code{<initial>}.
Denotes an initial presentation form (Arabic).
@end deftypevr

@deftypevr Constant int UC_DECOMP_MEDIAL
UCD marker: @code{<medial>}.
Denotes a medial presentation form (Arabic).
@end deftypevr

@deftypevr Constant int UC_DECOMP_FINAL
UCD marker: @code{<final>}.
Denotes a final presentation form (Arabic).
@end deftypevr

@deftypevr Constant int UC_DECOMP_ISOLATED
UCD marker: @code{<isolated>}.
Denotes an isolated presentation form (Arabic).
@end deftypevr

@deftypevr Constant int UC_DECOMP_CIRCLE
UCD marker: @code{<circle>}.
Denotes an encircled form.
@end deftypevr

@deftypevr Constant int UC_DECOMP_SUPER
UCD marker: @code{<super>}.
Denotes a superscript form.
@end deftypevr

@deftypevr Constant int UC_DECOMP_SUB
UCD marker: @code{<sub>}.
Denotes a subscript form.
@end deftypevr

@deftypevr Constant int UC_DECOMP_VERTICAL
UCD marker: @code{<vertical>}.
Denotes a vertical layout presentation form.
@end deftypevr

@deftypevr Constant int UC_DECOMP_WIDE
UCD marker: @code{<wide>}.
Denotes a wide (or zenkaku) compatibility character.
@end deftypevr

@deftypevr Constant int UC_DECOMP_NARROW
UCD marker: @code{<narrow>}.
Denotes a narrow (or hankaku) compatibility character.
@end deftypevr

@deftypevr Constant int UC_DECOMP_SMALL
UCD marker: @code{<small>}.
Denotes a small variant form (CNS compatibility).
@end deftypevr

@deftypevr Constant int UC_DECOMP_SQUARE
UCD marker: @code{<square>}.
Denotes a CJK squared font variant.
@end deftypevr

@deftypevr Constant int UC_DECOMP_FRACTION
UCD marker: @code{<fraction>}.
Denotes a vulgar fraction form.
@end deftypevr

@deftypevr Constant int UC_DECOMP_COMPAT
UCD marker: @code{<compat>}.
Denotes an otherwise unspecified compatibility character.
@end deftypevr

The following constant denotes the maximum size of decomposition of a single
Unicode character.

@deftypevr Macro {unsigned int} UC_DECOMPOSITION_MAX_LENGTH
This macro expands to a constant that is the required size of buffer passed to
the @code{uc_decomposition} and @code{uc_canonical_decomposition} functions.
@end deftypevr

The following functions decompose a Unicode character.

@deftypefun int uc_decomposition (ucs4_t @var{uc}, int *@var{decomp_tag}, ucs4_t *@var{decomposition})
Returns the character decomposition mapping of the Unicode character @var{uc}.
@var{decomposition} must point to an array of at least
@code{UC_DECOMPOSITION_MAX_LENGTH} @code{ucs4_t} elements.

When a decomposition exists, @code{@var{decomposition}[0..@var{n}-1]} and
@code{*@var{decomp_tag}} are filled and @var{n} is returned. Otherwise -1 is
returned.
@end deftypefun

@deftypefun int uc_canonical_decomposition (ucs4_t @var{uc}, ucs4_t *@var{decomposition})
Returns the canonical character decomposition mapping of the Unicode character
@var{uc}. @var{decomposition} must point to an array of at least
@code{UC_DECOMPOSITION_MAX_LENGTH} @code{ucs4_t} elements.

When a decomposition exists, @code{@var{decomposition}[0..@var{n}-1]} is filled
and @var{n} is returned. Otherwise -1 is returned.

Note: This function returns the (simple) ``canonical decomposition'' of
@var{uc}. If you want the ``full canonical decomposition'' of @var{uc},
that is, the recursive application of ``canonical decomposition'', use the
function @code{u*_normalize} with argument @code{UNINORM_NFD} instead.
@end deftypefun

@node Composition of characters
@section Composition of Unicode characters

@cindex composing, Unicode characters
@cindex combining, Unicode characters
The following function composes a Unicode character from two Unicode
characters.

@deftypefun ucs4_t uc_composition (ucs4_t @var{uc1}, ucs4_t @var{uc2})
Attempts to combine the Unicode characters @var{uc1}, @var{uc2}.
@var{uc1} is known to have canonical combining class 0.

Returns the combination of @var{uc1} and @var{uc2}, if it exists.
Returns 0 otherwise.

Not all decompositions can be recombined using this function. See the Unicode
file @file{CompositionExclusions.txt} for details.
@end deftypefun

@node Normalization of strings
@section Normalization of strings

The Unicode standard defines four normalization forms for Unicode strings.
The following type is used to denote a normalization form.

@deftp Type uninorm_t
An object of type @code{uninorm_t} denotes a Unicode normalization form.
This is a scalar type; its values can be compared with @code{==}.
@end deftp

The following constants denote the four normalization forms.

@deftypevr Macro uninorm_t UNINORM_NFD
Denotes Normalization form D: canonical decomposition.
@end deftypevr

@deftypevr Macro uninorm_t UNINORM_NFC
Normalization form C: canonical decomposition, then canonical composition.
@end deftypevr

@deftypevr Macro uninorm_t UNINORM_NFKD
Normalization form KD: compatibility decomposition.
@end deftypevr

@deftypevr Macro uninorm_t UNINORM_NFKC
Normalization form KC: compatibility decomposition, then canonical composition.
@end deftypevr

The following functions operate on @code{uninorm_t} objects.

@deftypefun bool uninorm_is_compat_decomposing (uninorm_t @var{nf})
Tests whether the normalization form @var{nf} does compatibility decomposition.
@end deftypefun

@deftypefun bool uninorm_is_composing (uninorm_t @var{nf})
Tests whether the normalization form @var{nf} includes canonical composition.
@end deftypefun

@deftypefun uninorm_t uninorm_decomposing_form (uninorm_t @var{nf})
Returns the decomposing variant of the normalization form @var{nf}.
This maps NFC, NFD @arrow{} NFD and NFKC, NFKD @arrow{} NFKD.
@end deftypefun

The following functions apply a Unicode normalization form to a Unicode string.

@deftypefun {uint8_t *} u8_normalize (uninorm_t @var{nf}, const uint8_t *@var{s}, size_t @var{n}, uint8_t *@var{resultbuf}, size_t *@var{lengthp})
@deftypefunx {uint16_t *} u16_normalize (uninorm_t @var{nf}, const uint16_t *@var{s}, size_t @var{n}, uint16_t *@var{resultbuf}, size_t *@var{lengthp})
@deftypefunx {uint32_t *} u32_normalize (uninorm_t @var{nf}, const uint32_t *@var{s}, size_t @var{n}, uint32_t *@var{resultbuf}, size_t *@var{lengthp})
Returns the specified normalization form of a string.

The @var{resultbuf} and @var{lengthp} arguments are as described in
chapter @ref{Conventions}.
@end deftypefun

@node Normalizing comparisons
@section Normalizing comparisons

@cindex comparing, ignoring normalization
The following functions compare Unicode strings, ignoring differences in
normalization.

@deftypefun int u8_normcmp (const uint8_t *@var{s1}, size_t @var{n1}, const uint8_t *@var{s2}, size_t @var{n2}, uninorm_t @var{nf}, int *@var{resultp})
@deftypefunx int u16_normcmp (const uint16_t *@var{s1}, size_t @var{n1}, const uint16_t *@var{s2}, size_t @var{n2}, uninorm_t @var{nf}, int *@var{resultp})
@deftypefunx int u32_normcmp (const uint32_t *@var{s1}, size_t @var{n1}, const uint32_t *@var{s2}, size_t @var{n2}, uninorm_t @var{nf}, int *@var{resultp})
Compares @var{s1} and @var{s2}, ignoring differences in normalization.

@var{nf} must be either @code{UNINORM_NFD} or @code{UNINORM_NFKD}.

If successful, sets @code{*@var{resultp}} to -1 if @var{s1} < @var{s2},
0 if @var{s1} = @var{s2}, 1 if @var{s1} > @var{s2}, and returns 0.
Upon failure, returns -1 with @code{errno} set.
@end deftypefun

@cindex comparing, ignoring normalization, with collation rules
@cindex comparing, with collation rules, ignoring normalization
@deftypefun {char *} u8_normxfrm (const uint8_t *@var{s}, size_t @var{n}, uninorm_t @var{nf}, char *@var{resultbuf}, size_t *@var{lengthp})
@deftypefunx {char *} u16_normxfrm (const uint16_t *@var{s}, size_t @var{n}, uninorm_t @var{nf}, char *@var{resultbuf}, size_t *@var{lengthp})
@deftypefunx {char *} u32_normxfrm (const uint32_t *@var{s}, size_t @var{n}, uninorm_t @var{nf}, char *@var{resultbuf}, size_t *@var{lengthp})
Converts the string @var{s} of length @var{n} to a NUL-terminated byte
sequence, in such a way that comparing @code{u8_normxfrm (@var{s1})} and
@code{u8_normxfrm (@var{s2})} with the @code{u8_cmp2} function is equivalent to
comparing @var{s1} and @var{s2} with the @code{u8_normcoll} function.

@var{nf} must be either @code{UNINORM_NFC} or @code{UNINORM_NFKC}.

The @var{resultbuf} and @var{lengthp} arguments are as described in
chapter @ref{Conventions}.
@end deftypefun

@deftypefun int u8_normcoll (const uint8_t *@var{s1}, size_t @var{n1}, const uint8_t *@var{s2}, size_t @var{n2}, uninorm_t @var{nf}, int *@var{resultp})
@deftypefunx int u16_normcoll (const uint16_t *@var{s1}, size_t @var{n1}, const uint16_t *@var{s2}, size_t @var{n2}, uninorm_t @var{nf}, int *@var{resultp})
@deftypefunx int u32_normcoll (const uint32_t *@var{s1}, size_t @var{n1}, const uint32_t *@var{s2}, size_t @var{n2}, uninorm_t @var{nf}, int *@var{resultp})
Compares @var{s1} and @var{s2}, ignoring differences in normalization, using
the collation rules of the current locale.

@var{nf} must be either @code{UNINORM_NFC} or @code{UNINORM_NFKC}.

If successful, sets @code{*@var{resultp}} to -1 if @var{s1} < @var{s2},
0 if @var{s1} = @var{s2}, 1 if @var{s1} > @var{s2}, and returns 0.
Upon failure, returns -1 with @code{errno} set.
@end deftypefun

@node Normalization of streams
@section Normalization of streams of Unicode characters

@cindex stream, normalizing a
A ``stream of Unicode characters'' is essentially a function that accepts a
@code{ucs4_t} argument repeatedly, optionally combined with a function that
``flushes'' the stream.

@deftp Type {struct uninorm_filter}
This is the data type of a stream of Unicode characters that normalizes its
input according to a given normalization form and passes the normalized
character sequence to the encapsulated stream of Unicode characters.
@end deftp

@deftypefun {struct uninorm_filter *} uninorm_filter_create (uninorm_t @var{nf}, int (*@var{stream_func}) (void *@var{stream_data}, ucs4_t @var{uc}), void *@var{stream_data})
Creates and returns a normalization filter for Unicode characters.

The pair (@var{stream_func}, @var{stream_data}) is the encapsulated stream.
@code{@var{stream_func} (@var{stream_data}, @var{uc})} receives the Unicode
character @var{uc} and returns 0 if successful, or -1 with @code{errno} set
upon failure.

Returns the new filter, or NULL with @code{errno} set upon failure.
@end deftypefun

@deftypefun int uninorm_filter_write (struct uninorm_filter *@var{filter}, ucs4_t @var{uc})
Stuffs a Unicode character into a normalizing filter.
Returns 0 if successful, or -1 with @code{errno} set upon failure.
@end deftypefun

@deftypefun int uninorm_filter_flush (struct uninorm_filter *@var{filter})
Brings data buffered in the filter to its destination, the encapsulated stream.

Returns 0 if successful, or -1 with @code{errno} set upon failure.

Note: If additional characters are written into the filter after calling
this function, the resulting character sequence in the encapsulated stream
will not necessarily be normalized.
@end deftypefun

@deftypefun int uninorm_filter_free (struct uninorm_filter *@var{filter})
Brings data buffered in the filter to its destination, the encapsulated stream,
then closes and frees the filter.

Returns 0 if successful, or -1 with @code{errno} set upon failure.
@end deftypefun
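
The encapsulated stream described in this section is a plain callback plus an
opaque data pointer. As a minimal sketch of that convention, the following
shows one possible callback that appends each received character to a
fixed-size buffer. The names @code{struct ucs4_sink} and @code{ucs4_sink_put}
are hypothetical and not part of the library, and the local @code{ucs4_t}
typedef is only a stand-in so that the fragment compiles without the library
headers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the typedef normally provided by the library headers, where
   ucs4_t is a 32-bit unsigned integer.  Remove this line when including
   <uninorm.h>.  */
typedef uint32_t ucs4_t;

/* Hypothetical sink: a caller-provided, fixed-capacity array of characters.  */
struct ucs4_sink
{
  ucs4_t *buf;      /* storage owned by the caller */
  size_t capacity;  /* number of elements available in buf */
  size_t length;    /* number of elements stored so far */
};

/* Has the shape required of STREAM_FUNC: receives one Unicode character and
   returns 0 if successful, or -1 with errno set upon failure.  */
static int
ucs4_sink_put (void *stream_data, ucs4_t uc)
{
  struct ucs4_sink *sink = stream_data;

  assert (sink != NULL);
  if (sink->length >= sink->capacity)
    {
      /* No room left: report the failure the way the filter expects.  */
      errno = ENOMEM;
      return -1;
    }
  sink->buf[sink->length++] = uc;
  return 0;
}
```

With the library available, such a callback and a pointer to its sink would
be passed as @var{stream_func} and @var{stream_data} to
@code{uninorm_filter_create}. Characters are then fed through
@code{uninorm_filter_write}, and @code{uninorm_filter_free} delivers whatever
is still buffered before releasing the filter.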