@node uninorm.h
@chapter Normalization forms (composition and decomposition) @code{<uninorm.h>}

@cindex normal forms
@cindex normalizing
This include file defines functions for transforming Unicode strings to one
of the four normal forms, known as NFC, NFD, NFKC, NFKD.  These
transformations involve decomposition and, for NFC and NFKC, composition
of Unicode characters.

@menu
* Decomposition of characters::
* Composition of characters::
* Normalization of strings::
* Normalizing comparisons::
* Normalization of streams::
@end menu

@node Decomposition of characters
@section Decomposition of Unicode characters

@cindex decomposing
The following enumerated values are the possible types of decomposition of a
Unicode character.

@deftypevr Constant int UC_DECOMP_CANONICAL
Denotes canonical decomposition.
@end deftypevr

@deftypevr Constant int UC_DECOMP_FONT
UCD marker: @code{<font>}.  Denotes a font variant (e.g@. a blackletter form).
@end deftypevr

@deftypevr Constant int UC_DECOMP_NOBREAK
UCD marker: @code{<noBreak>}.
Denotes a no-break version of a space or hyphen.
@end deftypevr

@deftypevr Constant int UC_DECOMP_INITIAL
UCD marker: @code{<initial>}.
Denotes an initial presentation form (Arabic).
@end deftypevr

@deftypevr Constant int UC_DECOMP_MEDIAL
UCD marker: @code{<medial>}.
Denotes a medial presentation form (Arabic).
@end deftypevr

@deftypevr Constant int UC_DECOMP_FINAL
UCD marker: @code{<final>}.
Denotes a final presentation form (Arabic).
@end deftypevr

@deftypevr Constant int UC_DECOMP_ISOLATED
UCD marker: @code{<isolated>}.
Denotes an isolated presentation form (Arabic).
@end deftypevr

@deftypevr Constant int UC_DECOMP_CIRCLE
UCD marker: @code{<circle>}.
Denotes an encircled form.
@end deftypevr

@deftypevr Constant int UC_DECOMP_SUPER
UCD marker: @code{<super>}.
Denotes a superscript form.
@end deftypevr

@deftypevr Constant int UC_DECOMP_SUB
UCD marker: @code{<sub>}.
Denotes a subscript form.
@end deftypevr

@deftypevr Constant int UC_DECOMP_VERTICAL
UCD marker: @code{<vertical>}.
Denotes a vertical layout presentation form.
@end deftypevr

@deftypevr Constant int UC_DECOMP_WIDE
UCD marker: @code{<wide>}.
Denotes a wide (or zenkaku) compatibility character.
@end deftypevr

@deftypevr Constant int UC_DECOMP_NARROW
UCD marker: @code{<narrow>}.
Denotes a narrow (or hankaku) compatibility character.
@end deftypevr

@deftypevr Constant int UC_DECOMP_SMALL
UCD marker: @code{<small>}.
Denotes a small variant form (CNS compatibility).
@end deftypevr

@deftypevr Constant int UC_DECOMP_SQUARE
UCD marker: @code{<square>}.
Denotes a CJK squared font variant.
@end deftypevr

@deftypevr Constant int UC_DECOMP_FRACTION
UCD marker: @code{<fraction>}.
Denotes a vulgar fraction form.
@end deftypevr

@deftypevr Constant int UC_DECOMP_COMPAT
UCD marker: @code{<compat>}.
Denotes an otherwise unspecified compatibility character.
@end deftypevr

The following constant denotes the maximum size of decomposition of a single
Unicode character.

@deftypevr Macro {unsigned int} UC_DECOMPOSITION_MAX_LENGTH
This macro expands to a constant that is the required size of buffer passed to
the @code{uc_decomposition} and @code{uc_canonical_decomposition} functions.
@end deftypevr

The following functions decompose a Unicode character.

@deftypefun int uc_decomposition (ucs4_t @var{uc}, int *@var{decomp_tag}, ucs4_t *@var{decomposition})
Returns the character decomposition mapping of the Unicode character @var{uc}.
@var{decomposition} must point to an array of at least
@code{UC_DECOMPOSITION_MAX_LENGTH} @code{ucs4_t} elements.

When a decomposition exists, @code{@var{decomposition}[0..@var{n}-1]} and
@code{*@var{decomp_tag}} are filled and @var{n} is returned.  Otherwise -1 is
returned.
@end deftypefun

@deftypefun int uc_canonical_decomposition (ucs4_t @var{uc}, ucs4_t *@var{decomposition})
Returns the canonical character decomposition mapping of the Unicode character
@var{uc}.  @var{decomposition} must point to an array of at least
@code{UC_DECOMPOSITION_MAX_LENGTH} @code{ucs4_t} elements.

When a decomposition exists, @code{@var{decomposition}[0..@var{n}-1]} is filled
and @var{n} is returned.  Otherwise -1 is returned.

Note: This function returns the (simple) ``canonical decomposition'' of
@var{uc}.  If you want the ``full canonical decomposition'' of @var{uc},
that is, the recursive application of ``canonical decomposition'', use the
function @code{u*_normalize} with argument @code{UNINORM_NFD} instead.
@end deftypefun

@node Composition of characters
@section Composition of Unicode characters

@cindex composing, Unicode characters
@cindex combining, Unicode characters
The following function composes a Unicode character from two Unicode
characters.

@deftypefun ucs4_t uc_composition (ucs4_t @var{uc1}, ucs4_t @var{uc2})
Attempts to combine the Unicode characters @var{uc1}, @var{uc2}.
The caller must ensure that @var{uc1} has canonical combining class 0.

Returns the combination of @var{uc1} and @var{uc2}, if it exists.
Returns 0 otherwise.

Not all decompositions can be recombined using this function.  See the Unicode
file @file{CompositionExclusions.txt} for details.
@end deftypefun

@node Normalization of strings
@section Normalization of strings

The Unicode standard defines four normalization forms for Unicode strings.
The following type is used to denote a normalization form.

@deftp Type uninorm_t
An object of type @code{uninorm_t} denotes a Unicode normalization form.
This is a scalar type; its values can be compared with @code{==}.
@end deftp

The following constants denote the four normalization forms.

@deftypevr Macro uninorm_t UNINORM_NFD
Denotes normalization form D: canonical decomposition.
@end deftypevr

@deftypevr Macro uninorm_t UNINORM_NFC
Denotes normalization form C: canonical decomposition, then canonical
composition.
@end deftypevr

@deftypevr Macro uninorm_t UNINORM_NFKD
Denotes normalization form KD: compatibility decomposition.
@end deftypevr

@deftypevr Macro uninorm_t UNINORM_NFKC
Denotes normalization form KC: compatibility decomposition, then canonical
composition.
@end deftypevr

The following functions operate on @code{uninorm_t} objects.

@deftypefun bool uninorm_is_compat_decomposing (uninorm_t @var{nf})
Tests whether the normalization form @var{nf} does compatibility decomposition.
@end deftypefun

@deftypefun bool uninorm_is_composing (uninorm_t @var{nf})
Tests whether the normalization form @var{nf} includes canonical composition.
@end deftypefun

@deftypefun uninorm_t uninorm_decomposing_form (uninorm_t @var{nf})
Returns the decomposing variant of the normalization form @var{nf}.
This maps NFC, NFD @arrow{} NFD and NFKC, NFKD @arrow{} NFKD.
@end deftypefun

The following functions apply a Unicode normalization form to a Unicode string.

@deftypefun {uint8_t *} u8_normalize (uninorm_t @var{nf}, const uint8_t *@var{s}, size_t @var{n}, uint8_t *@var{resultbuf}, size_t *@var{lengthp})
@deftypefunx {uint16_t *} u16_normalize (uninorm_t @var{nf}, const uint16_t *@var{s}, size_t @var{n}, uint16_t *@var{resultbuf}, size_t *@var{lengthp})
@deftypefunx {uint32_t *} u32_normalize (uninorm_t @var{nf}, const uint32_t *@var{s}, size_t @var{n}, uint32_t *@var{resultbuf}, size_t *@var{lengthp})
Returns the specified normalization form of a string.

The @var{resultbuf} and @var{lengthp} arguments are as described in
chapter @ref{Conventions}.
@end deftypefun

@node Normalizing comparisons
@section Normalizing comparisons

@cindex comparing, ignoring normalization
The following functions compare Unicode strings, ignoring differences in
normalization.

@deftypefun int u8_normcmp (const uint8_t *@var{s1}, size_t @var{n1}, const uint8_t *@var{s2}, size_t @var{n2}, uninorm_t @var{nf}, int *@var{resultp})
@deftypefunx int u16_normcmp (const uint16_t *@var{s1}, size_t @var{n1}, const uint16_t *@var{s2}, size_t @var{n2}, uninorm_t @var{nf}, int *@var{resultp})
@deftypefunx int u32_normcmp (const uint32_t *@var{s1}, size_t @var{n1}, const uint32_t *@var{s2}, size_t @var{n2}, uninorm_t @var{nf}, int *@var{resultp})
Compares @var{s1} and @var{s2}, ignoring differences in normalization.

@var{nf} must be either @code{UNINORM_NFD} or @code{UNINORM_NFKD}.

If successful, sets @code{*@var{resultp}} to -1 if @var{s1} < @var{s2},
0 if @var{s1} = @var{s2}, 1 if @var{s1} > @var{s2}, and returns 0.
Upon failure, returns -1 with @code{errno} set.
@end deftypefun

@cindex comparing, ignoring normalization, with collation rules
@cindex comparing, with collation rules, ignoring normalization
@deftypefun {char *} u8_normxfrm (const uint8_t *@var{s}, size_t @var{n}, uninorm_t @var{nf}, char *@var{resultbuf}, size_t *@var{lengthp})
@deftypefunx {char *} u16_normxfrm (const uint16_t *@var{s}, size_t @var{n}, uninorm_t @var{nf}, char *@var{resultbuf}, size_t *@var{lengthp})
@deftypefunx {char *} u32_normxfrm (const uint32_t *@var{s}, size_t @var{n}, uninorm_t @var{nf}, char *@var{resultbuf}, size_t *@var{lengthp})
Converts the string @var{s} of length @var{n} to a NUL-terminated byte
sequence, in such a way that comparing @code{u8_normxfrm (@var{s1})} and
@code{u8_normxfrm (@var{s2})} with the @code{u8_cmp2} function is equivalent to
comparing @var{s1} and @var{s2} with the @code{u8_normcoll} function.

@var{nf} must be either @code{UNINORM_NFC} or @code{UNINORM_NFKC}.

The @var{resultbuf} and @var{lengthp} arguments are as described in
chapter @ref{Conventions}.
@end deftypefun

@deftypefun int u8_normcoll (const uint8_t *@var{s1}, size_t @var{n1}, const uint8_t *@var{s2}, size_t @var{n2}, uninorm_t @var{nf}, int *@var{resultp})
@deftypefunx int u16_normcoll (const uint16_t *@var{s1}, size_t @var{n1}, const uint16_t *@var{s2}, size_t @var{n2}, uninorm_t @var{nf}, int *@var{resultp})
@deftypefunx int u32_normcoll (const uint32_t *@var{s1}, size_t @var{n1}, const uint32_t *@var{s2}, size_t @var{n2}, uninorm_t @var{nf}, int *@var{resultp})
Compares @var{s1} and @var{s2}, ignoring differences in normalization, using
the collation rules of the current locale.

@var{nf} must be either @code{UNINORM_NFC} or @code{UNINORM_NFKC}.

If successful, sets @code{*@var{resultp}} to -1 if @var{s1} < @var{s2},
0 if @var{s1} = @var{s2}, 1 if @var{s1} > @var{s2}, and returns 0.
Upon failure, returns -1 with @code{errno} set.
@end deftypefun

@node Normalization of streams
@section Normalization of streams of Unicode characters

@cindex stream, normalizing a
A ``stream of Unicode characters'' is essentially a function that accepts a
@code{ucs4_t} argument repeatedly, optionally combined with a function that
``flushes'' the stream.

@deftp Type {struct uninorm_filter}
This is the data type of a stream of Unicode characters that normalizes its
input according to a given normalization form and passes the normalized
character sequence to the encapsulated stream of Unicode characters.
@end deftp

@deftypefun {struct uninorm_filter *} uninorm_filter_create (uninorm_t @var{nf}, int (*@var{stream_func}) (void *@var{stream_data}, ucs4_t @var{uc}), void *@var{stream_data})
Creates and returns a normalization filter for Unicode characters.

The pair (@var{stream_func}, @var{stream_data}) is the encapsulated stream.
@code{@var{stream_func} (@var{stream_data}, @var{uc})} receives the Unicode
character @var{uc} and returns 0 if successful, or -1 with @code{errno} set
upon failure.

Returns the new filter, or NULL with @code{errno} set upon failure.
@end deftypefun

@deftypefun int uninorm_filter_write (struct uninorm_filter *@var{filter}, ucs4_t @var{uc})
Stuffs a Unicode character into a normalizing filter.
Returns 0 if successful, or -1 with @code{errno} set upon failure.
@end deftypefun

@deftypefun int uninorm_filter_flush (struct uninorm_filter *@var{filter})
Brings data buffered in the filter to its destination, the encapsulated stream.

Returns 0 if successful, or -1 with @code{errno} set upon failure.

Note: If additional characters are written into the filter after calling this
function, the resulting character sequence in the encapsulated stream will
not necessarily be normalized.
@end deftypefun

@deftypefun int uninorm_filter_free (struct uninorm_filter *@var{filter})
Brings data buffered in the filter to its destination, the encapsulated stream,
then closes and frees the filter.

Returns 0 if successful, or -1 with @code{errno} set upon failure.
@end deftypefun