1@node Iconv
2@chapter Character-set conversions (@file{iconv.h})
3
4This chapter describes the Newlib iconv library.
The iconv function declarations are in
@file{iconv.h}.
7
8@menu
9* iconv::                 Character set conversion routines
10* iconv architecture::    Architecture of Newlib iconv library
11* iconv configuration::   Newlib iconv-specific configure options
12* Generating CCS tables:: How to generate CCS tables
13* Adding new converter::  Steps on adding a new converter
14@end menu
15
16@page
17@include iconv/iconv.def
18
19@page
20@node iconv architecture
21@section iconv architecture
22@findex iconv architecture
23@findex encoding
24@findex CCS
25@findex CES
26@findex iconv converter
27@*
28@itemize @bullet
29@item
30Encoding - a rule to represent computer text by means of bits and bytes.
31@item
32CCS (Coded Character Set) - a mapping from an abstract character set
33to a set of non-negative integers (character codes).
34@item
CES (Character Encoding Scheme) - a mapping from a set of character codes
to a sequence of bytes.
37@end itemize
38
39@*
40Examples of CCS: ASCII, ISO-8859-x, KOI8-R, KSX-1001, GB-2312.@*
41Examples of CES: UTF-8, UTF-16, EUC-JP, ISO-2022-JP.
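
To make the CCS/CES split concrete, the following is a minimal sketch of
the CES half for UTF-8: it takes a character code that some CCS has already
assigned and serializes it to bytes.  The function name ucs_to_utf8 is an
assumption made for this illustration and is not part of the Newlib sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one character code as UTF-8 (the CES step).
   Returns the number of bytes written to out (1 to 4), or 0 if the code
   is a surrogate or lies beyond U+10FFFF.  out must hold 4 bytes. */
size_t ucs_to_utf8(uint32_t code, unsigned char *out)
{
    assert(out != NULL);
    if (code < 0x80) {
        out[0] = (unsigned char)code;
        return 1;
    }
    if (code < 0x800) {
        out[0] = (unsigned char)(0xC0 | (code >> 6));
        out[1] = (unsigned char)(0x80 | (code & 0x3F));
        return 2;
    }
    if (code >= 0xD800 && code <= 0xDFFF)
        return 0;                       /* surrogates are not characters */
    if (code < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (code >> 12));
        out[1] = (unsigned char)(0x80 | ((code >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (code & 0x3F));
        return 3;
    }
    if (code <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (code >> 18));
        out[1] = (unsigned char)(0x80 | ((code >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((code >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (code & 0x3F));
        return 4;
    }
    return 0;
}
```

The same character code would produce a different byte sequence under
UTF-16, which is why the CES is described as an algorithm rather than a table.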
42
43@*
The iconv library is used to convert an array of characters in one encoding
to an array of characters in another encoding.
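
As an illustration of such a conversion, here is a minimal sketch built on
the standard iconv_open, iconv and iconv_close calls.  The wrapper name
convert_buffer and its return convention are assumptions of this example.
The cast on the input pointer is needed because iconv takes a non-const
pointer even though it does not modify the input.

```c
#include <iconv.h>
#include <stddef.h>

/* Convert inlen bytes at in from encoding `from` to encoding `to`.
   Returns the number of bytes stored in out, or (size_t)-1 on failure. */
size_t convert_buffer(const char *to, const char *from,
                      const char *in, size_t inlen,
                      char *out, size_t outlen)
{
    iconv_t cd = iconv_open(to, from);  /* note: target encoding first */
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *inp = (char *)in;
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outlen;

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t)-1)
        /* Flush any pending shift sequence of a stateful target encoding. */
        rc = iconv(cd, NULL, NULL, &outp, &outleft);
    iconv_close(cd);

    return rc == (size_t)-1 ? (size_t)-1 : outlen - outleft;
}
```

The sketch treats every error the same way; a real caller would usually
inspect errno to tell a full output buffer from invalid input.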
46
47@*
From a user's point of view, the iconv library is a set of converters. Each converter
corresponds to one encoding (e.g., KOI8-R converter, UTF-8 converter).
Internally, however, the term converter has a different meaning, as described below.
51
52@*
The iconv library always performs conversions through UCS-32: i.e., to convert
from A to B, the iconv library first converts A to UCS-32, and then UCS-32 to B.
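
The pivot scheme can be sketched as follows.  This is a toy model, not the
Newlib implementation: the function-pointer types, the UCS_INVALID marker and
the two tiny single-byte converters (which cover ASCII and a few Cyrillic
letters only) are assumptions made for the illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define UCS_INVALID 0xFFFFFFFFu

typedef uint32_t (*to_ucs_fn)(unsigned char byte);  /* encoding A -> UCS */
typedef int (*from_ucs_fn)(uint32_t code);          /* UCS -> byte, or -1 */

/* Toy KOI8-R decoder: ASCII plus two Cyrillic small letters. */
uint32_t koi8r_to_ucs(unsigned char b)
{
    if (b < 0x80) return b;
    if (b == 0xC1) return 0x0430;   /* CYRILLIC SMALL LETTER A */
    if (b == 0xC2) return 0x0431;   /* CYRILLIC SMALL LETTER BE */
    return UCS_INVALID;
}

/* Toy CP866 encoder: ASCII plus U+0430..U+043F. */
int ucs_to_cp866(uint32_t code)
{
    if (code < 0x80) return (int)code;
    if (code >= 0x0430 && code <= 0x043F)
        return (int)(0xA0 + (code - 0x0430));
    return -1;
}

/* Convert n single-byte characters from A to B through the UCS pivot.
   Returns the number of characters converted before the first failure. */
size_t convert_via_ucs(to_ucs_fn decode, from_ucs_fn encode,
                       const unsigned char *in, size_t n, unsigned char *out)
{
    size_t i;
    for (i = 0; i < n; i++) {
        uint32_t code = decode(in[i]);      /* step 1: A -> UCS */
        if (code == UCS_INVALID)
            break;
        int byte = encode(code);            /* step 2: UCS -> B */
        if (byte < 0)
            break;
        out[i] = (unsigned char)byte;
    }
    return i;
}
```

With a pivot, N encodings need N decoders and N encoders instead of a
dedicated routine for every pair of encodings.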
55
56@*
Each encoding consists of a CES and a CCS. A CCS may be represented as data tables,
but a CES always implies some code (an algorithm). Iconv uses CCS tables
to map from some encoding to UCS-32. CCS tables are placed into
the iconv/ccs subdirectory of Newlib.  The iconv code also uses CES
modules which can convert some CCS to and from UCS-32.  CES modules are placed
in the iconv/ces subdirectory.
63
64@*
65Some encodings have CES = CCS (e.g., KOI8-R). For such encodings iconv uses
66special subroutines which perform simple table conversions (ccs_table.c).
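
A simple table conversion of this kind can be pictured with the sketch below.
It is not the code from ccs_table.c: the table is a hand-picked excerpt of
KOI8-R with only four entries filled in, and the names koi8r_high,
table_to_ucs and table_from_ucs are assumptions of this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Excerpt of a KOI8-R table for bytes 0x80..0xFF.
   A value of 0 marks an entry left out of this sketch. */
static const uint16_t koi8r_high[128] = {
    [0xC0 - 0x80] = 0x044E,   /* CYRILLIC SMALL LETTER YU */
    [0xC1 - 0x80] = 0x0430,   /* CYRILLIC SMALL LETTER A */
    [0xC2 - 0x80] = 0x0431,   /* CYRILLIC SMALL LETTER BE */
    [0xE1 - 0x80] = 0x0410,   /* CYRILLIC CAPITAL LETTER A */
};

/* Byte to character code; returns 0xFFFF if the byte is not in the excerpt. */
uint16_t table_to_ucs(unsigned char b)
{
    if (b < 0x80)
        return b;                       /* ASCII half maps to itself */
    uint16_t code = koi8r_high[b - 0x80];
    return code != 0 ? code : 0xFFFF;
}

/* Character code to byte by scanning the table; returns -1 if not found. */
int table_from_ucs(uint16_t code)
{
    if (code < 0x80)
        return code;
    for (size_t i = 0; i < 128; i++)
        if (koi8r_high[i] == code)
            return (int)(0x80 + i);
    return -1;
}
```

The linear scan keeps the sketch short; a generated table would normally
carry a reverse index so that both directions are constant-time lookups.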
67
68@*
69Among specialized CES modules, the iconv library has
70generic support for EUC and ISO-2022-family encodings (ces_euc.c and
71ces_iso2022.c).
72
73@*
To enable iconv to work with CCS-based or CES-based encodings, the corresponding
CCS table or CES module should be linked with Newlib.  The iconv support
can also load CCS tables dynamically from external files (.cct files from the
iconv/ccs/binary subdirectory).  CES modules, on the other hand, can't
be dynamically loaded.
79
80@*
81Each iconv converter has one name and a set of aliases. The list of
82aliases for each converter's name is in the iconv/charset.aliases file.
Note: iconv always normalizes converter names and aliases before using them.
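
The exact normalization rule is not spelled out in this chapter.  The sketch
below assumes that normalization means lower-casing the name and replacing
each hyphen with an underscore, which matches spellings such as iso_8859_1
used elsewhere in this chapter; both the rule and the function name
normalize_name are assumptions, not a description of the Newlib code.

```c
#include <ctype.h>
#include <stddef.h>

/* Write a normalized copy of name into out (assumed rule: lower case,
   '-' becomes '_').  Returns 0 on success, -1 if out is too small. */
int normalize_name(const char *name, char *out, size_t outsize)
{
    size_t i;

    if (outsize == 0)
        return -1;
    for (i = 0; name[i] != '\0'; i++) {
        unsigned char c = (unsigned char)name[i];
        if (i + 1 >= outsize)
            return -1;                  /* no room for the terminator */
        out[i] = (c == '-') ? '_' : (char)tolower(c);
    }
    out[i] = '\0';
    return 0;
}
```

Under this assumed rule, ISO-8859-1 and iso_8859_1 resolve to the same
converter name.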
84
85@page
86@node iconv configuration
87@section iconv configuration
88@findex iconv configuration
89@findex iconv converter
90@*
To enable iconv, the @option{--enable-newlib-iconv} configuration option should be
used when configuring Newlib.
93
94@*
To link a specific converter (CCS table or CES module) into Newlib, the
@option{--enable-newlib-builtin-converters} option should be used.  A
comma-separated list of converters can be passed with this option
(e.g., @option{--enable-newlib-builtin-converters=koi8-r,euc-jp} to link the KOI8-R
and EUC-JP converters).  Either converter names or aliases may be used.
100
101@*
102If the target system has a file system accessible by Newlib, table-based
103converters may be loaded dynamically from external files.  The iconv
104code tries to load files from the iconv_data subdirectory of the directory
105specified by the NLSPATH environment variable.
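
A minimal sketch of how such a file name could be assembled is shown below.
The helper name cct_path, the use of the normalized converter name as the
file name, and the treatment of NLSPATH as a single directory are all
assumptions of this example; the actual lookup code in Newlib may differ.

```c
#include <stddef.h>
#include <stdio.h>

/* Build "<nlspath>/iconv_data/<name>.cct" in out.
   Returns the length of the path, or -1 if nlspath is unset or out is
   too small. */
int cct_path(const char *nlspath, const char *name, char *out, size_t outsize)
{
    if (nlspath == NULL || *nlspath == '\0')
        return -1;
    int n = snprintf(out, outsize, "%s/iconv_data/%s.cct", nlspath, name);
    if (n < 0 || (size_t)n >= outsize)
        return -1;                      /* error or truncated */
    return n;
}
```

A caller would typically pass the result of getenv("NLSPATH") as the first
argument and then try to open the resulting path.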
106
107@*
Since Newlib has no generic dynamic module load support, CES-based converters
can't be dynamically loaded and should be linked in.
110
111@page
112@node Generating CCS tables
113@section Generating CCS tables
114@*
115CCS tables are placed in the ccs subdirectory of the iconv directory.
116This subdirectory contains .cct and .c files.  The .cct files are for
117dynamic loading whereas the .c files are for static linking with Newlib.
Both .c and .cct files are generated by the 'iconv_mktbl' Perl script
from special source files (referred to here as .txt files).
The 'iconv_mktbl' script can be found in the iconv/ccs
subdirectory.  Input .txt files can be found at the Unicode.org site or
at other locations on the web.
123
124@*
The .c files are linked with Newlib if the corresponding 'configure' script
option was given.  This is needed to use iconv on targets without file system
support.  If a CCS table isn't configured to be linked, the iconv library
tries to load it dynamically from a corresponding .cct file.
129
130@*
131The following are commands to build .c and .cct CCS table files from .txt
132files for several supported encodings.
133
134@*
135@itemize
136@item
cp775@*
138iconv_mktbl -Co cp775.c cp775.txt@*
139iconv_mktbl -o cp775.cct cp775.txt
140@end itemize
141
142@itemize
143@item
cp850@*
145iconv_mktbl -Co cp850.c cp850.txt@*
146iconv_mktbl -o cp850.cct cp850.txt
147@end itemize
148
149@itemize
150@item
cp852@*
152iconv_mktbl -Co cp852.c cp852.txt@*
153iconv_mktbl -o cp852.cct cp852.txt
154@end itemize
155
156@itemize
157@item
cp855@*
159iconv_mktbl -Co cp855.c cp855.txt@*
160iconv_mktbl -o cp855.cct cp855.txt
161@end itemize
162
163@itemize
164@item
165cp866@*
166iconv_mktbl -Co cp866.c cp866.txt@*
167iconv_mktbl -o cp866.cct cp866.txt
168@end itemize
169
170@itemize
171@item
172iso-8859-1@*
173iconv_mktbl -Co iso-8859-1.c iso-8859-1.txt@*
174iconv_mktbl -o iso-8859-1.cct iso-8859-1.txt
175@end itemize
176
177@itemize
178@item
179iso-8859-4@*
180iconv_mktbl -Co iso-8859-4.c iso-8859-4.txt@*
181iconv_mktbl -o iso-8859-4.cct iso-8859-4.txt
182@end itemize
183
184@itemize
185@item
186iso-8859-5@*
187iconv_mktbl -Co iso-8859-5.c iso-8859-5.txt@*
188iconv_mktbl -o iso-8859-5.cct iso-8859-5.txt
189@end itemize
190
191@itemize
192@item
193iso-8859-2@*
194iconv_mktbl -Co iso-8859-2.c iso-8859-2.txt@*
195iconv_mktbl -o iso-8859-2.cct iso-8859-2.txt
196@end itemize
197
198@itemize
199@item
200iso-8859-15@*
201iconv_mktbl -Co iso-8859-15.c iso-8859-15.txt@*
202iconv_mktbl -o iso-8859-15.cct iso-8859-15.txt
203@end itemize
204
205@itemize
206@item
207big5@*
208iconv_mktbl -Co big5.c big5.txt@*
209iconv_mktbl -o big5.cct big5.txt
210@end itemize
211
212@itemize
213@item
214ksx1001@*
215iconv_mktbl -Co ksx1001.c ksx1001.txt@*
216iconv_mktbl -o ksx1001.cct ksx1001.txt
217@end itemize
218
219@itemize
220@item
221gb_2312@*
222iconv_mktbl -Co gb_2312-80.c gb_2312-80.txt@*
223iconv_mktbl -o gb_2312-80.cct gb_2312-80.txt
224@end itemize
225
226@itemize
227@item
228jis_x0201@*
229iconv_mktbl -Co jis_x0201.c jis_x0201.txt@*
230iconv_mktbl -o jis_x0201.cct jis_x0201.txt
231@end itemize
232
233@itemize
234@item
shift_jis@*
iconv_mktbl -Co shift_jis.c shift_jis.txt@*
236iconv_mktbl -o shift_jis.cct shift_jis.txt
237@end itemize
238
239@itemize
240@item
241jis_x0208@*
242iconv_mktbl -C -c 1 -u 2 -o jis_x0208-1983.c jis_x0208-1983.txt@*
243iconv_mktbl -c 1 -u 2 -o jis_x0208-1983.cct jis_x0208-1983.txt
244@end itemize
245
246@itemize
247@item
248jis_x0212@*
249iconv_mktbl -Co jis_x0212-1990.c jis_x0212-1990.txt@*
250iconv_mktbl -o jis_x0212-1990.cct jis_x0212-1990.txt
251@end itemize
252
253@itemize
254@item
255cns11643-plane1@*
256iconv_mktbl -C -p 0x1 -o cns11643-plane1.c cns11643.txt@*
257iconv_mktbl -p 0x1 -o cns11643-plane1.cct cns11643.txt
258@end itemize
259
260@itemize
261@item
262cns11643-plane2@*
263iconv_mktbl -C -p 0x2 -o cns11643-plane2.c cns11643.txt@*
264iconv_mktbl -p 0x2 -o cns11643-plane2.cct cns11643.txt
265@end itemize
266
267@itemize
268@item
269cns11643-plane14@*
270iconv_mktbl -C -p 0xE -o cns11643-plane14.c cns11643.txt@*
271iconv_mktbl -p 0xE -o cns11643-plane14.cct cns11643.txt
272@end itemize
273
274@itemize
275@item
276koi8-r@*
277iconv_mktbl -Co koi8-r.c koi8-r.txt@*
278iconv_mktbl -o koi8-r.cct koi8-r.txt
279@end itemize
280
281@itemize
282@item
283koi8-u@*
284iconv_mktbl -Co koi8-u.c koi8-u.txt@*
285iconv_mktbl -o koi8-u.cct koi8-u.txt
286@end itemize
287
288@itemize
289@item
290us-ascii@*
291iconv_mktbl -Cao us-ascii.c iso-8859-1.txt@*
292iconv_mktbl -ao us-ascii.cct iso-8859-1.txt
293@end itemize
294
295@*
296Source files for CCS tables can be taken from at least two places:
297
298@*
299@enumerate
300@item
301http://www.unicode.org/Public/MAPPINGS/ contains a lot of encoding
302map files.
303@item
304http://www.dante.net/staff/konstantin/FreeBSD/iconv/ contains original
305iconv sources and encoding map files.
306@end enumerate
307
308@*
309The following are URLs where source files for some of the CCS tables
310are found:
311
312@itemize
313@item
314big5:@*
315http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT
316@end itemize
317
318@itemize
319@item
320cns11643_plane14, cns11643_plane1 and cns11643_plane2:@*
321http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/CNS11643.TXT
322@end itemize
323
324@itemize
325@item
326cp775, cp850, cp852, cp855, cp866:@*
327http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/
328@end itemize
329
330@itemize
331@item
332gb_2312_80:@*
333http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/GB/GB2312.TXT
334@end itemize
335
336@itemize
337@item
338iso_8859_15, iso_8859_1, iso_8859_2, iso_8859_4, iso_8859_5:@*
339http://www.unicode.org/Public/MAPPINGS/ISO8859/
340@end itemize
341
342@itemize
343@item
jis_x0201, jis_x0208_1983, jis_x0212_1990, shift_jis:@*
345http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0201.TXT
346@end itemize
347
348@itemize
349@item
koi8_r:@*
351http://www.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8-R.TXT
352@end itemize
353
354@itemize
355@item
ksx1001:@*
357http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSX1001.TXT
358@end itemize
359
360@itemize
361@item
koi8-u can be obtained from the original FreeBSD iconv library distribution:@*
http://www.dante.net/staff/konstantin/FreeBSD/iconv/
364@end itemize
365
366@*
367Moreover, http://www.dante.net/staff/konstantin/FreeBSD/iconv/ contains a
368lot of additional CCS tables that you can use with Newlib (iso-2022 and
369RFC1345 encodings).
370
371@page
372@node Adding new converter
373@section Adding a new iconv converter
374@*
375The following steps should be taken to add a new iconv converter:
376
377@*
378@enumerate
379@item
The converter's name and list of aliases should be added to
the iconv/charset.aliases file.
382@item
All iconv converters are protected by a _ICONV_CONVERTER_XXX
macro, where XXX is the converter name.  This protection macro should be added to
the newlib/newlib.hin file.
386@item
The converter's name and aliases should also be registered in the
_iconv_builtin_aliases table in iconv/lib/bialiasesi.c.  The list should be
protected by the corresponding macro mentioned above.
390@item
If a new converter is just a CCS table, the corresponding .cct and .c files
should be added to the iconv/ccs/ subdirectory. The names of the files
should match the normalized encoding name.  The 'iconv_mktbl'
Perl script (found in iconv/ccs) may
be used to generate such files.  The file names should be added to the
iconv/ccs/Makefile.am and iconv/ccs/binary/Makefile.am files, and then
automake should be used to regenerate the Makefile.in files.
398@item
If a new converter has a CES algorithm, the appropriate file should be
added to the iconv/ces/ subdirectory.  The name of the file again should
be equivalent to the normalized encoding name.
404@item
If a converter is an EUC or ISO-2022-family CES, then the converter
is just an array with a list of the CCS tables it uses (see ces/euc-jp.c for
an example). This is because iconv already has EUC and ISO-2022 support.
The CCS tables used should be provided in iconv/ccs/.
409@item
If a converter isn't an EUC or ISO-2022-based CES, the following functions
should be provided (see utf-8.c for an example):
@itemize @minus
@item A function to convert from the new CES to UCS-32;
@item A function to convert from UCS-32 to the new CES;
@item An 'init' function;
@item A 'close' function;
@item A 'reset' function to reset the shift state for a stateful CES.
@end itemize
419
420@*
421All these functions are registered into a 'struct iconv_ces_desc' object.
422The name of the object should be _iconv_ces_module_XXX, where XXX is the
423name of the converter.
424@item
For CES converters, the corresponding 'struct iconv_ces_desc' reference should
be added to the iconv/lib/bices.c file.

@*
For CCS converters, the corresponding table reference should be added to
the iconv/lib/biccs.c file.
431@end enumerate
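
The steps above for a CES converter can be sketched as follows.  This is a
hedged illustration only: the layout of struct example_ces_desc, the function
signatures, and the converter name MY_CES are hypothetical and stand in for
the real 'struct iconv_ces_desc' and _iconv_ces_module_XXX object, whose
definitions should be taken from the Newlib sources.  The toy CES is
stateless and maps each byte directly to the character code of the same value.

```c
#include <stddef.h>
#include <stdint.h>

/* Normally set by configure through newlib.h; defined here only so that
   the sketch compiles on its own.  MY_CES is a hypothetical name. */
#define _ICONV_CONVERTER_MY_CES 1

/* Hypothetical descriptor layout; the real struct iconv_ces_desc may differ. */
struct example_ces_desc {
    void *(*init)(void);
    void (*close)(void *state);
    void (*reset)(void *state);
    int (*to_ucs)(void *state, const unsigned char **in, size_t *inleft,
                  uint32_t *code);
    int (*from_ucs)(void *state, uint32_t code, unsigned char **out,
                    size_t *outleft);
};

#ifdef _ICONV_CONVERTER_MY_CES
static void *my_init(void) { return NULL; }          /* no state needed */
static void my_close(void *state) { (void)state; }
static void my_reset(void *state) { (void)state; }   /* no shift state */

/* Consume one input byte and report it as a character code. */
static int my_to_ucs(void *state, const unsigned char **in, size_t *inleft,
                     uint32_t *code)
{
    (void)state;
    if (*inleft == 0)
        return -1;
    *code = *(*in)++;
    (*inleft)--;
    return 0;
}

/* Emit one byte for codes 0x00..0xFF; anything else is not representable. */
static int my_from_ucs(void *state, uint32_t code, unsigned char **out,
                       size_t *outleft)
{
    (void)state;
    if (code > 0xFF || *outleft == 0)
        return -1;
    *(*out)++ = (unsigned char)code;
    (*outleft)--;
    return 0;
}

/* In Newlib this object would be named _iconv_ces_module_XXX. */
const struct example_ces_desc example_ces_module_my_ces = {
    my_init, my_close, my_reset, my_to_ucs, my_from_ucs
};
#endif /* _ICONV_CONVERTER_MY_CES */
```

Keeping the whole module inside the protection macro means that a converter
left out at configure time contributes no code or data to the library.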
432
433