@node Iconv
@chapter Character-set conversions (@file{iconv.h})

This chapter describes the Newlib iconv library.
The iconv function declarations are in
@file{iconv.h}.

@menu
* iconv::                   Character set conversion routines
* iconv architecture::      Architecture of Newlib iconv library
* iconv configuration::     Newlib iconv-specific configure options
* Generating CCS tables::   How to generate CCS tables
* Adding new converter::    Steps on adding a new converter
@end menu

@page
@include iconv/iconv.def

@page
@node iconv architecture
@section iconv architecture
@findex iconv architecture
@findex encoding
@findex CCS
@findex CES
@findex iconv converter
@*
@itemize @bullet
@item
Encoding - a rule to represent computer text by means of bits and bytes.
@item
CCS (Coded Character Set) - a mapping from an abstract character set
to a set of non-negative integers (character codes).
@item
CES (Character Encoding Scheme) - a mapping from a set of character codes
to a sequence of bytes.
@end itemize

@*
Examples of CCS: ASCII, ISO-8859-x, KOI8-R, KSX-1001, GB-2312.@*
Examples of CES: UTF-8, UTF-16, EUC-JP, ISO-2022-JP.

@*
The iconv library converts an array of characters in one encoding
to an array in another encoding.

@*
From a user's point of view, the iconv library is a set of converters. Each converter
corresponds to one encoding (e.g., KOI8-R converter, UTF-8 converter).
Internally, the meaning of converter is different.

@*
The iconv library always performs conversions through UCS-32: i.e., to convert
from A to B, the iconv library first converts A to UCS-32, and then UCS-32 to B.

@*
Each encoding consists of a CES and a CCS. A CCS may be represented as data tables,
but a CES always implies some code (algorithm). Iconv uses CCS tables
to map from some encoding to UCS-32. CCS tables are placed into
the iconv/ccs subdirectory of newlib.
The iconv code also uses CES
modules which can convert some CCS to and from UCS-32. CES modules are placed
in the iconv/ces subdirectory.

@*
Some encodings have CES = CCS (e.g., KOI8-R). For such encodings iconv uses
special subroutines which perform simple table conversions (ccs_table.c).

@*
Among specialized CES modules, the iconv library has
generic support for EUC and ISO-2022-family encodings (ces_euc.c and
ces_iso2022.c).

@*
To enable iconv to work with CCS or CES-based encodings, the corresponding
CCS table or CES module should be linked with Newlib. The iconv library
can also load CCS tables dynamically from external files (.cct files from
the iconv/ccs/binary subdirectory). CES modules, on the other hand, can't
be dynamically loaded.

@*
Each iconv converter has one name and a set of aliases. The list of
aliases for each converter's name is in the iconv/charset.aliases file.
Note: iconv always normalizes converter names and aliases before using them.

@page
@node iconv configuration
@section iconv configuration
@findex iconv configuration
@findex iconv converter
@*
To enable iconv, the @option{--enable-newlib-iconv} configuration option should be
used when configuring newlib.

@*
To link a specific converter (CCS table or CES module) into Newlib, the
@option{--enable-newlib-builtin-converters} option should be used. A
comma-separated list of converters can be passed with this option
(e.g., @option{--enable-newlib-builtin-converters=koi8-r,euc-jp} to link the KOI8-R
and EUC-JP converters). Either converter names or aliases may be used.

@*
If the target system has a file system accessible by Newlib, table-based
converters may be loaded dynamically from external files. The iconv
code tries to load files from the iconv_data subdirectory of the directory
specified by the NLSPATH environment variable.
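
@*
For illustration, a build with iconv enabled and the KOI8-R and EUC-JP
converters linked in might be configured roughly as sketched here. The
source path, the target triplet and the NLSPATH value are placeholders
chosen for this example and are not taken from this manual; only the two
configure options and the iconv_data subdirectory name come from the
description in this section.

@example
# Hypothetical source path and target; adjust to the actual build.
../newlib-src/configure --target=arm-none-eabi \
    --enable-newlib-iconv \
    --enable-newlib-builtin-converters=koi8-r,euc-jp

# Hypothetical data location on the target: with this setting,
# table files would be looked up in /usr/share/nls/iconv_data.
export NLSPATH=/usr/share/nls
@end example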

@*
Since Newlib has no generic dynamic module load support, CES-based converters
can't be dynamically loaded and must be linked in.

@page
@node Generating CCS tables
@section Generating CCS tables
@*
CCS tables are placed in the ccs subdirectory of the iconv directory.
This subdirectory contains .cct and .c files. The .cct files are for
dynamic loading whereas the .c files are for static linking with Newlib.
Both .c and .cct files are generated by the 'iconv_mktbl' Perl script
from special source files (call them
.txt files). The 'iconv_mktbl' script can be found in the iconv/ccs
subdirectory. Input .txt files can be found at the Unicode.org site or
at other locations on the web.

@*
The .c files are linked with Newlib if the corresponding 'configure' script
option was given. This is needed to use iconv on targets without file system
support. If a CCS table isn't configured to be linked, the iconv library
tries to load it dynamically from the corresponding .cct file.

@*
The following commands build .c and .cct CCS table files from .txt
files for several supported encodings.

@*
@itemize
@item
cp775:@*
iconv_mktbl -Co cp775.c cp775.txt@*
iconv_mktbl -o cp775.cct cp775.txt
@end itemize

@itemize
@item
cp850:@*
iconv_mktbl -Co cp850.c cp850.txt@*
iconv_mktbl -o cp850.cct cp850.txt
@end itemize

@itemize
@item
cp852:@*
iconv_mktbl -Co cp852.c cp852.txt@*
iconv_mktbl -o cp852.cct cp852.txt
@end itemize

@itemize
@item
cp855:@*
iconv_mktbl -Co cp855.c cp855.txt@*
iconv_mktbl -o cp855.cct cp855.txt
@end itemize

@itemize
@item
cp866@*
iconv_mktbl -Co cp866.c cp866.txt@*
iconv_mktbl -o cp866.cct cp866.txt
@end itemize

@itemize
@item
iso-8859-1@*
iconv_mktbl -Co iso-8859-1.c iso-8859-1.txt@*
iconv_mktbl -o iso-8859-1.cct iso-8859-1.txt
@end itemize

@itemize
@item
iso-8859-4@*
iconv_mktbl -Co iso-8859-4.c iso-8859-4.txt@*
iconv_mktbl -o iso-8859-4.cct iso-8859-4.txt
@end itemize

@itemize
@item
iso-8859-5@*
iconv_mktbl -Co iso-8859-5.c iso-8859-5.txt@*
iconv_mktbl -o iso-8859-5.cct iso-8859-5.txt
@end itemize

@itemize
@item
iso-8859-2@*
iconv_mktbl -Co iso-8859-2.c iso-8859-2.txt@*
iconv_mktbl -o iso-8859-2.cct iso-8859-2.txt
@end itemize

@itemize
@item
iso-8859-15@*
iconv_mktbl -Co iso-8859-15.c iso-8859-15.txt@*
iconv_mktbl -o iso-8859-15.cct iso-8859-15.txt
@end itemize

@itemize
@item
big5@*
iconv_mktbl -Co big5.c big5.txt@*
iconv_mktbl -o big5.cct big5.txt
@end itemize

@itemize
@item
ksx1001@*
iconv_mktbl -Co ksx1001.c ksx1001.txt@*
iconv_mktbl -o ksx1001.cct ksx1001.txt
@end itemize

@itemize
@item
gb_2312@*
iconv_mktbl -Co gb_2312-80.c gb_2312-80.txt@*
iconv_mktbl -o gb_2312-80.cct gb_2312-80.txt
@end itemize

@itemize
@item
jis_x0201@*
iconv_mktbl -Co jis_x0201.c jis_x0201.txt@*
iconv_mktbl -o jis_x0201.cct jis_x0201.txt
@end itemize
@itemize
@item
shift_jis@*
iconv_mktbl -Co shift_jis.c shift_jis.txt@*
iconv_mktbl -o shift_jis.cct shift_jis.txt
@end itemize

@itemize
@item
jis_x0208@*
iconv_mktbl -C -c 1 -u 2 -o jis_x0208-1983.c jis_x0208-1983.txt@*
iconv_mktbl -c 1 -u 2 -o jis_x0208-1983.cct jis_x0208-1983.txt
@end itemize

@itemize
@item
jis_x0212@*
iconv_mktbl -Co jis_x0212-1990.c jis_x0212-1990.txt@*
iconv_mktbl -o jis_x0212-1990.cct jis_x0212-1990.txt
@end itemize

@itemize
@item
cns11643-plane1@*
iconv_mktbl -C -p 0x1 -o cns11643-plane1.c cns11643.txt@*
iconv_mktbl -p 0x1 -o cns11643-plane1.cct cns11643.txt
@end itemize

@itemize
@item
cns11643-plane2@*
iconv_mktbl -C -p 0x2 -o cns11643-plane2.c cns11643.txt@*
iconv_mktbl -p 0x2 -o cns11643-plane2.cct cns11643.txt
@end itemize

@itemize
@item
cns11643-plane14@*
iconv_mktbl -C -p 0xE -o cns11643-plane14.c cns11643.txt@*
iconv_mktbl -p 0xE -o cns11643-plane14.cct cns11643.txt
@end itemize

@itemize
@item
koi8-r@*
iconv_mktbl -Co koi8-r.c koi8-r.txt@*
iconv_mktbl -o koi8-r.cct koi8-r.txt
@end itemize

@itemize
@item
koi8-u@*
iconv_mktbl -Co koi8-u.c koi8-u.txt@*
iconv_mktbl -o koi8-u.cct koi8-u.txt
@end itemize

@itemize
@item
us-ascii@*
iconv_mktbl -Cao us-ascii.c iso-8859-1.txt@*
iconv_mktbl -ao us-ascii.cct iso-8859-1.txt
@end itemize

@*
Source files for CCS tables can be obtained from at least two places:

@*
@enumerate
@item
http://www.unicode.org/Public/MAPPINGS/ contains many encoding
map files.
@item
http://www.dante.net/staff/konstantin/FreeBSD/iconv/ contains the original
iconv sources and encoding map files.
@end enumerate

@*
The following are URLs where source files for some of the CCS tables
can be found:

@itemize
@item
big5:@*
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT
@end itemize

@itemize
@item
cns11643_plane14, cns11643_plane1 and cns11643_plane2:@*
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/CNS11643.TXT
@end itemize

@itemize
@item
cp775, cp850, cp852, cp855, cp866:@*
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/
@end itemize

@itemize
@item
gb_2312_80:@*
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/GB/GB2312.TXT
@end itemize

@itemize
@item
iso_8859_15, iso_8859_1, iso_8859_2, iso_8859_4, iso_8859_5:@*
http://www.unicode.org/Public/MAPPINGS/ISO8859/
@end itemize

@itemize
@item
jis_x0201, jis_x0208_1983, jis_x0212_1990, shift_jis@*
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0201.TXT
@end itemize

@itemize
@item
koi8_r@*
http://www.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8-R.TXT
@end itemize

@itemize
@item
ksx1001@*
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSX1001.TXT
@end itemize

@itemize
@item
koi8-u can be obtained from the original FreeBSD iconv library distribution:@*
http://www.dante.net/staff/konstantin/FreeBSD/iconv/
@end itemize

@*
Moreover, http://www.dante.net/staff/konstantin/FreeBSD/iconv/ contains
many additional CCS tables that you can use with Newlib (iso-2022 and
RFC1345 encodings).
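
@*
As a sketch of the whole workflow for one table-based encoding, the
commands here fetch the KOI8-R mapping file from the URL listed in this
section and generate both table forms. The use of curl, the lower-case
local file name, and iconv_mktbl being reachable through PATH are
assumptions made for this example; the two iconv_mktbl invocations are the
koi8-r commands already given in this section.

@example
# Assumption: curl is available; any download tool would do.
curl -o koi8-r.txt \
    http://www.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8-R.TXT

# Table as C source, for static linking with Newlib.
iconv_mktbl -Co koi8-r.c koi8-r.txt

# Table as a .cct file, for dynamic loading at run time.
iconv_mktbl -o koi8-r.cct koi8-r.txt
@end example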

@page
@node Adding new converter
@section Adding a new iconv converter
@*
The following steps should be taken to add a new iconv converter:

@*
@enumerate
@item
The converter's name and list of aliases should be added to
the iconv/charset.aliases file.
@item
All iconv converters are protected by a _ICONV_CONVERTER_XXX
macro, where XXX is the converter name. This protection macro should be added to
the newlib/newlib.hin file.
@item
The converter's name and aliases should also be registered in the _iconv_builtin_aliases
table in iconv/lib/bialiasesi.c. The list should be protected by
the corresponding macro mentioned above.
@item
If the new converter is just a CCS table, the corresponding .cct and .c files
should be added to the iconv/ccs/ subdirectory. The names of the files
should match the normalized encoding name. The 'iconv_mktbl'
Perl script (found in iconv/ccs) may
be used to generate such files. The file names should be added to the
iconv/ccs/Makefile.am and iconv/ccs/binary/Makefile.am files, and then
automake should be used to regenerate the Makefile.in files.
@item
If the new converter has a CES algorithm, the appropriate file should be
added to the
iconv/ces/ subdirectory. The name of the file should again match
the normalized
encoding name.
@item
If the converter is an EUC or ISO-2022-family CES, then the converter
is just an array listing the CCS it uses (see ccs/euc-jp.c for an example). This
is because iconv already has EUC and ISO-2022 support. The CCS tables used should
be provided in iconv/ccs/.
@item
If the converter isn't an EUC or ISO-2022-based CES, the following functions
should be provided (see utf-8.c for an example):
@itemize @minus
@item A function to convert from the new CES to UCS-32;
@item A function to convert from UCS-32 to the new CES;
@item An 'init' function;
@item A 'close' function;
@item A 'reset' function to reset the shift state for a stateful CES.
@end itemize

@*
All these functions are registered in a 'struct iconv_ces_desc' object.
The name of the object should be _iconv_ces_module_XXX, where XXX is the
name of the converter.
@item
For CES converters, a reference to the corresponding 'struct iconv_ces_desc' should
be added to the iconv/lib/bices.c file.

@*
For CCS converters, the corresponding table reference should be added to
the iconv/lib/biccs.c file.
@end enumerate