\input texinfo @c -*-texinfo-*- -*- coding: latin-1 -*-
@c %**start of header
@setfilename recode.info
@settitle The @code{recode} reference manual

@c An index for command-line options
@defcodeindex op
@c Put variable and function names together
@syncodeindex vr fn
@finalout
@c %**end of header

@include version.texi

@dircategory Internationalization and character sets
@direntry
* recode: (recode).     Conversion between character sets and surfaces.
@end direntry

@ifinfo
This file documents the @code{recode} command, which has the purpose of
converting files between various character sets and surfaces.

Copyright (C) 1990, 93, 94, 96, 97, 98, 99, 00 Free Software Foundation, Inc.

Permission is granted to make and distribute verbatim copies of
this manual provided the copyright notice and this permission notice
are preserved on all copies.

@ignore
Permission is granted to process this file through TeX and print the
results, provided the printed document carries copying permission
notice identical to this one except for the removal of this paragraph
(this paragraph not being relevant to the printed manual).

@end ignore
Permission is granted to copy and distribute modified versions of this
manual under the conditions for verbatim copying, provided that the entire
resulting derived work is distributed under the terms of a permission
notice identical to this one.

Permission is granted to copy and distribute translations of this manual
into another language, under the above conditions for modified versions,
except that this permission notice may be stated in a translation approved
by the Foundation.
@end ifinfo

@titlepage
@title Free recode, version @value{VERSION}
@subtitle The character set converter
@subtitle Edition @value{EDITION}, @value{UPDATED}
@author Fran@,{c}ois Pinard

@page
@vskip 0pt plus 1filll
Copyright @copyright{} 1993, 94, 97, 98, 99, 00 Free Software Foundation, Inc.

Permission is granted to make and distribute verbatim copies of
this manual provided the copyright notice and this permission notice
are preserved on all copies.

Permission is granted to copy and distribute modified versions of this
manual under the conditions for verbatim copying, provided that the entire
resulting derived work is distributed under the terms of a permission
notice identical to this one.

Permission is granted to copy and distribute translations of this manual
into another language, under the above conditions for modified versions,
except that this permission notice may be stated in a translation approved
by the Foundation.
@end titlepage

@ifnottex
@node Top, Tutorial, (dir), (dir)
@top @code{recode}

@c @item @b{@code{recode}} @value{hfillkludge} (UtilT, SrcCD)
@c
This recoding library converts files between various coded character
sets and surface encodings.  When this cannot be achieved exactly, it
may get rid of the offending characters or fall back on approximations.
The library recognises or produces more than 300 different character sets
and is able to convert files between almost any pair.  Most @w{RFC 1345}
character sets, and all @code{libiconv} character sets, are supported.
The @code{recode} program is a handy front-end to the library.

The current @code{recode} release is @value{VERSION}.

@menu
* Tutorial::                    Quick Tutorial
* Introduction::                Terminology and purpose
* Invoking recode::             How to use this program
* Library::                     A recoding library
* Universal::                   The universal charset
* libiconv::                    The @code{iconv} library
* Tabular::                     Tabular sources (@w{RFC 1345})
* ASCII misc::                  ASCII and some derivatives
* IBM and MS::                  Some IBM or Microsoft charsets
* CDC::                         Charsets for CDC machines
* Micros::                      Other micro-computer charsets
* Miscellaneous::               Various other charsets
* Surfaces::                    All about surfaces
* Internals::                   Internal aspects
* Concept Index::               Concept Index
* Option Index::                Option Index
* Library Index::               Library Index
* Charset and Surface Index::   Charset and Surface Index

@detailmenu
 --- The Detailed Node Listing ---

Terminology and purpose

* Charset overview::            Overview of charsets
* Surface overview::            Overview of surfaces
* Contributing::                Contributions and bug reports

How to use this program

* Synopsis::                    Synopsis of @code{recode} call
* Requests::                    The @var{request} parameter
* Listings::                    Asking for various lists
* Recoding::                    Controlling how files are recoded
* Reversibility::               Reversibility issues
* Sequencing::                  Selecting sequencing methods
* Mixed::                       Using mixed charset input
* Emacs::                       Using @code{recode} within Emacs
* Debugging::                   Debugging considerations

A recoding library

* Outer level::                 Outer level functions
* Request level::               Request level functions
* Task level::                  Task level functions
* Charset level::               Charset level functions
* Errors::                      Handling errors

The universal charset

* UCS-2::                       Universal Character Set, 2 bytes
* UCS-4::                       Universal Character Set, 4 bytes
* UTF-7::                       Universal Transformation Format, 7 bits
* UTF-8::                       Universal Transformation Format, 8 bits
* UTF-16::                      Universal Transformation Format, 16 bits
* count-characters::            Frequency count of characters
* dump-with-names::             Fully
                                interpreted UCS dump

ASCII and some derivatives

* ASCII::                       Usual ASCII
* ISO 8859::                    ASCII extended by Latin Alphabets
* ASCII-BS::                    ASCII 7-bits, @kbd{BS} to overstrike
* flat::                        ASCII without diacritics nor underline

Some IBM or Microsoft charsets

* EBCDIC::                      EBCDIC codes
* IBM-PC::                      IBM's PC code
* Icon-QNX::                    Unisys' Icon code

Charsets for CDC machines

* Display Code::                Control Data's Display Code
* CDC-NOS::                     ASCII 6/12 from NOS
* Bang-Bang::                   ASCII ``bang bang''

Other micro-computer charsets

* Apple-Mac::                   Apple's Macintosh code
* AtariST::                     Atari ST code

Various other charsets

* HTML::                        World Wide Web representations
* LaTeX::                       LaTeX macro calls
* Texinfo::                     GNU project documentation files
* Vietnamese::                  Vietnamese charsets
* African::                     African charsets
* Others::                      Cyrillic and other charsets
* Texte::                       Easy French conventions
* Mule::                        Mule as a multiplexed charset

All about surfaces

* Permutations::                Permuting groups of bytes
* End lines::                   Representation for end of lines
* MIME::                        MIME contents encodings
* Dump::                        Interpreted character dumps
* Test::                        Artificial data for testing

Internal aspects

* Main flow::                   Overall organisation
* New charsets::                Adding new charsets
* New surfaces::                Adding new surfaces
* Design::                      Comments on the library design

@end detailmenu
@end menu

@end ifnottex

@node Tutorial, Introduction, Top, Top
@chapter Quick Tutorial

@cindex @code{recode} use, a tutorial
@cindex tutorial
So, really, you just are in a hurry to use @code{recode}, and do not
feel like studying this manual?  Even reading this paragraph slows you down?
We might have a problem, as you will have to do some guess work, and might
not become very proficient unless you have a very solid intuition@dots{}.

Let me use here, as a quick tutorial, an actual reply of mine to a
@code{recode} user, who writes:

@quotation
My situation is this---I occasionally get email with special characters
in it.  Sometimes this mail is from a user using IBM software and sometimes
it is a user using Mac software.  I myself am on a SPARC Solaris machine.
@end quotation

Your situation is similar to mine, except that I @emph{often} receive
email needing recoding, that is, much more than @emph{occasionally}!
The usual recodings I do are Mac to @w{Latin-1}, IBM page codes to @w{Latin-1},
Easy-French to @w{Latin-1}, remove Quoted-Printable, remove Base64.  These are
so frequent that I made myself a few two-keystroke Emacs commands to filter
the Emacs region.  This is very convenient for me.  I also resort to many
other email conversions, yet more rarely than the frequent cases above.

@quotation
It @emph{seems} like this should be doable using @code{recode}.  However,
when I try something like @samp{grecode mac macfile.txt} I get nothing
out---no error, no output, nothing.
@end quotation

Presuming you are using some recent version of @code{recode}, the command:

@example
recode mac macfile.txt
@end example

@noindent
is a request for recoding @file{macfile.txt} over itself, overwriting the
original, from Macintosh usual character code and Macintosh end of lines,
to @w{Latin-1} and Unix end of lines.  This is overwrite mode.  If you want
to use @code{recode} as a filter, which is probably what you need, rather do:

@example
recode mac
@end example

@noindent
and give your Macintosh file as standard input, you'll get the @w{Latin-1}
file on standard output.  The above command is an abbreviation for any of:

@example
recode mac..
recode mac..l1
recode mac..Latin-1
recode mac/CR..Latin-1/
recode Macintosh..ISO_8859-1
recode Macintosh/CR..ISO_8859-1/
@end example

That is, a @code{CR} surface, encoding newlines with ASCII @key{CR}, is
first to be removed (this is a default surface for @samp{mac}), then the
Macintosh charset is converted to @w{Latin-1} and no surface is added to the
result (there is no default surface for @samp{l1}).  If you want @samp{mac}
code converted, but you know that newlines are already coded the Unix way,
just do:

@example
recode mac/
@end example

@noindent
the slash then overriding the default surface with empty, that is, none.
Here are other easy recipes:

@example
recode pc          to filter IBM-PC code and CR-LF (default) to Latin-1
recode pc/         to filter IBM-PC code to Latin-1
recode 850         to filter code page 850 and CR-LF (default) to Latin-1
recode 850/        to filter code page 850 to Latin-1
recode /qp         to remove quoted printable
@end example

The last one is indeed equivalent to any of:

@example
recode /qp..
recode l1/qp..l1/
recode ISO_8859-1/Quoted-Printable..ISO_8859-1/
@end example

Here are some reverse recipes:

@example
recode ..mac       to filter Latin-1 to Macintosh code and CR (default)
recode ..mac/      to filter Latin-1 to Macintosh code
recode ..pc        to filter Latin-1 to IBM-PC code and CR-LF (default)
recode ..pc/       to filter Latin-1 to IBM-PC code
recode ..850       to filter Latin-1 to code page 850 and CR-LF (default)
recode ..850/      to filter Latin-1 to code page 850
recode ../qp       to force quoted printable
@end example

In all the above calls, replace @samp{recode} by @samp{recode -f} if you
want to proceed despite recoding errors.
If you do not use @samp{-f}
and there is an error, the recoding output will be interrupted after the
first error in filter mode, or the file will not be replaced by a recoded
copy in overwrite mode.

You may use @samp{recode -l} to get a list of available charsets and
surfaces, and @samp{recode --help} to get a quick summary of options.
The above output is meant for those having already read this manual, so
let me dare a suggestion: why could you not find a few more minutes in
your schedule to peek further down, right into the following chapters!

@node Introduction, Invoking recode, Tutorial, Top
@chapter Terminology and purpose

A few terms are used over and over in this manual; our wise reader will
learn their meaning right away.  Both ISO (International Organization for
Standardisation) and IETF (Internet Engineering Task Force) have their
own terminology; this document does not try to stick to either one in a
strict way, while it does not want to throw more confusion in the field.
On the other hand, it would not be efficient using paraphrases all the time,
so @code{recode} coins a few short words, which are explained below.

@cindex charset, what it is
A @dfn{charset}, in the context of @code{recode}, is a particular association
between computer codes on one side, and a repertoire of intended characters
on the other side.  Codes are usually taken from a set of consecutive
small integers, starting at 0.  Some characters have a graphical appearance
(glyph) or displayable effect, others have special uses like, for example,
to control devices or to interact with neighbouring codes to specify them
more precisely.  So, a @emph{charset} is roughly one of those tables,
giving a meaning to each of the codes from the set of allowable values.
MIME also uses the term charset with approximately the same meaning.
It does @emph{not} exactly correspond to what ISO calls a @dfn{coded
character set}, that is, a set of characters with an encoding for them.
A coded character set does not necessarily use all available code positions,
while a MIME charset usually tries to specify them all.  A MIME charset
might be the union of a few disjoint coded character sets.

@cindex surface, what it is
A @dfn{surface} is a term used in @code{recode} only, and is short for
surface transformation of a charset stream.  This is any kind of mapping,
usually reversible, which associates physical bits in some medium for
a stream of characters taken from one or more charsets (usually one).
A surface is a kind of varnish added over a charset so it fits in actual
bits and bytes.  How end of lines are exactly encoded is not really
pertinent to the charset, and so, there is a surface for end of lines.
@code{Base64} is also a surface, as we may encode any charset in it.
Other examples would be @code{DES} enciphering, or @code{gzip} compression
(even if @code{recode} does not offer them currently): these are ways to give
a real life to theoretical charsets.  The @dfn{trivial} surface consists
of putting characters into fixed width little chunks of bits, usually
eight such bits per character.  But things are not always that simple.

This @code{recode} library, and the program by that name, have the purpose
of converting files between various charsets and surfaces.  When this
cannot be done in exact ways, as it is often the case, the program may
get rid of the offending characters or fall back on approximations.
This library recognises or produces around 175 such charsets under 500
names, and handles a dozen surfaces.  Since it can convert each charset to
almost any other one, many thousands of different conversions are possible.
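
As a minimal sketch of the distinction between a charset and a surface,
the calls below recode the same short @w{Latin-1} sample in three ways.
The sketch assumes that @code{recode} is installed and that the names
@samp{UTF-8} and @samp{Base64} are accepted as spelled here; the sample
text and the use of @code{od} for display are only illustrative.

```shell
# Sketch only: assumes the recode program is installed and on PATH.
# The sample is "caf", an e-acute (octal 351 in Latin-1) and a newline.
if command -v recode >/dev/null 2>&1; then
  # Charset conversion: the codes change from Latin-1 to UTF-8.
  printf 'caf\351\n' | recode l1..UTF-8 | od -c
  # Surface only: the charset stays Latin-1, Base64 is applied over it.
  printf 'caf\351\n' | recode l1..l1/Base64
  # Both at once: convert the codes, then apply the surface to the result.
  printf 'caf\351\n' | recode l1..UTF-8/Base64
else
  echo 'recode is not installed; nothing to demonstrate' >&2
fi
```

In the second call no character changes meaning; only the way the codes
are laid out in bytes differs, which is what a surface is about.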

The @code{recode} program and library do not usually know how to split and
sort out textual and non-textual information which may be mixed in a single
input file.  For example, there is no surface which currently addresses the
problem of how lines are blocked into physical records, when the blocking
information is added as binary markers or counters within files.  So,
@code{recode} should be given textual streams which are rather @emph{pure}.

This tool pays special attention to superimposition of diacritics for
some French representations.  This orientation is mostly historical; it
does not impair the usefulness, generality or extensibility of the program.
@samp{recode} is both a French and English word.  For those who pay attention
to those things, the proper pronunciation is French (that is, @samp{racud},
with @samp{a} like in @samp{above}, and @samp{u} like in @samp{cut}).

The program @code{recode} has been written by Fran@,{c}ois Pinard.
With time, it got to reuse works from other contributors, and notably,
those of Keld Simonsen and Bruno Haible.

@menu
* Charset overview::            Overview of charsets
* Surface overview::            Overview of surfaces
* Contributing::                Contributions and bug reports
@end menu

@node Charset overview, Surface overview, Introduction, Introduction
@section Overview of charsets

@cindex charsets, overview
Recoding is currently possible between many charsets, the bulk of which is
described by @w{RFC 1345} tables or available in the @code{iconv} library.
@xref{Tabular}, and @pxref{libiconv}.  The @code{recode} library also
handles some charsets in some specialised ways.
These are:

@itemize @bullet
@item
6-bit charsets based on CDC display code: 6/12 code from NOS; bang-bang
code from Universit@'e de Montr@'eal;

@item
7-bit ASCII: without any diacritics, or else: using backspace for
overstriking; Unisys' Icon convention; @TeX{}/La@TeX{} coding; easy
French conventions for electronic mail;

@item
8-bit extensions to ASCII: ISO @w{Latin-1}, Atari ST code, IBM's code for
the PC, Apple's code for the Macintosh;

@item
8-bit non-ASCII codes: three flavours of EBCDIC;

@item
16-bit or 31-bit universal characters, and their transfer encodings.
@end itemize

The introduction of @w{RFC 1345} in @code{recode} has brought with it a few
charsets having the functionality of older ones, but yet being different
in subtle ways.  The effects have not been fully investigated yet, so for
now, clashes are avoided: the old and new charsets are kept well separate.

@cindex unavailable conversions
@cindex conversions, unavailable
@cindex impossible conversions
@cindex unreachable charsets
@cindex exceptions to available conversions
@cindex pseudo-charsets
@tindex flat@r{, not as before charset}
@tindex count-characters@r{, not as before charset}
@tindex dump-with-names@r{, not as before charset}
@tindex data@r{, not with charsets}
@tindex libiconv@r{, not in requests}
Conversion is possible between almost any pair of charsets.  Here is a
list of the exceptions.  One may not recode @emph{from} the @code{flat},
@code{count-characters} or @code{dump-with-names} charsets, nor @emph{from}
or @emph{to} the @code{data}, @code{tree} or @code{:libiconv:} charsets.
Also, if we except the @code{data} and @code{tree} pseudo-charsets, charsets
and surfaces live in disjoint recoding spaces; one cannot really transform
a surface into a charset or vice-versa, as surfaces are only meant to be
applied over charsets, or removed from them.
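
The one-way nature of @code{flat} can be tried out with a small sketch
like the following.  It assumes that @code{recode} is installed, and it
further assumes that a request which cannot be satisfied makes the program
exit with a non-zero status; the messages printed by the script itself
are only illustrative.

```shell
# Sketch only: assumes the recode program is installed and on PATH.
if command -v recode >/dev/null 2>&1; then
  # Recoding to flat is allowed: flat is a valid goal charset.
  printf 'caf\351\n' | recode l1..flat
  # Recoding from flat is one of the listed exceptions.  A non-zero exit
  # status for a refused request is an assumption of this sketch.
  if printf 'cafe\n' | recode flat..l1 >/dev/null 2>&1; then
    echo 'unexpected: flat..l1 was accepted'
  else
    echo 'flat..l1 was refused'
  fi
else
  echo 'recode is not installed; nothing to demonstrate' >&2
fi
```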

@node Surface overview, Contributing, Charset overview, Introduction
@section Overview of surfaces

@cindex surfaces, overview
For various practical considerations, it sometimes happens that the codes
making up a text, written in a particular charset, cannot simply be put
out in a file one after another without creating problems or breaking
other things.  Sometimes, 8-bit codes cannot be written on a 7-bit medium,
variable length codes need some kind of envelope, newlines require special
treatment, etc.  We sometimes have to apply @dfn{surfaces} to a stream
of codes; these surfaces are a kind of trick used to fit the charset into
those practical constraints.  Moreover, similar surfaces or tricks may
be useful for many unrelated charsets, and many surfaces can be used at
once over a single charset.

@cindex pure charset
@cindex charset, pure
So, @code{recode} has machinery to describe a combination of a charset with
surfaces used over it in a file.  We would use the expression @dfn{pure
charset} for referring to a charset free of any surface, that is, the
conceptual association between integer codes and character intents.

It is not always clear if some transformation will yield a charset or a
surface, especially for those transformations which are only meaningful
over a single charset.  The @code{recode} library is not overly picky about
identifying surfaces as such: when it is practical to consider a specialised
surface as if it were a charset, this is preferred, and done.

@node Contributing,  , Surface overview, Introduction
@section Contributions and bug reports

@cindex contributing charsets
Even being the @code{recode} author and current maintainer, I am no
specialist in charset standards.  I only made @code{recode} along the
years to solve my own needs, but felt it was applicable for the needs
of others.
Some FSF people liked the program structure and suggested
to make it more widely available.  I often rely on @code{recode} users'
suggestions to decide what is best to be done next.

Properly protecting @code{recode} against possible copyright fights is a
pain for me and for contributors, but we cannot avoid addressing the issue
in the long run.  Besides, the Free Software Foundation, which mandates
the GNU project, is very sensitive to this matter.  GNU standards suggest
that we stay cautious before looking at copyrighted code.  The safest and
simplest way for me is to gather ideas and reprogram them anew, even if
this might slow me down considerably.  For contributions going beyond a
few lines of code here and there, the FSF definitely requires employer
disclaimers and copyright assignments in writing.

When you contribute something to @code{recode}, @emph{please} explain what
it is about.  Do not take for granted that I know those charsets which
are familiar to you.  Once again, I'm no expert, and you have to help me.
Your explanations could well find their way into this documentation, too.
Also, for contributing new charsets or new surfaces, as much as possible,
please provide good, solid, verifiable references for the tables you
used@footnote{I'm not inclined to accept a charset you just invented,
and which nobody uses yet: convince your friends and community first!}.

Many users contributed to @code{recode} already; I am grateful to them for
their interest and involvement.  Some suggestions can be integrated quickly
while some others have to be delayed; I have to draw a line somewhere when
time comes to make a new release, about what would go in it and what would
go in the next.

@cindex bug reports, where to send
@cindex reporting bugs
Please send suggestions, documentation errors and bug reports to
@email{recode-bugs@@iro.umontreal.ca} or, if you prefer, directly to
@email{pinard@@iro.umontreal.ca}, Fran@,{c}ois Pinard.  Do not be afraid
to report details, because this program is the mere aggregation of
hundreds of details.

@node Invoking recode, Library, Introduction, Top
@chapter How to use this program

With the synopsis of the @code{recode} call, we stress the difference
between using this program as a file filter, or recoding many files
at once.  The first parameter of any call states the recoding request,
and this deserves a section on its own.  Options are then presented,
but somewhat grouped according to the related functionalities they
control.

@menu
* Synopsis::                    Synopsis of @code{recode} call
* Requests::                    The @var{request} parameter
* Listings::                    Asking for various lists
* Recoding::                    Controlling how files are recoded
* Reversibility::               Reversibility issues
* Sequencing::                  Selecting sequencing methods
* Mixed::                       Using mixed charset input
* Emacs::                       Using @code{recode} within Emacs
* Debugging::                   Debugging considerations
@end menu

@node Synopsis, Requests, Invoking recode, Invoking recode
@section Synopsis of @code{recode} call

@cindex @code{recode}, synopsis of invocation
@cindex invocation of @code{recode}, synopsis
The general format of the program call is one of:

@example
recode [@var{option}]@dots{} [@var{charset} | @var{request} [@var{file}]@dots{} ]
@end example

Some calls are used only to obtain lists produced by @code{recode} itself,
without actually recoding any file.  They are recognised through the
usage of listing options, and these options decide what meaning should
be given to an optional @var{charset} parameter.  @xref{Listings}.

In other calls, the first parameter (@var{request}) always explains which
transformations are expected on the files.  There are many variations to
the aspect of this parameter.  We will discuss more complex situations
later (@pxref{Requests}), but for many simple cases, this parameter
merely looks like this@footnote{In previous versions of @code{recode}, a single
colon @samp{:} was used instead of the two dots @samp{..} for separating
charsets, but this was creating problems because colons are allowed in
official charset names.  The old request syntax is still recognised for
compatibility purposes, but is deprecated.}:

@example
@var{before}..@var{after}
@end example

@noindent
where @var{before} and @var{after} each gives the name of a charset.  Each
@var{file} will be read assuming it is coded with charset @var{before}, it
will be recoded over itself so as to use the charset @var{after}.  If there
is no @var{file} on the @code{recode} command, the program rather acts
as a Unix filter and transforms standard input onto standard output.
@cindex filter operation
@cindex @code{recode}, operation as filter

The capability of recoding many files at once is very convenient.
For example, one could easily prepare a distribution from @w{Latin-1} to MSDOS,
this way:

@example
mkdir package
cp -p Makefile *.[ch] package
recode Latin-1..MSDOS package/*
zoo ah package.zoo package/*
rm -rf package
@end example

@noindent
(In this example, the non-mandatory @samp{-p} option to @code{cp} is for
preserving timestamps, and the @code{zoo} program is an archiver from
Rahul Dhesi which once was quite popular.)

The filter operation is especially useful when the input files should
not be altered.  Let us make an example to illustrate this point.
Suppose that someone has a file named @file{datum.txt}, which is almost
a @TeX{} file, except that diacriticised characters are written using
@w{Latin-1}.  To complete the recoding of the diacriticised characters
@emph{only} and produce a file @file{datum.tex}, without destroying
the original, one could do:

@example
cp -p datum.txt datum.tex
recode -d l1..tex datum.tex
@end example

However, using @code{recode} as a filter will achieve the same goal more
neatly:

@example
recode -d l1..tex <datum.txt >datum.tex
@end example

This example also shows that @code{l1} could be used instead of
@code{Latin-1}; charset names often have such aliases.

@node Requests, Listings, Synopsis, Invoking recode
@section The @var{request} parameter

In the case where the @var{request} is merely written as
@var{before}..@var{after}, then @var{before} and @var{after} specify the
start charset and the goal charset for the recoding.

@cindex charset names, valid characters
@cindex valid characters in charset names
For @code{recode}, charset names may contain any character, besides a
comma, a forward slash, or two periods in a row.  But in practice, charset
names are currently limited to alphabetic letters (upper or lower case),
digits, hyphens, underlines, periods, colons or round parentheses.

@cindex request, syntax
@cindex @code{recode} request syntax
The complete syntax for a valid @var{request} allows for unusual
things, which might surprise at first.  (Do not pay too much attention
to these facilities on first reading.)
For example, @var{request}
may also contain intermediate charsets, like in the following example:

@example
@var{before}..@var{interim1}..@var{interim2}..@var{after}
@end example

@noindent
@cindex intermediate charsets
@cindex chaining of charsets in a request
@cindex charsets, chaining in a request
meaning that @code{recode} should internally produce the @var{interim1}
charset from the start charset, then work out of this @var{interim1}
charset to internally produce @var{interim2}, and from there towards the
goal charset.  In fact, @code{recode} internally combines recipes and
automatically uses interim charsets, when there is no direct recipe for
transforming @var{before} into @var{after}.  But there might be many ways
to do it.  When many routes are possible, the above @dfn{chaining} syntax
may be used to more precisely force the program towards a particular route,
which it might not have naturally selected otherwise.  On the other hand,
because @code{recode} tries to choose good routes, chaining is only needed
to achieve some rare, unusual effects.

Moreover, many such requests (sub-requests, more precisely) may be
separated with commas (but no spaces at all), indicating a sequence
of recodings, where the output of one has to serve as the input of the
following one.  For example, the two following requests are equivalent:

@example
@var{before}..@var{interim1}..@var{interim2}..@var{after}
@var{before}..@var{interim1},@var{interim1}..@var{interim2},@var{interim2}..@var{after}
@end example

@noindent
In this example, the charset input for any recoding sub-request is identical
to the charset output by the preceding sub-request.  But it does not have
to be so in the general case.
One might wonder what would be the meaning
of declaring the charset input for a recoding sub-request as being of a
different nature than the charset output by a preceding sub-request, when
recodings are chained in this way.  Such a strange usage might have a
meaning and be useful for the @code{recode} expert, but it is quite
uncommon in practice.

@cindex surfaces, syntax
More useful is the distinction between the concept of charset, and
the concept of surfaces.  An encoded charset is represented by:

@example
@var{pure-charset}/@var{surface1}/@var{surface2}@dots{}
@end example

@noindent
@cindex surfaces, commutativity
@cindex commutativity of surfaces
using slashes to introduce surfaces, if any.  The order of application
of surfaces is usually important, they cannot be freely commuted.  In the
given example, @var{surface1} is first applied over the @var{pure-charset},
then @var{surface2} is applied over the result.  Given this request:

@example
@var{before}/@var{surface1}/@var{surface2}..@var{after}/@var{surface3}
@end example

@noindent
the @code{recode} program will understand that the input files should
have @var{surface2} removed first (because it was applied last), then
@var{surface1} should be removed.  The next step will be to translate the
codes from charset @var{before} to charset @var{after}, prior to applying
@var{surface3} over the result.

@cindex implied surfaces
@cindex surfaces, implied
@tindex IBM-PC charset, and CR-LF surface
Some charsets have one or more @emph{implied} surfaces.  In this case, the
implied surfaces are automatically handled merely by naming the charset,
without any explicit surface to qualify it.  Let's take an example to
illustrate this feature.
The request @samp{pc..l1} will indeed decode MS-DOS
end of lines prior to converting IBM-PC codes to @w{Latin-1}, because @samp{pc}
is the name of a charset@footnote{More precisely, @code{pc} is an alias for
the charset @code{IBM-PC}.} which has @code{CR-LF} for its usual surface.
The request @samp{pc/..l1} will @emph{not} decode end of lines, since
the slash introduces surfaces, and even if the surface list is empty, it
effectively defeats the automatic removal of surfaces for this charset.
So, empty surfaces are useful, indeed!

@cindex aliases
@cindex alternate names for charsets and surfaces
@cindex charsets, aliases
@cindex surfaces, aliases
Both charsets and surfaces may have predefined alternate names, or aliases.
However, and this is rather important to understand, implied surfaces
are attached to individual aliases rather than to genuine charsets.
Consequently, the official charset name and all of its aliases do not
necessarily share the same implied surfaces.  The charset and all its
aliases may each have its own different set of implied surfaces.

@cindex abbreviated names for charsets and surfaces
@cindex names of charsets and surfaces, abbreviation
Charset names, surface names, or their aliases may always be abbreviated
to any unambiguous prefix.  Internally in @code{recode}, disambiguating
tables are kept separate for charset names and surface names.

@cindex letter case, in charset and surface names
While recognising a charset name or a surface name (or aliases thereof),
@code{recode} ignores all characters besides letters and digits, so for
example, the hyphens and underlines being part of an official charset
name may safely be omitted (no need to un-confuse them!).  There is also
no distinction between upper and lower case for charset or surface names.

One of the @var{before} or @var{after} keywords may be omitted.
If the
double dot separator is omitted too, then the charset is interpreted as
the @var{before} charset.@footnote{Both @var{before} and @var{after} may
be omitted, in which case the double dot separator is mandatory.  This is
not very useful, as the recoding reduces to a mere copy in that case.}

@cindex default charset
@cindex charset, default
@vindex DEFAULT_CHARSET
When a charset name is omitted or left empty, the value of the
@code{DEFAULT_CHARSET} variable in the environment is used instead.  If this
variable is not defined, the @code{recode} library uses the current locale's
encoding.  On POSIX-compliant systems, this depends on the first non-empty
value among the environment variables @code{LC_ALL}, @code{LC_CTYPE} and
@code{LANG}, and can be determined through the command @samp{locale charmap}.

If the charset name is omitted but followed by surfaces, the surfaces
then qualify the usual or default charset.  For example, the request
@samp{../x} is sufficient for applying a hexadecimal surface to the input
text@footnote{MS-DOS is one of those systems for which the default charset
has implied surfaces, @code{CR-LF} here.  Such surfaces are automatically
removed or applied whenever the default charset is read or written,
exactly as it would go for any other charset.  In the example above, on
such systems, the hexadecimal surface would then @emph{replace} the implied
surfaces.  For @emph{adding} a hexadecimal surface without removing any,
one should write the request as @samp{/../x}.}.

The allowable values for @var{before} or @var{after} charsets, and various
surfaces, are described in the remainder of this document.

@node Listings, Recoding, Requests, Invoking recode
@section Asking for various lists

Many options control listing output generated by @code{recode} itself;
they are not meant to accompany actual file recodings.
These options are:

@table @samp

@item --version
@opindex --version
@cindex @code{recode} version, printing
The program merely prints its version numbers on standard output, and
exits without doing anything else.

@item --help
@opindex --help
@cindex help page, printing
The program merely prints a page of help on standard output, and exits
without doing any recoding.

@item -C
@itemx --copyright
@opindex -C
@opindex --copyright
@cindex copyright conditions, printing
Given this option, all other parameters and options are ignored.  The
program prints briefly the copyright and copying conditions.  See the
file @file{COPYING} in the distribution for the full statement of the
Copyright and copying conditions.

@item -h[@var{language}/][@var{name}]
@itemx --header[=[@var{language}/][@var{name}]]
@opindex -h
@opindex --header
@cindex source file generation
@cindex programming language support
@cindex languages, programming
@cindex supported programming languages
Instead of recoding files, @code{recode} writes a @var{language} source
file on standard output and exits.  This source is meant to be included
in a regular program written in the same programming @var{language}:
its purpose is to declare and initialise an array, named @var{name},
which represents the requested recoding.  The only acceptable values for
@var{language} are @samp{c} or @samp{perl}, and these may be abbreviated.
If @var{language} is not specified, @samp{c} is assumed.  If @var{name}
is not specified, then it defaults to @samp{@var{before}_@var{after}}.
Strings @var{before} and @var{after} are cleaned before being used according
to the syntax of @var{language}.

Even if @code{recode} tries its best, this option does not always succeed in
producing the requested source table.
It will however, provided the recoding
can be internally represented by only one step after the optimisation phase,
and if this merged step conveys a one-to-one or a one-to-many explicit
table.  Also, when attempting to produce source tables, @code{recode}
relaxes its checking a tiny bit: it ignores the algorithmic part of some
tabular recodings, and it also avoids the processing of implied surfaces.
But this is all fairly technical.  Better try and see!

Beware that other options might affect the produced source tables; these
are: @samp{-d}, @samp{-g} and, particularly, @samp{-s}.

@item -k @var{pairs}
@itemx --known=@var{pairs}
@opindex -k
@opindex --known=
@cindex unknown charsets
@cindex guessing charsets
@cindex charsets, guessing
This particular option is meant to help identify an unknown charset,
using as hints some already identified characters of the charset.  Some
examples will help introduce the idea.

Let's presume here that @code{recode} is run in an ISO-8859-1 locale, and
that @code{DEFAULT_CHARSET} is unset in the environment.
Suppose you have guessed that code 130 (decimal) of the unknown charset
represents a lower case @samp{e} with an acute accent.  That is to say
that this code should map to code 233 (decimal) in the usual charset.
By executing:

@example
recode -k 130:233
@end example

@noindent
you should obtain a listing similar to:

@example
AtariST atarist
CWI cphu cwi cwi2
IBM437 437 cp437 ibm437
IBM850 850 cp850 ibm850
IBM851 851 cp851 ibm851
IBM852 852 cp852 ibm852
IBM857 857 cp857 ibm857
IBM860 860 cp860 ibm860
IBM861 861 cp861 cpis ibm861
IBM863 863 cp863 ibm863
IBM865 865 cp865 ibm865
@end example

You can give more than one clue at once, to restrict the list further.
Suppose you have @emph{also} guessed that code 211 of the unknown
charset represents an upper case @samp{E} with diaeresis, that is, code
203 in the usual charset.  By requesting:

@example
recode -k 130:233,211:203
@end example

@noindent
you should obtain:

@example
IBM850 850 cp850 ibm850
IBM852 852 cp852 ibm852
IBM857 857 cp857 ibm857
@end example

The usual charset may be overridden by specifying one non-option argument.
For example, to request the list of charsets for which code 130 maps to
code 142 for the Macintosh, you may ask:

@example
recode -k 130:142 mac
@end example

@noindent
and get:

@example
AtariST atarist
CWI cphu cwi cwi2
IBM437 437 cp437 ibm437
IBM850 850 cp850 ibm850
IBM851 851 cp851 ibm851
IBM852 852 cp852 ibm852
IBM857 857 cp857 ibm857
IBM860 860 cp860 ibm860
IBM861 861 cp861 cpis ibm861
IBM863 863 cp863 ibm863
IBM865 865 cp865 ibm865
@end example

@noindent
which, of course, is identical to the result of the first example, since
the code 142 for the Macintosh is a small @samp{e} with acute.

More formally, option @samp{-k} lists all possible @emph{before}
charsets for the @emph{after} charset given as the sole non-option
argument to @code{recode}, but subject to restrictions given in
@var{pairs}.  If there is no non-option argument, the @emph{after}
charset is taken to be the default charset for this @code{recode}.

The restrictions are given as a comma-separated list of pairs, each pair
consisting of two numbers separated by a colon.  The numbers are taken
as decimal when the initial digit is between @samp{1} and @samp{9};
@samp{0x} starts a hexadecimal number, or else @samp{0} starts an
octal number.  The first number is a code in any @emph{before} charset,
while the second number is a code in the specified @emph{after} charset.
If the first number would not be transformed into the second number by
recoding from some @emph{before} charset to the @emph{after} charset,
then this @emph{before} charset is rejected.  A @emph{before} charset is
listed only if it is not rejected by any pair.  The program will only test
those @emph{before} charsets having a tabular style internal description
(@pxref{Tabular}), and so should be the selected @emph{after} charset.

The produced list is in fact a subset of the list produced by the
option @samp{-l}.  As for option @samp{-l}, the non-option argument
is interpreted as a charset name, possibly abbreviated to any
unambiguous prefix.

@item -l[@var{format}]
@itemx --list[=@var{format}]
@opindex -l
@opindex --list
@cindex listing charsets
@cindex information about charsets
This option asks for information about all charsets, or about one
particular charset.  No file will be recoded.

If there are no non-option arguments, @code{recode} ignores the @var{format}
value of the option and writes a sorted list of charset names on standard
output, one per line.  When a charset name has aliases or synonyms,
they follow the true charset name on its line, sorted from left to right.
Each charset or alias is followed by its implied surfaces, if any.  This list
is over two hundred lines.  It is best used with @samp{grep -i}, as in:

@example
recode -l | grep -i greek
@end example

There might be one non-option argument, in which case it is interpreted
as a charset name, possibly abbreviated to any unambiguous prefix.
This particular usage of the @samp{-l} option is obeyed @emph{only} for
charsets having a tabular style internal description (@pxref{Tabular}).
Even if most charsets have this property, some do not, and the option
@samp{-l} cannot be used to detail these particular charsets.
To know
whether a particular charset can be listed this way, you should merely try
and see if this works.  The @var{format} value of the option is a keyword
from the following list.  Keywords may be abbreviated by dropping suffix
letters, and even reduced to the first letter only:

@table @samp
@item decimal
This format asks for the production on standard output of a concise
tabular display of the charset, in which character code values are
expressed in decimal.

@item octal
This format uses octal instead of decimal in the concise tabular display
of the charset.

@item hexadecimal
This format uses hexadecimal instead of decimal in the concise tabular
display of the charset.

@item full
This format requests an extensive display of the charset on standard output,
using one line per character showing its decimal, hexadecimal, octal and
@code{UCS-2} code values, and also a descriptive comment which should be
the 10646 name for the character.

@vindex LANGUAGE@r{, when listing charsets}
@vindex LANG@r{, when listing charsets}
@cindex French description of charsets
The descriptive comment is given in English and ASCII, yet if the English
description is not available but a French one is, then the French description
is given instead, using @w{Latin-1}.  However, if the @code{LANGUAGE}
or @code{LANG} environment variable begins with the letters @samp{fr},
then listing preference goes to French when both descriptions are available.
@end table

When option @samp{-l} is used together with a @var{charset} argument,
the @var{format} defaults to @code{decimal}.
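
The abbreviation rule for @var{format} keywords can be pictured with a
short Python sketch.  It is only an illustration under stated assumptions,
not code taken from @code{recode}: the name @code{resolve_format} is
hypothetical, and so is the choice of raising an error for a keyword
which is not recognised.

@example
```python
# The four format keywords documented for option -l.
FORMATS = ("decimal", "octal", "hexadecimal", "full")


def resolve_format(keyword=""):
    """Return the full keyword for a possibly abbreviated one.

    Hypothetical helper, written for illustration only.
    """
    # With a charset argument and no format, decimal is the default.
    if not keyword:
        return "decimal"
    # Dropping suffix letters leaves a prefix of the full keyword.
    matches = [name for name in FORMATS if name.startswith(keyword)]
    if len(matches) != 1:
        raise ValueError("unknown or ambiguous format: " + keyword)
    return matches[0]


print(resolve_format("h"))    # hexadecimal
print(resolve_format("oct"))  # octal
```
@end example

Since the four keywords all start with a different letter, a single
letter is always enough to select one of them.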

@item -T
@itemx --find-subsets
@opindex -T
@opindex --find-subsets
@cindex identifying subsets in charsets
@cindex subsets in charsets
This option is a maintainer tool for evaluating the redundancy of those
charsets, in @code{recode}, which are internally represented by a @code{UCS-2}
data table.  After the listing has been produced, the program exits
without doing any recoding.  The output is meant to be sorted, like
this: @w{@samp{recode -T | sort}}.  The option triggers @code{recode} into
comparing all pairs of charsets, seeking those which are subsets of others.
The concept and results are better explained through a few examples.
Consider these three sample lines from @samp{-T} output:

@example
[ 0] IBM891 == IBM903
[ 1] IBM1004 < CP1252
[ 12] INVARIANT < CSA_Z243.4-1985-1
@end example

@noindent
The first line means that @code{IBM891} and @code{IBM903} are completely
identical as far as @code{recode} is concerned, so one is fully redundant
with the other.  The second line says that @code{IBM1004} is wholly
contained within @code{CP1252}, yet there is a single character which is
in @code{CP1252} without being in @code{IBM1004}.  The third line says
that @code{INVARIANT} is wholly contained within @code{CSA_Z243.4-1985-1},
but twelve characters are in @code{CSA_Z243.4-1985-1} without being in
@code{INVARIANT}.  The whole output might most probably be reduced and
made more significant through a transitivity study.
@end table

@node Recoding, Reversibility, Listings, Invoking recode
@section Controlling how files are recoded

The following options have the purpose of giving the user some fine
grain control over the recoding operations themselves.

@table @samp

@item -c
@itemx --colons
@opindex -c
@opindex --colons
@cindex diaeresis
With @code{Texte} Easy French conventions, use the colon @kbd{:}
instead of the double-quote @kbd{"} for marking diaeresis.
@xref{Texte}.

@item -g
@itemx --graphics
@opindex -g
@opindex --graphics
@cindex IBM graphics characters
@cindex box-drawing characters
This option is only meaningful while getting @emph{out} of the
@code{IBM-PC} charset.  In this charset, characters 176 to 223 are used
for constructing rulers and boxes, using simple or double horizontal or
vertical lines.  This option forces the automatic selection of ASCII
characters for approximating these rulers and boxes, at the cost of making
the transformation irreversible.  Option @samp{-g} implies @samp{-f}.

@item -t
@itemx --touch
@opindex -t
@opindex --touch
@cindex time stamps of files
@cindex file time stamps
The @emph{touch} option is meaningful only when files are recoded over
themselves.  Without it, the time-stamps associated with files are
preserved, to reflect the fact that changing the code of a file does not
really alter its informational contents.  When the user wants the
recoded files to be time-stamped at the recoding time, this option
inhibits the automatic protection of the time-stamps.

@item -v
@itemx --verbose
@opindex -v
@opindex --verbose
@cindex verbose operation
@cindex details about recoding
@cindex recoding details
@cindex quality of recoding
Before doing any recoding, the program will first print on the @code{stderr}
stream the list of all intermediate charsets planned for recoding, starting
with the @var{before} charset and ending with the @var{after} charset.
It also prints an indication of the recoding quality, as one of the words
@samp{reversible}, @samp{one to one}, @samp{one to many},
@samp{many to one} or @samp{many to many}.

This information will appear once or twice.  It is shown a second time
only when the optimisation and step merging phase succeeds in replacing
many single steps by a new one.

This option also has a second effect.  The program will print on
@code{stderr} one message per recoded @var{file}, so as to keep the user
informed of the progress of the command.

An easy way to know beforehand the sequence or quality of a recoding is
by using a command such as:

@example
recode -v @var{before}..@var{after} < /dev/null
@end example

@noindent
using the fact that, in @code{recode}, an empty input file produces
an empty output file.

@item -x @var{charset}
@itemx --ignore=@var{charset}
@opindex -x
@opindex --ignore
@cindex ignore charsets
@cindex recoding path, rejection
This option tells the program to ignore any recoding path through the
specified @var{charset}, thus disabling any single step using this charset
as a start or end point.  This may be used when the user wants to force
@code{recode} into using an alternate recoding path (yet using chained
requests offers a finer control, @pxref{Requests}).

@var{charset} may be abbreviated to any unambiguous prefix.
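
The effect of ignoring a charset on the choice of a recoding path can be
pictured with the Python sketch below.  Everything in it is an assumption
made for illustration: the charsets @code{A} to @code{E} and the steps
between them are invented, the function @code{find_path} is hypothetical,
and preferring the chain with the fewest steps is a simplification which
does not claim to match how @code{recode} really rates its paths.

@example
```python
from collections import deque

# Invented single steps, as (start, end) pairs; not recode's real tables.
STEPS = [("A", "B"), ("B", "D"), ("A", "C"), ("C", "E"), ("E", "D")]


def find_path(before, after, ignored=()):
    """Breadth-first search for the chain having the fewest steps."""
    # Ignoring a charset disables every step starting or ending there.
    usable = [(start, end) for (start, end) in STEPS
              if start not in ignored and end not in ignored]
    queue = deque([[before]])
    seen = set([before])
    while queue:
        path = queue.popleft()
        if path[-1] == after:
            return path
        for start, end in usable:
            if start == path[-1] and end not in seen:
                seen.add(end)
                queue.append(path + [end])
    return None  # no chain of usable steps exists


print(find_path("A", "D"))         # ['A', 'B', 'D']
print(find_path("A", "D", ["B"]))  # ['A', 'C', 'E', 'D']
```
@end example

In this toy graph, ignoring @code{B} forces the longer route through
@code{C} and @code{E}, which is the kind of detour the option is meant
to provoke.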
@end table

@node Reversibility, Sequencing, Recoding, Invoking recode
@section Reversibility issues

The following options are somewhat related to reversibility issues:

@table @samp
@item -f
@itemx --force
@opindex -f
@opindex --force
@cindex force recoding
@cindex irreversible recoding
With this option, irreversible or otherwise erroneous recodings are run
to completion, and @code{recode} does not exit with a non-zero status if
it would be only because irreversibility matters.  @xref{Reversibility}.

Without this option, @code{recode} tries to protect you against recoding
a file irreversibly over itself@footnote{There are still some cases of
ambiguous output which are rather difficult to detect, and for which
the protection is not active.}.  Whenever an irreversible recoding is
met, or any other recoding error, @code{recode} produces a warning on
standard error.  The current input file does not get replaced by its
recoded version, and @code{recode} then proceeds with the recoding of
the next file.

When the program is merely used as a filter, standard output will have
received a partially recoded copy of standard input, up to the first
error point.  After all recodings have been done or attempted, and if
some recoding has been aborted, @code{recode} exits with a non-zero status.

In releases of @code{recode} prior to version 3.5, this option was always
selected, so it was rather meaningless.  Nevertheless, users were invited
to start using @samp{-f} right away in scripts calling @code{recode}
whenever convenient, in preparation for the current behaviour.

@item -q
@itemx --quiet
@itemx --silent
@opindex -q
@opindex --quiet
@opindex --silent
@cindex suppressing diagnostic messages
@cindex error messages, suppressing
@cindex silent operation
This option has the sole purpose of inhibiting warning messages about
irreversible recodings, and other such diagnostics.  It has no other
effect; in particular, it does @emph{not} prevent recodings from being
aborted, nor @code{recode} from returning a non-zero exit status when
irreversible recodings are met.

This option is set automatically for the child processes when
@code{recode} splits itself into several collaborating copies.  This way,
the diagnostic is issued only once, by the parent.  See option @samp{-p}.

@item -s
@itemx --strict
@opindex -s
@opindex --strict
@cindex strict operation
@cindex map filling, disable
@cindex disable map filling
By using this option, the user requests that @code{recode} be very strict
while recoding a file, merely losing in the transformation any character
which is not explicitly mapped from a charset to another.  Such a loss is
not reversible and so, will bring @code{recode} to fail, unless the option
@samp{-f} is also given as a kind of counter-measure.

Using @samp{-s} without @samp{-f} might render the @code{recode} program
very susceptible to the slightest file abnormalities.  Despite the fact
that it might be irritating to some users, such paranoia is sometimes
wanted and useful.
@end table

@cindex reversibility of recoding
Even if @code{recode} tries hard to keep the recodings reversible,
you should not develop an unconditional confidence in its ability to
do so.  You @emph{ought} to keep only reasonable expectations about
reverse recodings.
In particular, consider:

@itemize @bullet
@item
Most transformations are fully reversible for all inputs, but lose this
property whenever @samp{-s} is specified.

@item
A few transformations are not meant to be reversible, by design.

@item
Reversibility sometimes depends on actual file contents and cannot
be ascertained beforehand, without reading the file.

@item
Reversibility is never absolute across successive versions of this
program.  Even correcting a small bug in a mapping could induce slight
discrepancies later.

@item
Reversibility is easily lost by merging.  This is best explained through
an example.  If you reversibly recode a file from charset @var{A} to
charset @var{B}, then you reversibly recode the result from charset
@var{B} to charset @var{C}, you cannot expect to recover the original
file by merely recoding from charset @var{C} directly to charset @var{A}.
You will instead have to recode from charset @var{C} back to charset
@var{B}, and only then from charset @var{B} to charset @var{A}.

@item
Faulty files create a particular problem.  Consider an example, recoding
from @code{IBM-PC} to @code{Latin-1}.  End of lines are represented as
@samp{\r\n} in @code{IBM-PC} and as @samp{\n} in @code{Latin-1}.  There
is no way by which a faulty @code{IBM-PC} file containing a @samp{\n}
not preceded by @samp{\r} can be translated into a @code{Latin-1} file,
and then back.

@item
There is another difficulty arising from code equivalences.  For
example, in a @code{LaTeX} charset file, the string @samp{\^\i@{@}}
could be recoded back and forth through another charset and become
@samp{\^@{\i@}}.  Even if the resulting file is equivalent to the
original one, it is not identical.
@end itemize

@cindex map filling
Unless option @samp{-s} is used, @code{recode} automatically tries to
fill mappings with invented correspondences, often making them fully
reversible.  This filling is not made at random.  The algorithm tries to
stick to the identity mapping and, when this is not possible, it prefers
generating many small permutation cycles, each involving only a few
codes.

For example, here is how @code{IBM-PC} code 186 gets translated to
@kbd{control-U} in @code{Latin-1}.  @kbd{Control-U} is 21.  Code 21 is the
@code{IBM-PC} section sign, which is 167 in @code{Latin-1}.  @code{recode}
cannot reciprocate 167 to 21, because 167 is the masculine ordinal indicator
within @code{IBM-PC}, which is 186 in @code{Latin-1}.  Code 186 within
@code{IBM-PC} has no @code{Latin-1} equivalent; by assigning it back to 21,
@code{recode} closes this short permutation loop.

As a consequence of this map filling, @code{recode} may sometimes produce
@emph{funny} characters.  They may look annoying, but they are nevertheless
helpful when you change your mind and want to revert to the prior
recoding.  If you cannot stand these, use option @samp{-s}, which asks
for a very strict recoding.

This map filling sometimes has a few surprising consequences, which
some users wrongly interpreted as bugs.  Here are two examples.

@enumerate
@item
In some cases, @code{recode} seems to copy a file without recoding it.
But in fact, it does.  Consider a request:

@example
recode l1..us < File-Latin1 > File-ASCII
cmp File-Latin1 File-ASCII
@end example

@noindent
then @code{cmp} will not report any difference.  This is quite normal.
@w{@code{Latin-1}} is correctly recoded to ASCII for the characters both
charsets have in common (which are the first 128 characters, in this case).
The remaining last
128 @w{@code{Latin-1}} characters have no ASCII counterpart.  Instead
of losing them, @code{recode} elects to map them to unspecified characters
of ASCII, so making the recoding reversible.  The simplest way of achieving
this is merely to keep those last 128 characters unchanged.  The overall
effect is copying the file verbatim.

If you feel this behaviour is too generous and if you do not wish to
care about reversibility, simply use option @samp{-s}.  By doing so,
@code{recode} will strictly map only those @w{@code{Latin-1}} characters
which have an ASCII equivalent, and will merely drop those which do not.
Then, there is more chance that you will observe a difference between the
input and the output file.

@item
Recoding the wrong way could sometimes give the false impression that
recoding has @emph{almost} been done properly.  Consider the requests:

@example
recode 437..l1 < File-Latin1 > Temp1
recode 437..l1 < Temp1 > Temp2
@end example

@noindent
so declaring wrongly @file{File-Latin1} to be an IBM-PC file, and
recoding to @code{Latin-1}.  This is surely ill defined and not meaningful.
Yet, if you repeat this step a second time, you might notice that
many (not all) characters in @file{Temp2} are identical to those in
@file{File-Latin1}.  Sometimes, people try to discover how @code{recode}
works by experimenting a little at random, rather than reading and
understanding the documentation; results such as this are surely confusing,
as they provide those people with a false feeling that they understood
something.

Reversible codings have this property that, if applied several times
in the same direction, they will eventually bring any character back
to its original value.
Since @code{recode} seeks small permutation
cycles when creating reversible codings, besides characters unchanged
by the recoding, most permutation cycles will be of length 2, and
fewer of length 3, etc.  So, it is to be expected that applying the
recoding twice in the same direction will recover most characters,
but will fail to recover those participating in permutation cycles of
length 3.  On the other hand, recoding six times in the same direction
would recover all characters in cycles of length 1, 2, 3 or 6.
@end enumerate

@node Sequencing, Mixed, Reversibility, Invoking recode
@section Selecting sequencing methods

@cindex sequencing
This program uses a few techniques when it is discovered that many
passes are needed to comply with the @var{request}.  For example,
suppose that four elementary steps were selected at recoding path
optimisation time.  Then @code{recode} will split itself into four
different interconnected tasks, logically equivalent to:

@example
@var{step1} <@var{input} | @var{step2} | @var{step3} | @var{step4} >@var{output}
@end example

The splitting into subtasks is often done using Unix pipes.
But the splitting may also be completely avoided, and rather
simulated by using memory buffers, or intermediate files.  The various
@samp{--sequence=@var{strategy}} options give you control over the flow
methods, by replacing @var{strategy} with @samp{memory}, @samp{pipe}
or @samp{files}.  So, these options may be used to override the default
behaviour, which is also explained below.

@table @samp
@item --sequence=memory
@opindex --sequence
@cindex memory sequencing
When the recoding requires a combination of two or more elementary
recoding steps, this option forces many passes over the data, using
in-memory buffers to hold all intermediary results.
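
The idea of making several passes over in-memory buffers can be sketched
in Python as follows.  This is only an illustration, not the way
@code{recode} is written: the names @code{run_in_memory},
@code{strip_crlf} and @code{pc_to_latin1} are hypothetical, and the two
steps rely on Python's own @code{cp437} and @code{latin-1} codecs as
stand-ins for the @code{IBM-PC} and @code{Latin-1} charsets.

@example
```python
def strip_crlf(data):
    """First step: remove the CR-LF surface."""
    return data.replace(b"\r\n", b"\n")


def pc_to_latin1(data):
    """Second step: recode the codes themselves."""
    return data.decode("cp437").encode("latin-1")


def run_in_memory(steps, data):
    """Apply each step in turn, keeping every result in memory."""
    buffer = data
    for step in steps:
        # One full pass over the data; the previous buffer is replaced.
        buffer = step(buffer)
    return buffer


# Code 130 is an e with acute accent here, which is code 233 in Latin-1.
print(run_in_memory([strip_crlf, pc_to_latin1], b"\x82\r\n"))
```
@end example

Each step only starts once the previous one has consumed all of its
input, which is what sets this strategy apart from steps running in
parallel through pipes.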
@c This should be the default behaviour when
@c files to be recoded are @emph{small} enough.

@item -i
@itemx --sequence=files
@opindex -i
@cindex file sequencing
When the recoding requires a combination of two or more elementary
recoding steps, this option forces many passes over the data, using
intermediate files between passes.  This is the default behaviour when
files are recoded over themselves.  If this option is selected in filter
mode, that is, when the program reads standard input and writes standard
output, it might take longer for programs further down the pipe chain to
start receiving some recoded data.

@item -p
@itemx --sequence=pipe
@opindex -p
@cindex pipe sequencing
When the recoding requires a combination of two or more elementary
recoding steps, this option forces the program to fork itself into a few
copies interconnected with pipes, using the @code{pipe(2)} system call.
All copies of the program operate in parallel.  This is the default
behaviour in filter mode.  If this option is used when files are recoded
over themselves, this should also save disk space because some temporary
files might not be needed, at the cost of more system overhead.

If, at installation time, the @code{pipe(2)} call is said to be
unavailable, selecting option @samp{-p} is equivalent to selecting
option @samp{-i}.  (This happens, for example, on MS-DOS systems.)
@end table

@node Mixed, Emacs, Sequencing, Invoking recode
@section Using mixed charset input

In real life and practice, textual files are often made up of many charsets
at once.  Some parts of the file encode one charset, while other parts
encode another charset, and so forth.  Usually, a file does not toggle
between more than two or three charsets.  The means to distinguish
which charsets are encoded at various places is not always available.
The @code{recode} program is able to handle only a few simple cases
of mixed input.

The default @code{recode} behaviour is to expect pure charset files, to
be recoded as other pure charset files.  However, the following options
allow for a few precise kinds of mixed charset files.

@ignore
Some notes on transliteration and substitution.

Transliteration is still much study, discussion and work to come, but
when generic transliteration will be added in @code{recode}, it will be
added @emph{through} the @code{recode} library.

However, I agree that it might be *convenient* that the `latin1..fi'
conversion works by letting all ASCII characters through, but then, the
result would be a mix of ASCII and `fi', it would not be pure `fi' anymore.
It would be convenient because, in practice, people might write programs in
ASCII, keeping comments or strings directly in `fi', all in the same file.
The original files are indeed mixed, and people sometimes expect that
`recode' will do mixed conversions.

A conversion does not become *right* because it is altered to be more
convenient.  And recode is not *wrong* because it does not offer some
conveniences people would like to have.  As long as `recode' main job is
producing `fi', then '[' is just not representable in `fi', and recode is
rather right in not letting `[' through.  It has to do something special
about it.  The character might be thrown away, transliterated or replaced
by a substitute, or mapped to some other code for reversibility purposes.

Transliteration or substitution are currently not implemented in `recode',
yet for the last few years, I've been saving documentation about these
phenomena.  The transliteration which you are asking for, here, is that the
'[' character in @w{Latin-1}, for example, be transliterated to A-umlaut in
`fi', which is a bit non-meaningful.
Remember, there is no `[' in `fi'.
@end ignore

@table @samp
@item -d
@itemx --diacritics
@opindex -d
@opindex --diacritics
@cindex convert a subset of characters
@cindex partial conversion
While converting to or from one of @code{HTML} or @code{LaTeX}
charset, limit conversion to some subset of all characters.
For @code{HTML}, limit conversion to the subset of all non-ASCII
characters.  For @code{LaTeX}, limit conversion to the subset of all
non-English letters.  This is particularly useful, for example, when
people create what would be valid @code{HTML}, @TeX{} or La@TeX{}
files, if only they were using provided sequences for applying
diacritics instead of using the diacriticised characters directly
from the underlying character set.

While converting to @code{HTML} or @code{LaTeX} charset, this option
assumes that characters not in the said subset are properly coded
or protected already; @code{recode} then transmits them literally.
While converting the other way, this option prevents translating back
coded or protected versions of characters not in the said subset.
@xref{HTML}.  @xref{LaTeX}.

@ignore
@item -M
@itemx --message
@opindex -M
@opindex --message
Option @samp{-M} would be for messages, it would ideally process
@w{RFC 1522} inserts
in ASCII headers, converting them to the goal code, rewriting some MIME
header line too, and stopping its special work at the first empty line.
A special combination of both capabilities would be for the recoding of
PO files, in which the header, and @code{msgid} and @code{msgstr} strings, might
all use different charsets.  Recoding some PO files currently looks like
a nightmare, which I would like @code{recode} to repair.
@end ignore

@item -S[@var{language}]
@itemx --source[=@var{language}]
@opindex -S
@opindex --source
@cindex convert strings and comments
@cindex string and comments conversion
The bulk of the input file is expected to be written in @code{ASCII},
except for parts, like comments and string constants, which are written
using another charset than @code{ASCII}.  When @var{language} is @samp{c},
the recoding will proceed only with the contents of comments or strings,
while everything else will be copied without recoding.  When @var{language}
is @samp{po}, the recoding will proceed only within translator comments
(those having whitespace immediately following the initial @samp{#})
and with the contents of @code{msgstr} strings.

For the above things to work, the non-@code{ASCII} encoding of the comment
or string should be such that an @code{ASCII} scan will successfully find
where the comment or string ends.

Even if @code{ASCII} is the usual charset for writing programs, some
compilers are able to directly read other charsets, like @code{UTF-8}, say.
There is currently no provision in @code{recode} for reading mixed charset
sources which are not based on @code{ASCII}.  It is probable that the need
for mixed recoding is not as pressing in such cases.

For example, after one does:

@example
recode -Spo pc/..u8 < @var{input}.po > @var{output}.po
@end example

@noindent
file @file{@var{output}.po} holds a copy of @file{@var{input}.po} in which
@emph{only} translator comments and the contents of @code{msgstr} strings
have been recoded from the @code{IBM-PC} charset to pure @code{UTF-8},
without attempting conversion of end-of-lines.  Machine generated comments
and original @code{msgid} strings are not to be touched by this recoding.

If @var{language} is not specified, @samp{c} is assumed.
@end table

@node Emacs, Debugging, Mixed, Invoking recode
@section Using @code{recode} within Emacs

The fact that @code{recode} is a filter makes it quite easy to use from
within GNU Emacs.  For example, recoding the whole buffer from
the @code{IBM-PC} charset to current charset (@w{@code{Latin-1}} on
Unix) is easily done with:

@example
C-x h C-u M-| recode ibmpc RET
@end example

@noindent
@samp{C-x h} selects the whole buffer, and @samp{C-u M-|} filters and
replaces the current region through the given shell command.  Here is
another example, binding the key @w{@samp{C-c T}} to the recoding of
the current region from Easy French to @w{@code{Latin-1}} (on Unix) and the key
@w{@samp{C-u C-c T}} from @w{@code{Latin-1}} (on Unix) to Easy French:

@example
(global-set-key "\C-cT" 'recode-texte)

(defun recode-texte (flag)
  (interactive "P")
  (shell-command-on-region
   (region-beginning) (region-end)
   (concat "recode " (if flag "..txte" "txte")) t)
  (exchange-point-and-mark))
@end example

@node Debugging, , Emacs, Invoking recode
@section Debugging considerations

It is our experience that when @code{recode} does not provide satisfying
results, either @code{recode} was not called properly, correct results
raised some doubts nevertheless, or files to recode were somewhat mangled.
Genuine bugs are surely possible.

Unless you already are a @code{recode} expert, it might be a good idea to
quickly revisit the tutorial (@pxref{Tutorial}) or the prior sections in this
chapter, to make sure that you properly formatted your recoding request.
In the case you intended to use @code{recode} as a filter, make sure that you
did not forget to redirect your standard input (through using the @kbd{<}
symbol in the shell, say).
Some @code{recode} false mysteries are also
easily explained (@pxref{Reversibility}).

For the other cases, some investigation is needed.  To illustrate how to
proceed, let's presume that you want to recode the @file{nicepage} file,
coded @code{UTF-8}, into @code{HTML}.  The problem is that the command
@samp{recode u8..h nicepage} yields:

@example
recode: Invalid input in step `UTF-8..ISO-10646-UCS-2'
@end example

One good trick is to use @code{recode} in filter mode instead of in file
replacement mode (@pxref{Synopsis}).  Another good trick is to use the
@samp{-v} option asking for a verbose description of the recoding steps.
We could rewrite our recoding call as @samp{recode -v u8..h <nicepage},
to get something like:

@example
Request: UTF-8..:libiconv:..ISO-10646-UCS-2..HTML_4.0
Shrunk to: UTF-8..ISO-10646-UCS-2..HTML_4.0
[@dots{}@var{some output}@dots{}]
recode: Invalid input in step `UTF-8..ISO-10646-UCS-2'
@end example

This might help you to better understand what the diagnostic means.  The
recoding request is achieved in two steps: the first recodes @code{UTF-8}
into @code{UCS-2}, the second recodes @code{UCS-2} into @code{HTML}.
The problem occurs within the first of these two steps, and since the
input of this step is the input file given to @code{recode}, it is
this overall input file which seems to be invalid.  Also, when used in
filter mode, @code{recode} processes as much input as possible before the
error occurs and sends the result of this processing to standard output.
Since the standard output has not been redirected to a file, it is merely
displayed on the user screen.  By inspecting the end of the resulting
@code{HTML} output, that is, what was recoded just before the recoding
was interrupted, you may infer where the error lies in the real
@code{UTF-8} input file.
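
When the tail of the @code{HTML} output is not enough to pinpoint the
damage, a small helper can report the byte offset of the first
ill-formed sequence.  The following is only a sketch: the function name
@code{first_invalid_utf8} is hypothetical, it is not part of
@code{recode}, and it applies the strict rules of @w{RFC 3629}, which
may differ from what the @code{recode} decoder itself accepts or rejects.

```c
#include <assert.h>
#include <stddef.h>

/* Return the byte offset of the first ill-formed UTF-8 sequence in
   DATA, or LENGTH when the whole buffer is well formed.  Hypothetical
   helper, strict RFC 3629 rules; not part of the recode library.  */
size_t
first_invalid_utf8 (const unsigned char *data, size_t length)
{
  size_t i = 0;

  assert (data != NULL || length == 0);
  while (i < length)
    {
      unsigned char lead = data[i];
      size_t extra;             /* continuation bytes expected */
      unsigned long min;        /* smallest value allowed at this length */
      unsigned long value;
      size_t k;

      if (lead < 0x80)
        {
          i++;                  /* plain ASCII byte */
          continue;
        }
      else if (lead >= 0xC2 && lead <= 0xDF)
        {
          extra = 1; min = 0x80; value = lead & 0x1F;
        }
      else if (lead >= 0xE0 && lead <= 0xEF)
        {
          extra = 2; min = 0x800; value = lead & 0x0F;
        }
      else if (lead >= 0xF0 && lead <= 0xF4)
        {
          extra = 3; min = 0x10000; value = lead & 0x07;
        }
      else
        return i;               /* stray continuation or bad lead byte */

      if (length - i <= extra)
        return i;               /* sequence truncated by end of data */
      for (k = 1; k <= extra; k++)
        {
          if ((data[i + k] & 0xC0) != 0x80)
            return i;           /* not a continuation byte */
          value = (value << 6) | (data[i + k] & 0x3F);
        }
      if (value < min || value > 0x10FFFF
          || (value >= 0xD800 && value <= 0xDFFF))
        return i;               /* overlong, out of range, or surrogate */
      i += extra + 1;
    }
  return length;
}
```

A typical culprit is a stray @w{Latin-1} byte such as @code{0xE9} in an
otherwise @code{UTF-8} file; reading the file into memory and calling the
helper on it gives the offset to inspect with an editor or a hex dumper.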

If you have the proper tools to examine the intermediate recoding data,
you might also prefer to reduce the problem to a single step to better
study it.  This is what I usually do.  For example, the last @code{recode}
call above is more or less equivalent to:

@example
recode -v UTF-8..ISO_10646-UCS-2 <nicepage >temporary
recode -v ISO_10646-UCS-2..HTML_4.0 <temporary
rm temporary
@end example

If you know that the problem is within the first step, you might prefer to
concentrate on using the first @code{recode} line.  If you know that the
problem is within the second step, you might execute the first @code{recode}
line once and for all, and then play with the second @code{recode} call,
repeatedly using the @file{temporary} file created once by the first call.

Note that the @samp{-f} switch may be used to force the production of
@code{HTML} output despite invalid input.  This might be satisfying enough
for you, and easier than repairing the input file.  That depends on how
strict you would like to be about the precision of the recoding process.

If you later see that your HTML file begins with @samp{&lt;html&gt;} when
you expected @samp{<html>}, then @code{recode} might have done a bit more
than you wanted.  In this case, your input file was half-@code{UTF-8},
half-@code{HTML} already, that is, a mixed file (@pxref{Mixed}).  There is a
special @code{-d} switch for this case.  So, you might end up calling
@samp{recode -fd u8..h nicepage}.  Until you are quite sure that you accept
overwriting your input file whatever happens, I recommend that you stick with
filter mode.

If, after such experiments, you seriously think that the @code{recode}
program does not behave properly, there might be a genuine bug in the
program itself, in which case I invite you to contribute a bug report
(@pxref{Contributing}).

@node Library, Universal, Invoking recode, Top
@chapter A recoding library

@cindex recoding library
The program named @code{recode} is just an application of its recoding
library.  The recoding library is available separately for other C
programs.  A good way to acquire some familiarity with the recoding
library is to get acquainted with the @code{recode} program itself.

To use the recoding library once it is installed, a C program needs to
have a line:

@example
#include <recode.h>
@end example

@noindent
near its beginning, and the user should have @samp{-lrecode} on the
linking call, so modules from the recoding library are found.

The library is still under development.  As it stands, it contains four
identifiable sets of routines: the outer level functions, the request
level functions, the task level functions and the charset level functions.
They are discussed in separate sections.

For effectively using the recoding library in most applications, it should
rarely be necessary to study anything beyond the main initialisation function
at outer level, and then, various functions at request level.

@menu
* Outer level::         Outer level functions
* Request level::       Request level functions
* Task level::          Task level functions
* Charset level::       Charset level functions
* Errors::              Handling errors
@end menu

@node Outer level, Request level, Library, Library
@section Outer level functions

@cindex outer level functions
The outer level functions mainly prepare the whole recoding library for
use, or do actions which are unrelated to specific recodings.  Here is
an example of a program which does not really do anything useful.

@example
@group
#include <stdlib.h>
#include <stdbool.h>
#include <recode.h>

const char *program_name;

int
main (int argc, char *const *argv)
@{
  program_name = argv[0];
  RECODE_OUTER outer = recode_new_outer (true);

  recode_delete_outer (outer);
  exit (0);
@}
@end group
@end example

@vindex RECODE_OUTER structure
The header file @code{<recode.h>} declares an opaque @code{RECODE_OUTER}
structure, which the programmer should use for allocating a variable in
his program (let's assume the programmer is a male, here, no prejudice
intended).  This @samp{outer} variable is given as a first argument to
all outer level functions.

@cindex @code{stdbool.h} header
@cindex @code{bool} data type
The @code{<recode.h>} header file uses the Boolean type set up by the
system header file @code{<stdbool.h>}.  But this header file is still
fairly new in C standards, and likely does not exist everywhere.  If your
system does not offer this system header file yet, the proper compilation
of the @code{<recode.h>} file could be guaranteed through the replacement
of the @code{<stdbool.h>} inclusion line by:

@example
typedef enum @{false = 0, true = 1@} bool;
@end example

People wanting wider portability, or Autoconf lovers, might arrange their
@file{configure.in} for being able to write something more general, like:

@example
@group
#if STDC_HEADERS
# include <stdlib.h>
#endif

/* Some systems do not define EXIT_*, even with STDC_HEADERS.  */
#ifndef EXIT_SUCCESS
# define EXIT_SUCCESS 0
#endif
#ifndef EXIT_FAILURE
# define EXIT_FAILURE 1
#endif
/* The following test is to work around the gross typo in systems like Sony
   NEWS-OS Release 4.0C, whereby EXIT_FAILURE is defined to 0, not 1.
   */
#if !EXIT_FAILURE
# undef EXIT_FAILURE
# define EXIT_FAILURE 1
#endif

#if HAVE_STDBOOL_H
# include <stdbool.h>
#else
typedef enum @{false = 0, true = 1@} bool;
#endif

#include <recode.h>

const char *program_name;

int
main (int argc, char *const *argv)
@{
  program_name = argv[0];
  RECODE_OUTER outer = recode_new_outer (true);

  recode_delete_outer (outer);
  exit (EXIT_SUCCESS);
@}
@end group
@end example

@noindent
but we will not insist on such details in the examples to come.

@itemize @bullet
@item Initialisation functions
@cindex initialisation functions, outer

@example
RECODE_OUTER recode_new_outer (@var{auto_abort});
bool recode_delete_outer (@var{outer});
@end example

@findex recode_new_outer
@findex recode_delete_outer
The recoding library absolutely needs to be initialised before being used,
and @code{recode_new_outer} has to be called once, first.  The function
returns the @var{outer} it initialises, and accepts a Boolean argument
telling whether or not the library should automatically issue diagnostics
on standard error and abort the whole program on errors.  When @var{auto_abort}
is @code{true}, the library later conveniently issues diagnostics itself,
and aborts the calling program on errors.  This is merely a convenience,
because when this parameter is @code{false}, the calling program should always
take care of checking the return value of all other calls to the recoding
library functions, and when any error is detected, issue a diagnostic and
abort processing itself.

Regardless of the setting of @var{auto_abort}, all recoding library
functions return a success status.  Most functions are geared for returning
@code{false} for an error, and @code{true} if everything went fine.
Functions returning structures or strings return @code{NULL} instead
of the result, when the result cannot be produced.  If @var{auto_abort}
is selected, functions either return @code{true}, or do not return at all.

As in the example above, @code{recode_new_outer} is called only once in
most cases.  Calling @code{recode_new_outer} implies some overhead, so
calling it more than once should preferably be avoided.

The termination function @code{recode_delete_outer} reclaims the memory
allocated by @code{recode_new_outer} for a given @var{outer} variable.
Calling @code{recode_delete_outer} prior to program termination is more
aesthetic than useful, as all memory resources are automatically reclaimed
when the program ends.  You may spare this terminating call if you prefer.

@item The @code{program_name} declaration

@cindex @code{program_name} variable
As we just explained, the user may set the @code{recode} library so that,
in case of error, it issues the diagnostic itself and aborts the
whole processing.  This capability may be quite convenient.  When this
feature is used, the aborting routine includes the name of the running
program in the diagnostic.  On the other hand, when this feature is not
used, the library merely returns error codes, giving the library user fuller
control over all this.  This behaviour is more like what usual libraries
do: they return codes and never abort.  However, I would rather not force
library users to necessarily check all return codes themselves, by leaving
no other choice.  In most simple applications, letting the library diagnose
and abort is much easier, and quite welcome.  It is precisely because
both possibilities exist that the @code{program_name} variable is needed: it
may be used by the library @emph{when} the user sets it to diagnose itself.
@end itemize

@node Request level, Task level, Outer level, Library
@section Request level functions

@cindex request level functions
The request level functions are meant to cover most recoding needs
programmers may have; they should provide all usual functionality.
Their API is almost stable by now.  To get started with request level
functions, here is a full example of a program whose sole job is to filter
@code{ibmpc} code on its standard input into @code{latin1} code on its
standard output.

@example
@group
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <recode.h>

const char *program_name;

int
main (int argc, char *const *argv)
@{
  program_name = argv[0];
  RECODE_OUTER outer = recode_new_outer (true);
  RECODE_REQUEST request = recode_new_request (outer);
  bool success;

  recode_scan_request (request, "ibmpc..latin1");

  success = recode_file_to_file (request, stdin, stdout);

  recode_delete_request (request);
  recode_delete_outer (outer);

  exit (success ? 0 : 1);
@}
@end group
@end example

@vindex RECODE_REQUEST structure
The header file @code{<recode.h>} declares a @code{RECODE_REQUEST} structure,
which the programmer should use for allocating a variable in his program.
This @var{request} variable is given as a first argument to all request
level functions, and in most cases, may be considered as opaque.

@itemize @bullet
@item Initialisation functions
@cindex initialisation functions, request

@example
RECODE_REQUEST recode_new_request (@var{outer});
bool recode_delete_request (@var{request});
@end example

@findex recode_new_request
@findex recode_delete_request
No @var{request} variable may be used in other request level
functions of the recoding library before having been initialised by
@code{recode_new_request}.  There may be many such @var{request}
variables, in which case, they are independent of one another and
they all need to be initialised separately.  To avoid memory leaks, a
@var{request} variable should not be initialised a second time without
calling @code{recode_delete_request} to ``un-initialise'' it.

As for @code{recode_delete_outer}, calling @code{recode_delete_request}
prior to program termination, in the example above, may be left out.

@item Fields of @code{struct recode_request}
@vindex recode_request structure

Here are the fields of a @code{struct recode_request} which may be
meaningfully changed, once a @var{request} has been initialised by
@code{recode_new_request}, but before it gets used.  In practice, these
fields seldom need to be changed.  To access the fields,
you need to include @file{recodext.h} @emph{instead} of @file{recode.h},
in which case there also is a greater chance that you need to recompile
your programs if a new version of the recoding library gets installed.

@table @code
@item verbose_flag
@vindex verbose_flag
This field is initially @code{false}.  When set to @code{true}, the
library will echo to stderr the sequence of elementary recoding steps
needed to achieve the requested recoding.

@item diaeresis_char
@vindex diaeresis_char
This field is initially the ASCII value of a double quote @kbd{"},
but it may also be the ASCII value of a colon @kbd{:}.  In @code{texte}
charset, some countries use double quotes to mark diaeresis, while other
countries prefer colons.  This field contains the diaeresis character
for the @code{texte} charset.

@item make_header_flag
@vindex make_header_flag
This field is initially @code{false}.  When set to @code{true}, it
indicates that the program is merely trying to produce a recoding table in
source form rather than completing any actual recoding.  In such a case,
the optimisation of step sequence can be attempted much more aggressively.
If the step sequence cannot be reduced to a single step, table production
will fail.

@item diacritics_only
@vindex diacritics_only
This field is initially @code{false}.  For the @code{HTML} and @code{LaTeX}
charsets, it is often convenient to recode the diacriticised characters
only, while just not recoding other HTML code using ampersands or angular
brackets, or La@TeX{} code using backslashes.  Set the field to @code{true}
for getting this behaviour.  In the other charset, one can edit text as
well as HTML or La@TeX{} directives.

@item ascii_graphics
@vindex ascii_graphics
This field is initially @code{false}, and relates to characters 176 to
223 in the @code{ibmpc} charset, which are used to draw boxes.  When set
to @code{true}, while getting out of @code{ibmpc}, ASCII characters are
selected so as to graphically approximate these boxes.
@end table

@item Study of request strings

@example
bool recode_scan_request (@var{request}, "@var{string}");
@end example

@findex recode_scan_request
The main role of a @var{request} variable is to describe a set of
recoding transformations.
Function @code{recode_scan_request} studies
the given @var{string}, and stores an internal representation of it into
@var{request}.  Note that @var{string} may be a full-fledged @code{recode}
request, possibly including surfaces specifications, intermediary
charsets, sequences, aliases or abbreviations (@pxref{Requests}).

The internal representation automatically receives some pre-conditioning
and optimisation, so the @var{request} may then later be used many times
to achieve many actual recodings.  It would be inefficient to call
@code{recode_scan_request} many times with the same @var{string}; it is
better to keep many @var{request} variables instead.

@item Actual recoding jobs

Once the @var{request} variable holds the description of a recoding
transformation, a few functions use it for achieving an actual recoding.
Either input or output of a recoding may be a string, an in-memory buffer,
or a file.

Functions with names like
@code{recode_@var{input-type}_to_@var{output-type}} request an actual
recoding, and are described below.  It is easy to remember which arguments
each function accepts, once you have grasped some simple principles for each
possible @var{type}.  However, one of the recoding functions escapes these
principles and is discussed separately, first.

@example
char *recode_string (@var{request}, @var{string});
@end example

@findex recode_string
The function @code{recode_string} recodes @var{string} according
to @var{request}, and directly returns the resulting recoded string
freshly allocated, or @code{NULL} if the recoding could not succeed for
some reason.  When this function is used, it is the responsibility of
the programmer to ensure that the memory used by the returned string is
later reclaimed.

@findex recode_string_to_buffer
@findex recode_string_to_file
@findex recode_buffer_to_buffer
@findex recode_buffer_to_file
@findex recode_file_to_buffer
@findex recode_file_to_file
@example
bool recode_string_to_buffer (@var{request},
  @var{input_string},
  &@var{output_buffer}, &@var{output_length}, &@var{output_allocated});
bool recode_string_to_file (@var{request},
  @var{input_string},
  @var{output_file});
bool recode_buffer_to_buffer (@var{request},
  @var{input_buffer}, @var{input_length},
  &@var{output_buffer}, &@var{output_length}, &@var{output_allocated});
bool recode_buffer_to_file (@var{request},
  @var{input_buffer}, @var{input_length},
  @var{output_file});
bool recode_file_to_buffer (@var{request},
  @var{input_file},
  &@var{output_buffer}, &@var{output_length}, &@var{output_allocated});
bool recode_file_to_file (@var{request},
  @var{input_file},
  @var{output_file});
@end example

All these functions return a @code{bool} result, @code{false} meaning that
the recoding was not successful, often because of reversibility issues.
The name of each function indicates which type it reads and which
type it produces.  Let's discuss these three types in turn.

@table @asis
@item string

A string is merely an in-memory buffer which is terminated by a @code{NUL}
character (using as many bytes as needed), instead of being described
by a byte length.  For input, a pointer to the buffer is given through
one argument.

Notably, there are no @code{to_string} functions.  Only one
function recodes into a string, and it is @code{recode_string}, which
has already been discussed separately, above.

@item buffer

A buffer is a sequence of bytes held in computer memory.  For input, two
arguments provide a pointer to the start of the buffer and its byte size.
Note that for charsets using many bytes per character, the size is given
in bytes, not in characters.

For output, three arguments provide the addresses of three variables, which
will receive the buffer pointer, the used buffer size in bytes, and the
allocated buffer size in bytes.  If at the time of the call, the buffer
pointer is @code{NULL}, then the allocated buffer size should also be zero,
and the buffer will be allocated afresh by the recoding functions.  However,
if the buffer pointer is not @code{NULL}, it should be already allocated,
and the allocated buffer size then gives its size.  If the allocated size
gets exceeded while the recoding proceeds, the buffer will be automatically
reallocated bigger, probably elsewhere, and the allocated buffer size will
be adjusted accordingly.

The second variable, giving the in-memory buffer size, will receive the
exact byte size which was needed for the recoding.  A @code{NUL} character
is guaranteed at the end of the produced buffer, but is not counted in the
byte size of the recoding.  Beyond that @code{NUL}, there might be some
extra space after the recoded data, extending to the allocated buffer size.

@item file

@findex recode_filter_open@r{, not available}
@findex recode_filter_close@r{, not available}
A file is a sequence of bytes held outside computer memory, but
buffered through it.  For input, one argument provides a pointer to a
file already opened for read.  The file is then read and recoded from its
current position until the end of the file, effectively swallowing it in
memory if the destination of the recoding is a buffer.  For reading a file
filtered through the recoding library, but only a little bit at a time, one
should rather use @code{recode_filter_open} and @code{recode_filter_close}
(these two functions are not yet available).

For output, one argument provides a pointer to a file already opened
for write.  The result of the recoding is written to that file starting
at its current position.
@end table
@end itemize

@findex recode_format_table
The following special function is still subject to change:

@example
void recode_format_table (@var{request}, @var{language}, "@var{name}");
@end example

@noindent
and is not documented for now.

@node Task level, Charset level, Request level, Library
@section Task level functions
@cindex task level functions

The task level functions are used internally by the request level
functions; they allow more explicit control over files and memory
buffers holding input and output to recoding processes.  The interface
specification of task level functions is still subject to change a bit.

To get started with task level functions, here is a full example of a
program whose sole job is to filter @code{ibmpc} code on its standard input
into @code{latin1} code on its standard output.  That is, this program has
the same goal as the one from the previous section, but does its things
a bit differently.

@example
@group
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <recodext.h>

const char *program_name;

int
main (int argc, char *const *argv)
@{
  program_name = argv[0];
  RECODE_OUTER outer = recode_new_outer (false);
  RECODE_REQUEST request = recode_new_request (outer);
  RECODE_TASK task;
  bool success;

  recode_scan_request (request, "ibmpc..latin1");

  task = recode_new_task (request);
  task->input.name = "";
  task->output.name = "";
  success = recode_perform_task (task);

  recode_delete_task (task);
  recode_delete_request (request);
  recode_delete_outer (outer);

  exit (success ?
        0 : 1);
@}
@end group
@end example

@vindex RECODE_TASK structure
The header file @code{<recode.h>} declares a @code{RECODE_TASK}
structure, which the programmer should use for allocating a variable in
his program.  This @code{task} variable is given as a first argument to
all task level functions.  The programmer ought to change and possibly
consult a few fields in this structure, using special functions.

@itemize @bullet
@item Initialisation functions
@cindex initialisation functions, task

@findex recode_new_task
@findex recode_delete_task
@example
RECODE_TASK recode_new_task (@var{request});
bool recode_delete_task (@var{task});
@end example

No @var{task} variable may be used in other task level functions
of the recoding library without having first been initialised with
@code{recode_new_task}.  There may be many such @var{task} variables,
in which case, they are independent of one another and they all need to be
initialised separately.  To avoid memory leaks, a @var{task} variable should
not be initialised a second time without calling @code{recode_delete_task} to
``un-initialise'' it.  The @code{recode_new_task} function accepts a
@var{request} argument and associates the request to the task.  In fact, a
task is essentially
a set of recoding transformations with the specification for its current
input and its current output.

The @var{request} variable may be scanned before or after the call to
@code{recode_new_task}, it does not matter so far.  Immediately after
initialisation, before further changes, the @var{task} variable associates
@var{request} with empty in-memory buffers for both input and output.
The output buffer will later get allocated automatically on the fly,
as needed, by various task processors.

Even if a call to @code{recode_delete_task} is not strictly mandatory
before ending the program, it is cleaner to always include it.  Moreover,
in some future version of the recoding library, it might become required.

@item Fields of @code{struct task_request}
@vindex task_request structure

Here are the fields of a @code{struct task_request} which may be meaningfully
changed, once a @var{task} has been initialised by @code{recode_new_task}.
In fact, fields are expected to change.  Once again, to access the fields,
you need to include @file{recodext.h} @emph{instead} of @file{recode.h},
in which case there also is a greater chance that you need to recompile
your programs if a new version of the recoding library gets installed.

@table @code
@item request

The field @code{request} points to the current recoding request, but may
be changed as needed between recoding calls, for example when there is
a need to achieve the construction of a resulting text made up of many
pieces, each being recoded differently.

@item input.name
@itemx input.file

If @code{input.name} is not @code{NULL} at start of a recoding, this is
a request that a file by that name be first opened for reading and later
automatically closed once the whole file has been read.  If the file name is
not @code{NULL} but an empty string, it means that standard input is to
be used.  The opened file pointer is then held in @code{input.file}.

If @code{input.name} is @code{NULL} and @code{input.file} is not, then
@code{input.file} should point to a file already opened for read, which
is meant to be recoded.

@item input.buffer
@itemx input.cursor
@itemx input.limit

When both @code{input.name} and @code{input.file} are @code{NULL}, three
pointers describe an in-memory buffer containing the text to be recoded.
The buffer extends from @code{input.buffer} to @code{input.limit},
yet the text to be recoded only extends from @code{input.cursor} to
@code{input.limit}. In most situations, @code{input.cursor} starts with
the value that @code{input.buffer} has. (Its value will internally advance
as the recoding proceeds, until it reaches the value of @code{input.limit}.)

@item output.name
@itemx output.file

If @code{output.name} is not @code{NULL} at start of a recoding, this
is a request that a file by that name be opened for writing and later
automatically closed after the recoding is done. If the file name is
not @code{NULL} but an empty string, it means that standard output is to
be used. The opened file pointer is then held in @code{output.file}.
If several passes with intermediate files are needed to produce the
recoding, the @code{output.name} file is opened only for the final pass.

If @code{output.name} is @code{NULL} and @code{output.file} is not, then
@code{output.file} should point to a file already opened for writing, which
will receive the result of the recoding.

@item output.buffer
@itemx output.cursor
@itemx output.limit

When both @code{output.name} and @code{output.file} are @code{NULL}, three
pointers describe an in-memory buffer meant to receive the text, once it
is recoded. The buffer is already allocated from @code{output.buffer}
to @code{output.limit}. In most situations, @code{output.cursor} starts
with the value that @code{output.buffer} has. Once the recoding is done,
@code{output.cursor} will point at the next free byte in the buffer,
just after the recoded text, so another recoding could be called without
changing any of these three pointers, for appending new information to it.
The number of recoded bytes in the buffer is the difference between
@code{output.cursor} and @code{output.buffer}.

Each time @code{output.cursor} reaches @code{output.limit}, the buffer
is reallocated to a bigger size, possibly at a different location in memory,
and @code{output.buffer} is always kept up to date. It is still possible to
call a task level function with no output buffer at all to start with, in
which case all three fields should have @code{NULL} as a value. This is the
situation immediately after a call to @code{recode_new_task}.

@item strategy
@vindex strategy
@vindex RECODE_STRATEGY_UNDECIDED
This field, which is of type @code{enum recode_sequence_strategy}, tells
how various recoding steps (passes) will be interconnected. Its initial
value is @code{RECODE_STRATEGY_UNDECIDED}, which is a constant defined in
the header file @file{<recodext.h>}. Other possible values are:

@table @code
@item RECODE_SEQUENCE_IN_MEMORY
@vindex RECODE_SEQUENCE_IN_MEMORY
Keep intermediate recodings in memory.
@item RECODE_SEQUENCE_WITH_FILES
@vindex RECODE_SEQUENCE_WITH_FILES
Do not fork, use intermediate files.
@item RECODE_SEQUENCE_WITH_PIPE
@vindex RECODE_SEQUENCE_WITH_PIPE
Fork processes connected with @code{pipe(2)}.
@end table

@c FIXME
The best for now is to leave this field alone, and let the recoding
library decide its strategy, as many combinations have not been tested yet.

@item byte_order_mark
@vindex byte_order_mark
This field, which is preset to @code{true}, indicates that a byte order
mark is to be expected at the beginning of any canonical @code{UCS-2}
or @code{UTF-16} text, and that such a byte order mark should also be
produced for these charsets.

@item fail_level
@vindex fail_level
This field, which is of type @code{enum recode_error} (@pxref{Errors}),
sets the error level at which task level functions should report a failure.
If an error being detected is equal to or greater than @code{fail_level},
the function will eventually return @code{false} instead of @code{true}.
The preset value for this field is @code{RECODE_NOT_CANONICAL}, which means
that if not reset to another value, the library will report failure on
@emph{any} error.

@item abort_level
@vindex abort_level
@vindex RECODE_MAXIMUM_ERROR
This field, which is of type @code{enum recode_error} (@pxref{Errors}), sets
the error level at which task level functions should immediately interrupt
their processing. If an error being detected is equal to or greater than
@code{abort_level}, the function returns immediately, but the returned
value (@code{true} or @code{false}) is still decided from the setting
of @code{fail_level}, not @code{abort_level}. The preset value for this
field is @code{RECODE_MAXIMUM_ERROR}, which means that if not reset to
another value, the library will never interrupt a recoding task.

@item error_so_far
@vindex error_so_far
This field, which is of type @code{enum recode_error} (@pxref{Errors}),
maintains the maximum error level met so far while the recoding task
was proceeding. The preset value is @code{RECODE_NO_ERROR}.
@end table

@item Task execution
@cindex task execution

@findex recode_perform_task
@findex recode_filter_open
@findex recode_filter_close
@example
recode_perform_task (@var{task});
recode_filter_open (@var{task}, @var{file});
recode_filter_close (@var{task});
@end example

The function @code{recode_perform_task} reads as much input as possible,
and recodes all of it to the prescribed output, given a properly initialised
@var{task}.

Functions @code{recode_filter_open} and @code{recode_filter_close} are
only planned for now. They are meant to read input in piecemeal ways.
Even if this functionality already exists informally in the library, it has
not been made available yet through such interface functions.
@end itemize

@node Charset level, Errors, Task level, Library
@section Charset level functions
@cindex charset level functions

@cindex internal functions
Many functions are internal to the recoding library. Some of them
have been made external and available, for the @code{recode} program
had to retain all its previous functionality while being transformed
into a mere application of the recoding library. These functions are
not really documented here for the time being, as we hope that many of
them will vanish over time. Once this set of routines stabilises,
it will be convenient to document them as an API for handling charset
names and contents.

@findex find_charset
@findex list_all_charsets
@findex list_concise_charset
@findex list_full_charset
@example
RECODE_CHARSET find_charset (@var{name}, @var{cleaning-type});
bool list_all_charsets (@var{charset});
bool list_concise_charset (@var{charset}, @var{list-format});
bool list_full_charset (@var{charset});
@end example

@node Errors, , Charset level, Library
@section Handling errors
@cindex error handling
@cindex handling errors

@cindex error messages
The @code{recode} program, while using the @code{recode} library, needs to
control whether recoding problems are reported or not, and then reflect
these in the exit status. The program should also instruct the library
whether the recoding should be abruptly interrupted when an error is
met (so sparing processing when it is known in advance that a wrong
result would be discarded anyway), or if it should proceed nevertheless.
Here is how the library groups errors into levels, listed in order
of increasing severity.

@table @code
@item RECODE_NO_ERROR
@vindex RECODE_NO_ERROR

No error was met on previous library calls.

@item RECODE_NOT_CANONICAL
@vindex RECODE_NOT_CANONICAL
@cindex non canonical input, error message

The input text was using one of the many alternative codings for some
phenomenon, but not the one @code{recode} would have canonically generated.
So, if the reverse recoding is later attempted, it would produce a text
having the same @emph{meaning} as the original text, yet not being byte
identical.

For example, a @code{Base64} block in which end-of-lines appear elsewhere
than at every 76 characters is not canonical. An e-circumflex in @TeX{}
which is coded as @samp{\^@{e@}} instead of @samp{\^e} is not canonical.

@item RECODE_AMBIGUOUS_OUTPUT
@vindex RECODE_AMBIGUOUS_OUTPUT
@cindex ambiguous output, error message

It has been discovered that if the reverse recoding were attempted on
the text output by this recoding, we would not obtain the original text,
only because an ambiguity was generated by accident in the output text.
This ambiguity would then cause the wrong interpretation to be taken.

Here are a few examples. If the @code{Latin-1} sequence @samp{e^}
is converted to Easy French and back, the result will be interpreted
as e-circumflex and so will not reflect the intent of the original two
characters. Recoding an @code{IBM-PC} text to @code{Latin-1} and back,
where the input text contained an isolated @kbd{LF}, will have a spurious
@kbd{CR} inserted before the @kbd{LF}.

Currently, there are many cases in the library where the production of
ambiguous output is not properly detected, as it is sometimes a difficult
problem to accomplish this detection, or to do it speedily.

@item RECODE_UNTRANSLATABLE
@vindex RECODE_UNTRANSLATABLE
@cindex untranslatable input, error message

One or more input characters could not be recoded, because there is just
no representation for them in the output charset.

Here are a few examples. Non-strict mode often allows @code{recode} to
compute on-the-fly mappings for unrepresentable characters, but strict
mode prohibits such attribution of reversible translations: so strict
mode might often trigger such an error. Most @code{UCS-2} codes used to
represent Asian characters cannot be expressed in various Latin charsets.

@item RECODE_INVALID_INPUT
@vindex RECODE_INVALID_INPUT
@cindex invalid input, error message

The input text does not comply with the coding it is declared to hold. So,
there is no way by which a reverse recoding would reproduce this text,
because @code{recode} should never produce invalid output.

Here are a few examples. In strict mode, @code{ASCII} text is not allowed
to contain characters with the eighth bit set. @code{UTF-8} encodings
ought to be minimal@footnote{The minimality of an @code{UTF-8} encoding
is guaranteed on output, but currently, it is not checked on input.}.

@item RECODE_SYSTEM_ERROR
@vindex RECODE_SYSTEM_ERROR
@cindex system detected problem, error message

The underlying system reported an error while the recoding was going on,
likely an input/output error.
(This error symbol is currently unused in the library.)

@item RECODE_USER_ERROR
@vindex RECODE_USER_ERROR
@cindex misuse of recoding library, error message

The programmer or user requested something the recoding library is unable
to provide, or used the API wrongly.
(This error symbol is currently unused in the library.)

@item RECODE_INTERNAL_ERROR
@vindex RECODE_INTERNAL_ERROR
@cindex internal recoding bug, error message

Something really wrong, which should normally never happen, was detected
within the recoding library. This might be due to genuine bugs in the
library, or maybe due to un-initialised or overwritten arguments to
the API.
(This error symbol is currently unused in the library.)

@item RECODE_MAXIMUM_ERROR
@vindex RECODE_MAXIMUM_ERROR

This error code should never be returned; it is only used internally as
a sentinel for the list of all possible error codes.
@end table

@cindex error level threshold
@cindex threshold for error reporting
One should be able to set the error level threshold for returning failure
at end of recoding, and also the threshold for immediate interruption.
If many errors occur while the recoding proceeds, which are not severe
enough to interrupt the recoding, then the most severe error is retained,
while others are forgotten@footnote{Another approach would have been
to define the level symbols as masks instead, and to give masks to
threshold setting routines, and to retain all errors---yet I have never
met such a need myself in practice, and so I fear it would be overkill.
On the other hand, it might be interesting to maintain counters about
how many times each kind of error occurred.}. So, in case of an error,
the possible actions currently are:

@itemize @bullet
@item do nothing and let go, returning success at end of recoding,
@item just let go for now, but return failure at end of recoding,
@item interrupt recoding right away and return failure now.
@end itemize

@noindent
@xref{Task level}, and particularly the description of the fields
@code{fail_level}, @code{abort_level} and @code{error_so_far}, for more
information about how errors are handled.
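
The interplay of the two thresholds with the retained maximum error can be
sketched in a few lines of C. This is only an illustration of the rules
described in this section, not code taken from the library: the
@code{sketch_} and @code{SKETCH_} names are hypothetical stand-ins for the
@code{RECODE_} constants and the task fields, and the comparison used for
the final verdict is an assumption drawn from the description of
@code{fail_level}.

```c
#include <stdbool.h>

/* Hypothetical local mirror of the error levels, in increasing severity.
   Real programs would use the RECODE_* constants from the library.  */
enum sketch_error
{
  SKETCH_NO_ERROR,
  SKETCH_NOT_CANONICAL,
  SKETCH_AMBIGUOUS_OUTPUT,
  SKETCH_UNTRANSLATABLE,
  SKETCH_INVALID_INPUT,
  SKETCH_SYSTEM_ERROR,
  SKETCH_USER_ERROR,
  SKETCH_INTERNAL_ERROR,
  SKETCH_MAXIMUM_ERROR
};

/* Hypothetical subset of the task fields that deal with errors.  */
struct sketch_task
{
  enum sketch_error fail_level;
  enum sketch_error abort_level;
  enum sketch_error error_so_far;
};

/* Apply the preset values given in the field descriptions.  */
void
sketch_init (struct sketch_task *task)
{
  task->fail_level = SKETCH_NOT_CANONICAL;
  task->abort_level = SKETCH_MAXIMUM_ERROR;
  task->error_so_far = SKETCH_NO_ERROR;
}

/* Record one detected error, keeping only the most severe one.
   Return true when the recoding should be interrupted right away.  */
bool
sketch_note_error (struct sketch_task *task, enum sketch_error error)
{
  if (error > task->error_so_far)
    task->error_so_far = error;
  return error >= task->abort_level;
}

/* Final verdict: success only if no error reached fail_level.  */
bool
sketch_succeeded (const struct sketch_task *task)
{
  return task->error_so_far < task->fail_level;
}
```

With the preset values, any noted error makes @code{sketch_succeeded}
return false, yet nothing ever interrupts the processing. Raising
@code{fail_level} to the invalid input level would let untranslatable
characters pass silently, while lowering @code{abort_level} to the same
level would stop the recoding at the first invalid input.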

@ignore
@c FIXME: Take a look at these matters, indeed.

A last topic around errors is where the error related fields are kept.
To work nicely with threads, my feeling is that the main API levels (based
on either of @code{struct recode_outer}, @code{struct recode_request}
or @code{struct recode_task}) should each have their error thresholds,
values, and last explicit message strings. Thresholds would be inherited
by requests from outers, and by tasks from requests. Error values and
strings would be automatically propagated out from tasks to requests,
for these request level routines which internally set up and use recoding
tasks.

As one simple way to avoid locking while sparing the initialisation of many
identical requests, a programmer could prepare the common request before
splitting threads, and merely @emph{copy} the @code{struct recode_request}
so each thread has its own copy---either using a mere assignment or
@code{memcpy}. The same could be said for @code{struct recode_outer}
or @code{struct recode_task} blocks, yet it makes less sense to me to do
so in practice.
@end ignore

@node Universal, libiconv, Library, Top
@chapter The universal charset

@cindex ISO 10646
Standard @w{ISO 10646} defines a universal character set, intended to encompass
in the long run all languages written on this planet. It is based on
wide characters, and offers room for two billion characters
(@math{2^31}).

@tindex UCS
This charset was to become available in @code{recode} under the name
@code{UCS}, with many external surfaces for it. But in the current
version, only surfaces of @code{UCS} are offered, each presented as a
genuine charset rather than a surface.
Such surfaces are only meaningful
for the @code{UCS} charset, so it is not that useful to draw a line
between the surfaces and the only charset to which they may apply.

@tindex UTF-1
@code{UCS} stands for Universal Character Set. @code{UCS-2} and
@code{UCS-4} are fixed length encodings, using two or four bytes per
character respectively. @code{UTF} stands for @code{UCS} Transformation
Format, and the @code{UTF} formats are variable length encodings dedicated
to @code{UCS}.
@code{UTF-1} was based on @w{ISO 2022}; it did not succeed@footnote{It is not
probable that @code{recode} will ever support @code{UTF-1}.}. @code{UTF-2}
replaced it; it has been called @code{UTF-FSS} (File System Safe) in
Unicode or Plan9 context, but is better known today as @code{UTF-8}.
To complete the picture, there is @code{UTF-16} based on 16-bit units,
and @code{UTF-7} which is meant for transmissions limited to 7-bit bytes.
Most often, one might see @code{UTF-8} used for external storage, and
@code{UCS-2} used for internal storage.

@c FIXME: the manual never explains what the U+NNNN notation means!
When @code{recode} is producing any representation of @code{UCS},
it uses the replacement character @code{U+FFFD} for any @emph{valid}
character which is not representable in the goal charset@footnote{This
is when the goal charset allows for 16 bits. For shorter charsets,
the @samp{--strict} (@samp{-s}) option decides what happens: either the
character is dropped, or a reversible mapping is produced on the fly.}.
This happens, for example, when @code{UCS-2} cannot represent a
wide @code{UCS-4} character or, for a similar reason, an @code{UTF-8}
sequence using more than three bytes. The replacement character is
meant to represent an existing character. So, it is never produced to
represent an invalid sequence or ill-formed character in the input text.
In such cases, @code{recode} just gets rid of the noise, while taking note
of the error in its usual ways.

Even if @code{UTF-8} is an encoding, really, it is the encoding of a single
character set, and nothing else. It is useful to distinguish between an
encoding (a @emph{surface} within @code{recode}) and a charset, but only
when the surface may be applied to several charsets. Specifying a charset
is a bit simpler than specifying a surface in a @code{recode} request.
There would not be a practical advantage in imposing a more complex syntax
on @code{recode} users, when it is simple to assimilate @code{UTF-8} to
a charset. Similar considerations apply for @code{UCS-2}, @code{UCS-4},
@code{UTF-16} and @code{UTF-7}. These are all considered to be charsets.

@menu
* UCS-2::               Universal Character Set, 2 bytes
* UCS-4::               Universal Character Set, 4 bytes
* UTF-7::               Universal Transformation Format, 7 bits
* UTF-8::               Universal Transformation Format, 8 bits
* UTF-16::              Universal Transformation Format, 16 bits
* count-characters::    Frequency count of characters
* dump-with-names::     Fully interpreted UCS dump
@end menu

@node UCS-2, UCS-4, Universal, Universal
@section Universal Character Set, 2 bytes

@tindex UCS-2
@cindex Unicode
One surface of @code{UCS} is usable for the subset defined by its first
sixty thousand characters (in fact, @math{31 * 2^11} codes), and uses
exactly two bytes per character. It is a mere dump of the internal
memory representation which is @emph{natural} for this subset and as such,
carries with it endianness problems.

@cindex byte order mark
A non-empty @code{UCS-2} file normally begins with a so-called @dfn{byte
order mark}, having value @code{0xFEFF}.
The value @code{0xFFFE} is not an
@code{UCS} character, so if this value is seen at the beginning of a file,
@code{recode} reacts by swapping all pairs of bytes. The library also
properly reacts to other occurrences of @code{0xFEFF} or @code{0xFFFE}
elsewhere than at the beginning, because concatenation of @code{UCS-2}
files should stay a simple matter, but it might trigger a diagnostic
about non canonical input.

By default, when producing an @code{UCS-2} file, @code{recode} always
outputs the high order byte before the low order byte. But this could be
easily overridden through the @code{21-Permutation} surface
(@pxref{Permutations}). For example, the command:

@example
recode u8..u2/21 < @var{input} > @var{output}
@end example

@noindent
asks for an @code{UTF-8} to @code{UCS-2} conversion, with swapped byte
output.

@tindex ISO-10646-UCS-2, and aliases
@tindex BMP
@tindex rune
@tindex u2
Use @code{UCS-2} as a genuine charset. This charset is available in
@code{recode} under the name @code{ISO-10646-UCS-2}. Accepted aliases
are @code{UCS-2}, @code{BMP}, @code{rune} and @code{u2}.

@tindex combined-UCS-2
@cindex combining characters
The @code{recode} library is able to combine some sequences of @code{UCS-2}
codes into single code characters, to represent a few diacriticized
characters, ligatures or diphthongs which have been included to ease
mapping with other existing charsets. It is also able to explode
such single code characters into the corresponding sequence of codes.
The request syntax for triggering such operations is rudimentary and
temporary. The @code{combined-UCS-2} pseudo character set is a special
form of @code{UCS-2} in which known combinings have been replaced by the
simpler code.
Using @code{combined-UCS-2} instead of @code{UCS-2} in an
@emph{after} position of a request forces a combining step, while using
@code{combined-UCS-2} instead of @code{UCS-2} in a @emph{before} position
of a request forces an exploding step. For the time being, one has to
resort to advanced request syntax to achieve other effects. For example:

@example
recode u8..co,u2..u8 < @var{input} > @var{output}
@end example

@noindent
copies an @code{UTF-8} @var{input} over @var{output}, still to be in
@code{UTF-8}, yet merging combining characters into single codes whenever
possible.

@node UCS-4, UTF-7, UCS-2, Universal
@section Universal Character Set, 4 bytes

@tindex UCS-4
Another surface of @code{UCS} uses exactly four bytes per character, and is
a mere dump of the internal memory representation which is @emph{natural}
for the whole charset and as such, conveys with it endianness problems.

@tindex ISO-10646-UCS-4, and aliases
@tindex ISO_10646
@tindex 10646
@tindex u4
Use it as a genuine charset. This charset is available in @code{recode}
under the name @code{ISO-10646-UCS-4}. Accepted aliases are @code{UCS},
@code{UCS-4}, @code{ISO_10646}, @code{10646} and @code{u4}.

@node UTF-7, UTF-8, UCS-4, Universal
@section Universal Transformation Format, 7 bits

@tindex UTF-7
@code{UTF-7} comes from IETF rather than ISO, and is described by
@w{RFC 2152}, in the MIME series. The @code{UTF-7} encoding is meant to fit
@code{UCS-2} over channels limited to seven bits per byte. It proceeds
from a mix between the spirit of @code{Quoted-Printable} and methods of
@code{Base64}, adapted to Unicode contexts.

@tindex UNICODE-1-1-UTF-7, and aliases
@tindex TF-7
@tindex u7
This charset is available in @code{recode} under the name
@code{UNICODE-1-1-UTF-7}.
Accepted aliases are @code{UTF-7}, @code{TF-7}
and @code{u7}.

@node UTF-8, UTF-16, UTF-7, Universal
@section Universal Transformation Format, 8 bits

@tindex UTF-8
Even if @code{UTF-8} does not originally come from IETF, there is now
@w{RFC 2279} to describe it. In letters sent on 1995-01-21 and 1995-04-20,
Markus Kuhn writes:

@quotation
@code{UTF-8} is an @code{ASCII} compatible multi-byte encoding of the
@w{ISO 10646} universal character set (@code{UCS}). @code{UCS} is a 31-bit superset
of all other character set standards. The first 256 characters of @code{UCS}
are identical to those of @w{ISO 8859-1} (@w{Latin-1}). The @code{UCS-2}
encoding of UCS is a sequence of bigendian 16-bit words, the @code{UCS-4}
encoding is a sequence of bigendian 32-bit words. The @code{UCS-2} subset
of @w{ISO 10646} is also known as ``Unicode''. As both @code{UCS-2}
and @code{UCS-4} require heavy modifications to traditional @code{ASCII}
oriented system designs (e.g. Unix), the @code{UTF-8} encoding has been
designed for these applications.

In @code{UTF-8}, only @code{ASCII} characters are encoded using bytes
below 128. All other non-ASCII characters are encoded as multi-byte
sequences consisting only of bytes in the range 128-253. This avoids
critical bytes like @kbd{NUL} and @kbd{/} in @code{UTF-8} strings, which
makes the @code{UTF-8} encoding suitable for being handled by the standard
C string library and being used in Unix file names. Other properties
include the preserved lexical sorting order and that @code{UTF-8} allows
easy self-synchronisation of software receiving @code{UTF-8} strings.
@end quotation

@code{UTF-8} is the most common external surface of @code{UCS}. Each
character uses from one to six bytes, and the encoding is able to represent
all @math{2^31} characters of the @code{UCS}.
It is implemented as a charset, with the
following properties:

@itemize @bullet
@item
Strict 7-bit @code{ASCII} is completely invariant under @code{UTF-8},
and those are the only one-byte characters. @code{UCS} values and
@code{ASCII} values coincide. No multi-byte characters ever contain bytes
less than 128. @code{NUL} @emph{is} @code{NUL}. A multi-byte character
always starts with a byte of 192 or more, and is always followed by a
number of bytes between 128 and 191. That means that you may read at
random on disk or memory, and easily discover the start of the current,
next or previous character. You can count, skip or extract characters
with this only knowledge.

@item
If you read the first byte of a multi-byte character in binary, it contains
many @samp{1} bits in succession starting with the most significant one
(from the left), at least two. The length of this @samp{1} sequence equals
the byte size of the character. All succeeding bytes start by @samp{10}.
This is a lot of redundancy, making it fairly easy to guess that a file
is valid @code{UTF-8}, or to safely state that it is not.

@item
If you remove all leading @samp{1} bits of the first byte of a multi-byte
character, and the initial @samp{10} bits of
all remaining bytes (so keeping 6 bits per byte for those), the remaining
bits concatenated are the UCS value.
@end itemize

@noindent
These properties also have a few nice consequences:

@itemize @bullet
@item
Conversion to/from values is algorithmically simple, and reasonably speedy.

@item
A sequence of @var{N} bytes, with @var{N} between 2 and 6, can hold
characters needing up to 1 + 5@var{N} bits in their @code{UCS}
representation, while a single byte holds 7 bits.
So, @code{UTF-8} is most economical when mapping ASCII (1 byte),
followed by @code{UCS-2} (1 to 3 bytes) and @code{UCS-4} (1 to 6 bytes).

@item
The lexicographic sorting order of @code{UCS} strings is preserved.

@item
Bytes with value 254 or 255 never appear, and because of that, these are
sometimes used when escape mechanisms are needed.
@end itemize

In some cases, when little processing is done on a lot of strings, one may
choose for efficiency reasons to handle @code{UTF-8} strings directly even
if variable length, as it is easy to find the start of characters. Character
insertion or replacement might require moving the remainder of the string
in either direction. In most cases, it is faster and easier to convert
from @code{UTF-8} to @code{UCS-2} or @code{UCS-4} prior to processing.

@tindex UTF-8, aliases
@tindex UTF-FSS
@tindex FSS_UTF
@tindex TF-8
@tindex u8
This charset is available in @code{recode} under the name @code{UTF-8}.
Accepted aliases are @code{UTF-2}, @code{UTF-FSS}, @code{FSS_UTF},
@code{TF-8} and @code{u8}.

@node UTF-16, count-characters, UTF-8, Universal
@section Universal Transformation Format, 16 bits

@tindex UTF-16, and aliases
Another external surface of @code{UCS} is also variable length, each
character using either two or four bytes. It is usable for the subset
defined by the first million characters (@math{17 * 2^16}) of @code{UCS}.

Martin J. D@"urst writes (to @uref{comp.std.internat}, on 1995-03-28):

@quotation
@code{UTF-16} is another method that reserves two times 1024 codepoints in
Unicode and uses them to index around one million additional characters.
@code{UTF-16} is a little bit like former multibyte codes, but quite
not so, as both the first and the second 16-bit code clearly show what
they are.
The idea is that one million codepoints should be enough for
all the rare Chinese ideograms and historical scripts that do not fit
into the Base Multilingual Plane of @w{ISO 10646} (with just about 63,000
positions available, now that 2,000 are gone).
@end quotation

@tindex Unicode, an alias for UTF-16
@tindex TF-16
@tindex u6
This charset is available in @code{recode} under the name @code{UTF-16}.
Accepted aliases are @code{Unicode}, @code{TF-16} and @code{u6}.

@node count-characters, dump-with-names, UTF-16, Universal
@section Frequency count of characters

@tindex count-characters
@cindex counting characters
A device may be used to obtain a list of characters in a file, and how many
times each character appears. Each count is followed by the @code{UCS-2}
value of the character and, when known, the @w{RFC 1345} mnemonic for that
character.

This charset is available in @code{recode} under the name
@code{count-characters}.

This @code{count} feature has been implemented as a charset. This may
change in some later version, as it would sometimes be convenient to count
original bytes, instead of their @code{UCS-2} equivalent.

@node dump-with-names, , count-characters, Universal
@section Fully interpreted UCS dump

@tindex dump-with-names
@cindex dumping characters, with description
@cindex character streams, description
@cindex description of individual characters
Another device may be used to get fully interpreted dumps of an @code{UCS-2}
stream of characters, with one @code{UCS-2} character displayed on a full
output line. Each line receives the @w{RFC 1345} mnemonic for the character
if it exists, the @code{UCS-2} value of the character, and a descriptive
comment for that character.
As each input character produces its own
output line, beware that the output file from this conversion may be much,
much bigger than the input file.

This charset is available in @code{recode} under the name
@code{dump-with-names}.

This @code{dump-with-names} feature has been implemented as a charset rather
than a surface. This is surely debatable. The current implementation
allows for dumping charsets other than @code{UCS-2}. For example, the
command @w{@samp{recode l2..full < @var{input}}} implies a necessary
conversion from @code{Latin-2} to @code{UCS-2}, as @code{dump-with-names}
is only connected out from @code{UCS-2}. In such cases, @code{recode}
does not display the original @code{Latin-2} codes in the dump, only the
corresponding @code{UCS-2} values. To give a simpler example, the command

@example
echo 'Hello, world!' | recode us..dump
@end example

@noindent
produces the following output:

@example
UCS2   Mne   Description

0048   H     latin capital letter h
0065   e     latin small letter e
006C   l     latin small letter l
006C   l     latin small letter l
006F   o     latin small letter o
002C   ,     comma
0020   SP    space
0077   w     latin small letter w
006F   o     latin small letter o
0072   r     latin small letter r
006C   l     latin small letter l
0064   d     latin small letter d
0021   !     exclamation mark
000A   LF    line feed (lf)
@end example

The descriptive comment is given in English and @code{ASCII},
yet if the English description is not available but a French one is, then
the French description is given instead, using @code{Latin-1}. However,
if the @code{LANGUAGE} or @code{LANG} environment variable begins with
the letters @samp{fr}, then listing preference goes to French when both
descriptions are available.

Here is another example.
To get the long description of the code 237 in
the @w{Latin-5} table, one may use the following command.

@example
echo -n 237 | recode l5/d..dump
@end example

@noindent
If your @code{echo} does not grok @samp{-n}, use @samp{echo 237\c} instead.
Here is how to see what Unicode @code{U+03C6} means, while getting rid of
the title lines.

@example
echo -n 0x03C6 | recode u2/x2..dump | tail -n +3
@end example

@node libiconv, Tabular, Universal, Top
@chapter The @code{iconv} library

@cindex @code{iconv} library
@cindex library, @code{iconv}
@cindex @code{libiconv}
@cindex interface, with @code{iconv} library
@cindex Haible, Bruno
The @code{recode} library itself contains most code and tables from the
portable @code{iconv} library, written by Bruno Haible. In fact, many
capabilities of the @code{recode} library are duplicated because of this
merging, as the older @code{recode} and @code{iconv} libraries share many
charsets. We discuss, here, the issues related to this duplication, and
other peculiarities specific to the @code{iconv} library. The plan is to
remove duplications and better merge specificities, as @code{recode} evolves.

As implemented, if a recoding request can be satisfied by the @code{recode}
library both with and without its @code{iconv} library part, it is likely
that the @code{iconv} library will be used. To find out whether the
@code{iconv} library is indeed used or not, just use the @samp{-v} or
@samp{--verbose} option (@pxref{Recoding}).

@tindex libiconv
The @code{:libiconv:} charset represents a conceptual pivot charset
within the @code{iconv} part of the @code{recode} library (in fact,
this pivot exists, but is not directly reachable). This charset has a
mere @code{:} (a colon) for an alias. It is not allowed to recode from
or to this charset directly.
But when this charset is selected as an
intermediate, usually by automatic means, then the @code{iconv} part
of the @code{recode} library is called to handle the transformations.
By using an @samp{--ignore=:libiconv:} option on the @code{recode} call
or equivalently, but more simply, @samp{-x:}, @code{recode} is instructed
to fully avoid this charset as an intermediate, with the consequence that
the @code{iconv} part of the library is defeated. Consider these two calls:

@example
recode l1..1250 < @var{input} > @var{output}
recode -x: l1..1250 < @var{input} > @var{output}
@end example

@noindent
Both should transform @var{input} from @code{ISO-8859-1} to @code{CP1250}
on @var{output}. The first call uses the @code{iconv} part of the library,
while the second call avoids it. Whatever the path used, the results should
normally be identical. However, there might be observable differences.
Most of them might result from reversibility issues, as the @code{iconv}
engine, which the @code{recode} library directly uses for the time being,
does not address reversibility. Even if much less likely, some differences
might result from slight errors in the tables used; such differences should
then be reported as bugs.

Other irregularities might be seen in the area of error detection and
recovery. The @code{recode} library usually tries to detect canonicity
errors in input, and production of ambiguous output, but the @code{iconv}
part of the library currently does not. Input is always validated, however.
The @code{recode} library may not always react properly when its @code{iconv}
part has no translation for a given character.

Within a collection of names for a single charset, the @code{recode}
library distinguishes one of them as being the genuine charset name,
while the others are said to be aliases.
When @code{recode} lists all
charsets, for example with the @samp{-l} or @samp{--list} option, the list
integrates all @code{iconv} library charsets. The selection of one of the
aliases as the genuine charset name is an artifact added by @code{recode};
it does not come from @code{iconv}. Moreover, the @code{recode} library
dynamically resolves some conflicts when it initialises itself at runtime.
This might explain some discrepancies in the table below, as for what is
the genuine charset name.

@include libiconv.texi

@node Tabular, ASCII misc, libiconv, Top
@chapter Tabular sources (@w{RFC 1345})

@cindex RFC 1345
@cindex character mnemonics, documentation
@cindex @code{chset} tools
An important part of the tabular charset knowledge in @code{recode}
comes from @w{RFC 1345} or, alternatively, from the @code{chset} tools,
both maintained by Keld Simonsen. The @w{RFC 1345} document:

@quotation
``Character Mnemonics & Character Sets'', K. Simonsen, Request for
Comments no. 1345, Network Working Group, June 1992.
@end quotation

@noindent
@cindex deviations from RFC 1345
defines many character mnemonics and character sets. The @code{recode}
library implements most of @w{RFC 1345}, however:

@itemize @bullet
@item
@tindex dk-us@r{, not recognised by }recode
@tindex us-dk@r{, not recognised by }recode
It does not recognise those charsets which overload character positions:
@code{dk-us} and @code{us-dk}. However, see @ref{Mixed}.

@item
@tindex ANSI_X3.110-1983@r{, not recognised by }recode
@tindex ISO_6937-2-add@r{, not recognised by }recode
@tindex T.101-G2@r{, not recognised by }recode
@tindex T.61-8bit@r{, not recognised by }recode
@tindex iso-ir-90@r{, not recognised by }recode
It does not recognise those charsets which combine two characters for
representing a third: @code{ANSI_X3.110-1983}, @code{ISO_6937-2-add},
@code{T.101-G2}, @code{T.61-8bit}, @code{iso-ir-90} and
@code{videotex-suppl}.

@item
@tindex GB_2312-80@r{, not recognised by }recode
@tindex JIS_C6226-1978@r{, not recognised by }recode
@tindex JIS_X0212-1990@r{, not recognised by }recode
@tindex KS_C_5601-1987@r{, not recognised by }recode
It does not recognise 16-bit charsets: @code{GB_2312-80},
@code{JIS_C6226-1978}, @code{JIS_C6226-1983}, @code{JIS_X0212-1990} and
@code{KS_C_5601-1987}.

@item
@tindex isoir91
@tindex isoir92
It interprets the charset @code{isoir91} as @code{NATS-DANO} (alias
@code{iso-ir-9-1}), @emph{not} as @code{JIS_C6229-1984-a} (alias
@code{iso-ir-91}). It also interprets the charset @code{isoir92}
as @code{NATS-DANO-ADD} (alias @code{iso-ir-9-2}), @emph{not} as
@code{JIS_C6229-1984-b} (alias @code{iso-ir-92}). It might be better
to just avoid these two alias names.
@end itemize

Keld Simonsen @email{keld@@dkuug.dk} did most of @w{RFC 1345} himself, with
some funding from Danish Standards and the Nordic standards (INSTA) project.
He also did the character set design work, with substantial input from
Olle Jaernefors. Keld typed in almost all of the tables; some have been
contributed. A number of people have checked the tables in various
ways. The RFC lists a number of people who helped.

@cindex @code{recode}, and RFC 1345
Keld and the @code{recode} maintainer have an arrangement by which any newly
discovered information submitted by @code{recode} users, about tabular
charsets, is forwarded to Keld, eventually merged into Keld's work,
and only then reimported into @code{recode}. Neither the @code{recode}
program nor its library tries to compete, or even to establish itself as
an alternate or diverging reference: @w{RFC 1345} and its new drafts stay the
genuine source for most tabular information conveyed by @code{recode}.
Keld has been more than collaborative so far, so there is no reason for
us to act otherwise. In a word, @code{recode} should be perceived as the
application of external references, but not as a reference in itself.

@tindex RFC1345@r{, a charset, and its aliases}
@tindex 1345
@tindex mnemonic@r{, an alias for RFC1345 charset}
Internally, @w{RFC 1345} associates with each character an unambiguous
mnemonic of a few characters, taken from @w{ISO 646}, which is a minimal
ASCII subset of 83 characters. The charset made up by these mnemonics
is available in @code{recode} under the name @code{RFC1345}. It has
@code{mnemonic} and @code{1345} for aliases. As implemented, this charset
exactly corresponds to @code{mnemonic+ascii+38}, using @w{RFC 1345}
nomenclature. Roughly said, @w{ISO 646} characters represent themselves,
except for the ampersand (@kbd{&}) which appears doubled. A prefix of a
single ampersand introduces a mnemonic. For mnemonics using two characters,
the prefix is immediately followed by the mnemonic. For longer mnemonics,
the prefix is followed by an underline (@kbd{_}), the mnemonic, and another
underline. Conversions to this charset are usually reversible.

Currently, @code{recode} does not offer any of the many other possible
variations of this family of representations.
They will likely be
implemented in some future version, however.

@table @code
@include rfc1345.texi
@end table

@node ASCII misc, IBM and MS, Tabular, Top
@chapter ASCII and some derivatives

@menu
* ASCII::                       Usual ASCII
* ISO 8859::                    ASCII extended by Latin Alphabets
* ASCII-BS::                    ASCII 7-bits, @kbd{BS} to overstrike
* flat::                        ASCII without diacritics nor underline
@end menu

@node ASCII, ISO 8859, ASCII misc, ASCII misc
@section Usual ASCII

@tindex ASCII@r{, an alias for the }ANSI_X3.4-1968@r{ charset}
@tindex ANSI_X3.4-1968@r{, and its aliases}
@tindex IBM367
@tindex US-ASCII
@tindex cp367
@tindex iso-ir-6
@tindex us
This charset is available in @code{recode} under the name @code{ASCII}.
In fact, its true name is @code{ANSI_X3.4-1968} as per @w{RFC 1345},
accepted aliases being @code{ANSI_X3.4-1986}, @code{ASCII},
@code{IBM367}, @code{ISO646-US}, @code{ISO_646.irv:1991},
@code{US-ASCII}, @code{cp367}, @code{iso-ir-6} and @code{us}. The
shortest way of specifying it in @code{recode} is @code{us}.

@cindex ASCII table, recreating with @code{recode}
This documentation used to include ASCII tables. They have been removed
since the @code{recode} program can now recreate these easily:

@example
recode -lf us                   for commented ASCII
recode -ld us                   for concise decimal table
recode -lo us                   for concise octal table
recode -lh us                   for concise hexadecimal table
@end example

@node ISO 8859, ASCII-BS, ASCII, ASCII misc
@section ASCII extended by Latin Alphabets

@cindex Latin charsets
There are many Latin charsets. The following has been written by Tim
Lasko @email{lasko@@video.dec.com}, a long while ago:

@quotation
ISO @w{Latin-1}, or more completely ISO Latin Alphabet No 1, is now an
international standard as of February 1987 (IS 8859, Part 1).
For those
American USEnet'rs that care, the 8-bit ASCII standard, which is essentially
the same code, is going through the final administrative processes prior
to publication. ISO @w{Latin-1} (IS 8859/1) is actually one of an entire
family of eight-bit one-byte character sets, all having ASCII on the left
hand side, and with varying repertoires on the right hand side:

@itemize @bullet
@item
Latin Alphabet No 1 (caters to Western Europe - now approved).
@item
Latin Alphabet No 2 (caters to Eastern Europe - now approved).
@item
Latin Alphabet No 3 (caters to SE Europe + others - in draft ballot).
@item
Latin Alphabet No 4 (caters to Northern Europe - in draft ballot).
@item
Latin-Cyrillic alphabet (right half all Cyrillic - processing currently
suspended pending USSR input).
@item
Latin-Arabic alphabet (right half all Arabic - now approved).
@item
Latin-Greek alphabet (right half Greek + symbols - in draft ballot).
@item
Latin-Hebrew alphabet (right half Hebrew + symbols - proposed).
@end itemize
@end quotation

@tindex Latin-1
The ISO Latin Alphabet 1 is available as a charset in @code{recode} under
the name @code{Latin-1}. In fact, its true name is @code{ISO_8859-1:1987}
as per @w{RFC 1345}, accepted aliases being @code{CP819}, @code{IBM819},
@code{ISO-8859-1}, @code{ISO_8859-1}, @code{iso-ir-100}, @code{l1}
and @code{Latin-1}. The shortest way of specifying it in @code{recode}
is @code{l1}.

@cindex Latin-1 table, recreating with @code{recode}
It is an eight-bit code which coincides with ASCII for the lower half.
This documentation used to include @w{Latin-1} tables.
They have been removed
since the @code{recode} program can now recreate these easily:

@example
recode -lf l1                   for commented ISO Latin-1
recode -ld l1                   for concise decimal table
recode -lo l1                   for concise octal table
recode -lh l1                   for concise hexadecimal table
@end example

@node ASCII-BS, flat, ISO 8859, ASCII misc
@section ASCII 7-bits, @kbd{BS} to overstrike

@tindex ASCII-BS@r{, and its aliases}
@tindex BS@r{, an alias for }ASCII-BS@r{ charset}
This charset is available in @code{recode} under the name
@code{ASCII-BS}, with @code{BS} as an acceptable alias.

@cindex diacritics, with @code{ASCII-BS} charset
The file is straight ASCII, seven bits only. According to the definition
of ASCII, diacritics are applied by a sequence of three characters: the
letter, one @kbd{BS}, the diacritic mark. We deviate slightly from this
by exchanging the diacritic mark and the letter so that, on a screen device,
the diacritic will disappear and leave the letter alone. At recognition time,
both methods are acceptable.

The French quotes are coded by the sequences: @w{@kbd{< BS "}} or
@w{@kbd{" BS <}} for the opening quote and @w{@kbd{> BS "}} or
@w{@kbd{" BS >}} for the closing quote. This artificial convention was
inherited in straight @code{ASCII-BS} from habits around @code{Bang-Bang}
entry, and is not well known. But we decided to stick to it so that the
@code{ASCII-BS} charset will not lose French quotes.

The @code{ASCII-BS} charset is independent of @code{ASCII}, and
different. The following examples demonstrate this, knowing in advance
that @samp{!2} is the @code{Bang-Bang} way of representing an @kbd{e}
with an acute accent.
Compare:

@example
% echo \!2 | recode -v bang..l1/d
Request: Bang-Bang..ISO-8859-1/Decimal-1
233,  10
@end example

@noindent
with:

@example
% echo \!2 | recode -v bang..bs/d
Request: Bang-Bang..ISO-8859-1..ASCII-BS/Decimal-1
 39,   8, 101,  10
@end example

In the first case, the @kbd{e} with an acute accent is merely
transmitted by the @code{Latin-1..ASCII} mapping, not having a special
recoding rule for it. In the @code{Latin-1..ASCII-BS} case, the acute
accent is applied over the @kbd{e} with a backspace: diacriticised
characters have special rules. For the @code{ASCII-BS} charset,
reversibility is still possible, but there might be difficult cases.

@node flat, , ASCII-BS, ASCII misc
@section ASCII without diacritics nor underline
@tindex flat@r{, a charset}

This charset is available in @code{recode} under the name @code{flat}.

@cindex diacritics and underlines, removing
@cindex removing diacritics and underlines
This code is ASCII expunged of all diacritics and underlines, as long as
they are applied using three character sequences, with @kbd{BS} in the
middle. Also, although this is slightly unrelated, each control character is
represented by a sequence of two or three graphic characters. The newline
character, however, keeps its functionality and is not represented.

Note that charset @code{flat} is a terminal charset. We can convert
@emph{to} @code{flat}, but not @emph{from} it.

@node IBM and MS, CDC, ASCII misc, Top
@chapter Some IBM or Microsoft charsets

@cindex IBM codepages
@cindex codepages
The @code{recode} program provides various IBM or Microsoft code pages
(@pxref{Tabular}).
An easy way to find them all at once out of the
@code{recode} program itself is through the command:

@example
recode -l | egrep -i '(CP|IBM)[0-9]'
@end example

@noindent
See also the few special charsets presented in the following sections.

@menu
* EBCDIC::                      EBCDIC codes
* IBM-PC::                      IBM's PC code
* Icon-QNX::                    Unisys' Icon code
@end menu

@node EBCDIC, IBM-PC, IBM and MS, IBM and MS
@section EBCDIC code

@cindex EBCDIC charsets
This charset is IBM's Extended Binary Coded Decimal Interchange
Code. This is an eight-bit code. The following three variants were
implemented in @code{recode} independently of @w{RFC 1345}:

@table @code
@item EBCDIC
@tindex EBCDIC@r{, a charset}
In @code{recode}, the @code{us..ebcdic} conversion is identical to the
@samp{dd conv=ebcdic} conversion, and the @code{ebcdic..us} conversion is
identical to the @samp{dd conv=ascii} conversion. This charset also represents
the way Control Data Corporation relates EBCDIC to 8-bit ASCII.

@item EBCDIC-CCC
@tindex EBCDIC-CCC
In @code{recode}, the @code{us..ebcdic-ccc} or @code{ebcdic-ccc..us}
conversions represent the way Concurrent Computer Corporation (formerly
Perkin Elmer) relates EBCDIC to 8-bit ASCII.

@item EBCDIC-IBM
@tindex EBCDIC-IBM
In @code{recode}, the @code{us..ebcdic-ibm} conversion is @emph{almost}
identical to the GNU @samp{dd conv=ibm} conversion. Given the exact
@samp{dd conv=ibm} conversion table, @code{recode} once said:

@example
Codes  91 and 213 both recode to 173
Codes  93 and 229 both recode to 189
No character recodes to  74
No character recodes to 106
@end example

So I arbitrarily chose to recode 213 by 74 and 229 by 106. This makes the
@code{EBCDIC-IBM} recoding reversible, but this is not necessarily the best
correction.
In any case, I think that GNU @code{dd} should be amended.
@code{dd} and @code{recode} should ideally agree on the same correction.
So, this table might change once again.
@end table

@w{RFC 1345} brings into @code{recode} 15 other EBCDIC charsets, and 21 other
charsets having EBCDIC in at least one of their alias names. You can
get a list of all these by executing:

@example
recode -l | grep -i ebcdic
@end example

Note that @code{recode} may convert a pure stream of EBCDIC characters,
but it does not know how to handle binary data between records which
is sometimes used to delimit them and build physical blocks. If end of
lines are not marked, fixed record size may produce something readable,
but @code{VB} or @code{VBS} blocking is likely to yield some garbage in
the converted results.

@node IBM-PC, Icon-QNX, EBCDIC, IBM and MS
@section IBM's PC code

@tindex IBM-PC
@cindex MS-DOS charsets
@tindex MSDOS
@tindex dos
@tindex pc
This charset is available in @code{recode} under the name @code{IBM-PC},
with @code{dos}, @code{MSDOS} and @code{pc} as acceptable aliases.
The shortest way of specifying it in @code{recode} is @code{pc}.

The charset is aimed towards a PC microcomputer from IBM or any compatible.
This is an eight-bit code. This charset is fairly old in @code{recode};
its tables were produced a long while ago by mere inspection of a printed
chart of the IBM-PC codes and glyphs.

It has @code{CR-LF} as its implied surface. This means that, if the original
end of lines have to be preserved while going out of @code{IBM-PC}, they
should currently be added back through the usage of a surface on the other
charset, or better, just never removed.
Here are examples for both cases:

@example
recode pc..l2/cl < @var{input} > @var{output}
recode pc/..l2 < @var{input} > @var{output}
@end example

@w{RFC 1345} brings into @code{recode} 44 @samp{IBM} charsets or code pages,
and also 8 other code pages. You can get a list of all these by
executing:@footnote{On DOS/Windows, stock shells do not know that apostrophes
quote special characters like @kbd{|}, so one needs to use double quotes
instead of apostrophes.}

@example
recode -l | egrep -i '(CP|IBM)[0-9]'
@end example

@noindent
@cindex CR-LF surface, in IBM-PC charsets
@tindex IBM819@r{, and CR-LF surface}
All charsets or aliases beginning with the letters @samp{CP} or @samp{IBM}
also have @code{CR-LF} as their implied surface. The same is true for a
purely numeric alias in the same family. For example, all of @code{819},
@code{CP819} and @code{IBM819} imply @code{CR-LF} as a surface. Note that
@code{ISO-8859-1} does @emph{not} imply a surface, even though it shares the
same tabular data as @code{819}.

@tindex ibm437
There are a few discrepancies between this @code{IBM-PC} charset and the
very similar @w{RFC 1345} charset @code{ibm437}, which have not been analysed
yet, so the charsets are being kept separate for now. This might change in
the future, and the @code{IBM-PC} charset might disappear. Wizards would
be interested in comparing the output of these two commands:

@example
recode -vh IBM-PC..Latin-1
recode -vh IBM437..Latin-1
@end example

@noindent
The first command uses the charset prior to @w{RFC 1345} introduction.
Both methods give different recodings. These differences are annoying;
the fuzziness will have to be explained and settled one day.
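
The kind of comparison suggested above can also be approximated outside of
@code{recode}. The following Python sketch lists the byte positions at which
two 8-bit tables disagree. As an assumption made only for illustration, it
uses Python's own @code{cp437} and @code{cp850} codecs as stand-ins, since
Python has no table equivalent to the @code{IBM-PC} charset of @code{recode};
the helper names @code{decoding_table} and @code{table_differences} are
hypothetical and are not part of @code{recode}.

```python
def decoding_table(codec):
    """Return a list of 256 one-character strings, one per byte value."""
    # errors="replace" keeps the list at 256 entries even if a codec
    # leaves some byte values undefined.
    return [bytes([value]).decode(codec, errors="replace")
            for value in range(256)]


def table_differences(codec_a, codec_b):
    """Return (byte value, char in a, char in b) for every disagreement."""
    table_a = decoding_table(codec_a)
    table_b = decoding_table(codec_b)
    return [
        (value, table_a[value], table_b[value])
        for value in range(256)
        if table_a[value] != table_b[value]
    ]


# Stand-in codecs, chosen only because Python ships both of them.
differences = table_differences("cp437", "cp850")
for value, char_a, char_b in differences:
    print("%3d  U+%04X  U+%04X" % (value, ord(char_a), ord(char_b)))
```

Each printed line gives a byte value in decimal, followed by the Unicode
code point that each of the two tables assigns to it. This says nothing
about the actual discrepancies between @code{IBM-PC} and @code{ibm437},
which only @code{recode} itself can show.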

@node Icon-QNX, , IBM-PC, IBM and MS
@section Unisys' Icon code

@tindex Icon-QNX@r{, and aliases}
@tindex QNX@r{, an alias for a charset}
This charset is available in @code{recode} under the name
@code{Icon-QNX}, with @code{QNX} as an acceptable alias.

The file is using Unisys' Icon way to represent diacritics with code 25
escape sequences, under the system QNX. This is a seven-bit code, even
if eight-bit codes can flow through as part of the IBM-PC charset.

@node CDC, Micros, IBM and MS, Top
@chapter Charsets for CDC machines

@cindex CDC charsets
@cindex charsets for CDC machines
What is now @code{recode} evolved, through many transformations
really, out of a set of programs which were originally written in
@dfn{COMPASS}, Control Data Corporation's assembler, with bits in FORTRAN,
and later rewritten in CDC 6000 Pascal. The CDC heritage shows in the
fact that some old CDC charsets are still supported.

The @code{recode} author used to be familiar with CDC Scope-NOS/BE and
Kronos-NOS, and many CDC formats. Reading CDC tapes directly on other
machines is often a challenge, and @code{recode} does not always solve
it. It helps having tapes created in coded mode instead of binary mode,
and using @code{S} (Stranger) tapes instead of @code{I} (Internal) tapes.
ANSI labels and multi-file tapes might be the source of trouble. There are
ways to handle a few Cyber Record Manager formats, but some of them might
be quite difficult to decode properly after the transfer is done.

The @code{recode} program is usable only for a small subset of NOS text
formats, and surely not with binary textual formats, like @code{UPDATE}
or @code{MODIFY} sources, for example. @code{recode} is not especially
suited for reading 8/12 or 56/60 packing, yet this could easily be arranged
if there was a demand for it.
It does not have the ability to translate
Display Code directly, as the ASCII conversion implied by tape drivers
or FTP does the initial approximation. @code{recode} can decode 6/12
caret notation over Display Code already mapped to ASCII.

@menu
* Display Code::                Control Data's Display Code
* CDC-NOS::                     ASCII 6/12 from NOS
* Bang-Bang::                   ASCII ``bang bang''
@end menu

@node Display Code, CDC-NOS, CDC, CDC
@section Control Data's Display Code

@cindex CDC Display Code, a table
This code is not available in @code{recode}, but is repeated here for
reference. This is a 6-bit code used on CDC mainframes.

@example
Octal display code to graphic   Octal display code to octal ASCII

00 :   20 P   40 5   60 #       00 072   20 120   40 065   60 043
01 A   21 Q   41 6   61 [       01 101   21 121   41 066   61 133
02 B   22 R   42 7   62 ]       02 102   22 122   42 067   62 135
03 C   23 S   43 8   63 %       03 103   23 123   43 070   63 045
04 D   24 T   44 9   64 "       04 104   24 124   44 071   64 042
05 E   25 U   45 +   65 _       05 105   25 125   45 053   65 137
06 F   26 V   46 -   66 !       06 106   26 126   46 055   66 041
07 G   27 W   47 *   67 &       07 107   27 127   47 052   67 046
10 H   30 X   50 /   70 '       10 110   30 130   50 057   70 047
11 I   31 Y   51 (   71 ?       11 111   31 131   51 050   71 077
12 J   32 Z   52 )   72 <       12 112   32 132   52 051   72 074
13 K   33 0   53 $   73 >       13 113   33 060   53 044   73 076
14 L   34 1   54 =   74 @@       14 114   34 061   54 075   74 100
15 M   35 2   55     75 \       15 115   35 062   55 040   75 134
16 N   36 3   56 ,   76 ^       16 116   36 063   56 054   76 136
17 O   37 4   57 .   77 ;       17 117   37 064   57 056   77 073
@end example

In older times, @kbd{:} used octal 63, and octal 0 was not a character.
The table above shows the ASCII glyph interpretation of codes 60 to 77,
yet these 16 codes were once defined differently.
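
As a hedged illustration of the table above, the following Python sketch
maps 6-bit Display Code values to ASCII characters. It is not part of
@code{recode}, which does not provide this code. The names
@code{DISPLAY_TO_ASCII} and @code{decode_display} are hypothetical, the
values are transcribed from the octal ASCII half of the table, and the
sketch deliberately ignores any end of line convention.

```python
# Octal ASCII values for Display Code 00 to 77 (octal), transcribed from
# the right half of the table above, eight codes per source line.
DISPLAY_TO_ASCII = [
    0o072, 0o101, 0o102, 0o103, 0o104, 0o105, 0o106, 0o107,  # 00-07
    0o110, 0o111, 0o112, 0o113, 0o114, 0o115, 0o116, 0o117,  # 10-17
    0o120, 0o121, 0o122, 0o123, 0o124, 0o125, 0o126, 0o127,  # 20-27
    0o130, 0o131, 0o132, 0o060, 0o061, 0o062, 0o063, 0o064,  # 30-37
    0o065, 0o066, 0o067, 0o070, 0o071, 0o053, 0o055, 0o052,  # 40-47
    0o057, 0o050, 0o051, 0o044, 0o075, 0o040, 0o054, 0o056,  # 50-57
    0o043, 0o133, 0o135, 0o045, 0o042, 0o137, 0o041, 0o046,  # 60-67
    0o047, 0o077, 0o074, 0o076, 0o100, 0o134, 0o136, 0o073,  # 70-77
]


def decode_display(codes):
    """Return the ASCII text for an iterable of 6-bit Display Code values."""
    chars = []
    for code in codes:
        if not 0 <= code < 64:
            raise ValueError("not a 6-bit Display Code value: %r" % (code,))
        chars.append(chr(DISPLAY_TO_ASCII[code]))
    return "".join(chars)


# Octal 10, 05, 14, 14, 17 are H, E, L, L, O in the table.
print(decode_display([0o10, 0o05, 0o14, 0o14, 0o17]))  # prints HELLO
```

A plain list indexed by the code value is enough here, since the code has
only 64 positions and every position is defined in the table.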

There is no explicit end of line in Display Code, and the Cyber Record
Manager introduced many new ways to represent them, the traditional end of
lines being reachable by setting @code{RT} to @samp{Z}. If 6-bit bytes
in a file are sequentially counted from 1, a traditional end of line
does exist if bytes 10*@var{n}+9 and 10*@var{n}+10 are both zero for a
given @var{n}, in which case these two bytes are not to be interpreted as
@kbd{::}. Also, up to 9 immediately preceding zero bytes, going backward,
are to be considered as part of the end of line and not interpreted as
@kbd{:}@footnote{This convention replaced an older one saying that up to 4
immediately preceding @emph{pairs} of zero bytes, going backward, are to
be considered as part of the end of line and not interpreted as @kbd{::}.}.

@node CDC-NOS, Bang-Bang, Display Code, CDC
@section ASCII 6/12 from NOS

@tindex CDC-NOS@r{, and its aliases}
@tindex NOS
This charset is available in @code{recode} under the name
@code{CDC-NOS}, with @code{NOS} as an acceptable alias.

@cindex NOS 6/12 code
@cindex caret ASCII code
This is one of the charsets in use on CDC Cyber NOS systems to represent
ASCII, sometimes named @dfn{NOS 6/12} code for coding ASCII. This code is
also known as @dfn{caret ASCII}. It is based on a six-bit character set
in which small letters and control characters are coded using a @kbd{^}
escape and, sometimes, a @kbd{@@} escape.

The routines given here presume that the six-bit code is already expressed
in ASCII by the communication channel, with embedded ASCII @kbd{^} and
@kbd{@@} escapes.

Here is a table showing which characters are being used to encode each
ASCII character.

@example
000 ^5  020 ^#  040     060 0   100 @@A  120 P   140 @@G  160 ^P
001 ^6  021 ^[  041 !   061 1   101 A   121 Q   141 ^A  161 ^Q
002 ^7  022 ^]  042 "   062 2   102 B   122 R   142 ^B  162 ^R
003 ^8  023 ^%  043 #   063 3   103 C   123 S   143 ^C  163 ^S
004 ^9  024 ^"  044 $   064 4   104 D   124 T   144 ^D  164 ^T
005 ^+  025 ^_  045 %   065 5   105 E   125 U   145 ^E  165 ^U
006 ^-  026 ^!  046 &   066 6   106 F   126 V   146 ^F  166 ^V
007 ^*  027 ^&  047 '   067 7   107 G   127 W   147 ^G  167 ^W
010 ^/  030 ^'  050 (   070 8   110 H   130 X   150 ^H  170 ^X
011 ^(  031 ^?  051 )   071 9   111 I   131 Y   151 ^I  171 ^Y
012 ^)  032 ^<  052 *   072 @@D  112 J   132 Z   152 ^J  172 ^Z
013 ^$  033 ^>  053 +   073 ;   113 K   133 [   153 ^K  173 ^0
014 ^=  034 ^@@  054 ,   074 <   114 L   134 \   154 ^L  174 ^1
015 ^   035 ^\  055 -   075 =   115 M   135 ]   155 ^M  175 ^2
016 ^,  036 ^^  056 .   076 >   116 N   136 @@B  156 ^N  176 ^3
017 ^.  037 ^;  057 /   077 ?   117 O   137 _   157 ^O  177 ^4
@end example

@node Bang-Bang, , CDC-NOS, CDC
@section ASCII ``bang bang''

@tindex Bang-Bang
This charset is available in @code{recode} under the name @code{Bang-Bang}.

This code, in use on Cybers at Universit@'e de Montr@'eal mainly, served
to code a lot of French texts. The original name of this charset is
@dfn{ASCII cod@'e Display}. This code is also known as @dfn{Bang-bang}.
It is based on a six-bit character set in which capitals, French
diacritics and a few others are coded using an @kbd{!} escape followed
by a single character, and control characters using a double @kbd{!}
escape followed by a single character.

The routines given here presume that the six-bit code is already expressed
in ASCII by the communication channel, with embedded ASCII @kbd{!}
escapes.

Here is a table showing which characters are being used to encode each
ASCII character.

@example
000 !!@@  020 !!P  040      060 0    100 @@    120 !P   140 !@@   160 P
001 !!A  021 !!Q  041 !"   061 1    101 !A   121 !Q   141 A    161 Q
002 !!B  022 !!R  042 "    062 2    102 !B   122 !R   142 B    162 R
003 !!C  023 !!S  043 #    063 3    103 !C   123 !S   143 C    163 S
004 !!D  024 !!T  044 $    064 4    104 !D   124 !T   144 D    164 T
005 !!E  025 !!U  045 %    065 5    105 !E   125 !U   145 E    165 U
006 !!F  026 !!V  046 &    066 6    106 !F   126 !V   146 F    166 V
007 !!G  027 !!W  047 '    067 7    107 !G   127 !W   147 G    167 W
010 !!H  030 !!X  050 (    070 8    110 !H   130 !X   150 H    170 X
011 !!I  031 !!Y  051 )    071 9    111 !I   131 !Y   151 I    171 Y
012 !!J  032 !!Z  052 *    072 :    112 !J   132 !Z   152 J    172 Z
013 !!K  033 !![  053 +    073 ;    113 !K   133 [    153 K    173 ![
014 !!L  034 !!\  054 ,    074 <    114 !L   134 \    154 L    174 !\
015 !!M  035 !!]  055 -    075 =    115 !M   135 ]    155 M    175 !]
016 !!N  036 !!^  056 .    076 >    116 !N   136 ^    156 N    176 !^
017 !!O  037 !!_  057 /    077 ?    117 !O   137 _    157 O    177 !_
@end example

@node Micros, Miscellaneous, CDC, Top
@chapter Other micro-computer charsets

@cindex NeXT charsets
The @code{NeXT} charset, which used to be especially provided in releases of
@code{recode} before 3.5, has been integrated since as one @w{RFC 1345} table.

@menu
* Apple-Mac::                   Apple's Macintosh code
* AtariST::                     Atari ST code
@end menu

@node Apple-Mac, AtariST, Micros, Micros
@section Apple's Macintosh code

@tindex Apple-Mac
@cindex Macintosh charset
This charset is available in @code{recode} under the name @code{Apple-Mac}.
The shortest way of specifying it in @code{recode} is @code{ap}.

The charset is aimed towards a Macintosh micro-computer from Apple.
This is an eight-bit code. The file is the data fork only. This charset
is fairly old in @code{recode}; its tables were produced a long while ago
by mere inspection of a printed chart of the Macintosh codes and glyphs.

@cindex CR surface, in Macintosh charsets
It has @code{CR} as its implied surface.
This means that, if the original
end of lines have to be preserved while going out of @code{Apple-Mac}, they
should currently be added back through the usage of a surface on the other
charset, or better, just never removed. Here are examples for both cases:

@example
recode ap..l2/cr < @var{input} > @var{output}
recode ap/..l2 < @var{input} > @var{output}
@end example

@w{RFC 1345} brings into @code{recode} 2 other Macintosh charsets. You can
discover them by using @code{grep} over the output of @samp{recode -l}:

@example
recode -l | grep -i mac
@end example

@noindent
@tindex macintosh@r{, a charset, and its aliases}
@tindex macintosh_ce@r{, and its aliases}
@tindex mac
@tindex macce
Charsets @code{macintosh} and @code{macintosh_ce}, as well as their aliases
@code{mac} and @code{macce}, also have @code{CR} as their implied surface.

There are a few discrepancies between the @code{Apple-Mac} charset and
the very similar @w{RFC 1345} charset @code{macintosh}, which have not been
analysed yet, so the charsets are being kept separate for now. This might
change in the future, and the @code{Apple-Mac} charset might disappear.
Wizards would be interested in comparing the output of these two commands:

@example
recode -vh Apple-Mac..Latin-1
recode -vh macintosh..Latin-1
@end example

@noindent
The first command uses the charset prior to @w{RFC 1345} introduction.
Both methods give different recodings. These differences are annoying;
the fuzziness will have to be explained and settled one day.

@cindex @code{recode}, a Macintosh port
As a side note, some people ask if there is a Macintosh port of the
@code{recode} program. I'm not aware of any. I presume that if the tool
fills a need for Macintosh users, someone will port it one of these days.
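
To make the implied @code{CR} surface described in this section more
concrete, here is a hedged Python sketch of what removing or keeping that
surface means for line endings. It assumes Python's @code{mac_roman} codec
as a rough stand-in for @code{Apple-Mac} (the two tables are not guaranteed
to match) and targets @code{ISO-8859-1} for simplicity. The function name
@code{mac_to_latin1} and its @code{keep_cr} parameter are hypothetical and
are not part of @code{recode}.

```python
def mac_to_latin1(data, keep_cr=False):
    """Recode Macintosh bytes to ISO-8859-1 bytes.

    With keep_cr false, each CR line ending becomes LF, which mimics
    removing the implied CR surface.  With keep_cr true, line endings
    are left alone, as if the surface had never been removed.
    """
    text = data.decode("mac_roman")
    if not keep_cr:
        text = text.replace("\r", "\n")
    # Characters missing from ISO-8859-1 are replaced by a question mark.
    return text.encode("latin-1", errors="replace")


# 0x8E is e with an acute accent in Mac Roman, 0xE9 in ISO-8859-1.
sample = b"caf\x8e\rth\x8e\r"
print(mac_to_latin1(sample))                # b'caf\xe9\nth\xe9\n'
print(mac_to_latin1(sample, keep_cr=True))  # b'caf\xe9\rth\xe9\r'
```

The first call is similar in spirit to recoding out of @code{ap} with the
surface removed, the second to recoding out of @code{ap/}, where the line
endings flow through untouched.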
3644 3645@node AtariST, , Apple-Mac, Micros 3646@section Atari ST code 3647 3648@tindex AtariST 3649This charset is available in @code{recode} under the name @code{AtariST}. 3650 3651This is the character set used on the Atari ST/TT/Falcon. This is similar 3652to @code{IBM-PC}, but differs in some details: it includes some more accented 3653characters, the graphic characters are mostly replaced by Hebrew characters, 3654and there is a true German @kbd{sharp s} different from Greek @kbd{beta}. 3655 3656About the end-of-line conversions: the canonical end-of-line on the 3657Atari is @samp{\r\n}, but unlike @code{IBM-PC}, the OS makes no 3658difference between text and binary input/output; it is up to the 3659application how to interpret the data. In fact, most of the libraries 3660that come with compilers can grok both @samp{\r\n} and @samp{\n} as end 3661of lines. Many of the users who also have access to Unix systems prefer 3662@samp{\n} to ease porting Unix utilities. So, for easing reversibility, 3663@code{recode} tries to let @samp{\r} undisturbed through recodings. 3664 3665@node Miscellaneous, Surfaces, Micros, Top 3666@chapter Various other charsets 3667 3668Even if these charsets were originally added to @code{recode} for 3669handling texts written in French, they find other uses. We did use them 3670a lot for writing French diacriticised texts in the past, so @code{recode} 3671knows how to handle these particularly well for French texts. 

@menu
* HTML::                        World Wide Web representations
* LaTeX::                       LaTeX macro calls
* Texinfo::                     GNU project documentation files
* Vietnamese::                  Vietnamese charsets
* African::                     African charsets
* Others::                      Cyrillic and other charsets
* Texte::                       Easy French conventions
* Mule::                        Mule as a multiplexed charset
@end menu

@node HTML, LaTeX, Miscellaneous, Miscellaneous
@section World Wide Web representations

@cindex HTML
@cindex SGML
@cindex XML
@cindex Web
@cindex World Wide Web
@cindex WWW
@cindex markup language
@cindex entities
@cindex character entities
@cindex character entity references
@cindex numeric character references
Character entities have been introduced by SGML and made widely popular
through HTML, the markup language in use for the World Wide Web, or Web or
WWW for short.  For representing @emph{unusual} characters, HTML texts use
special sequences, beginning with an ampersand @kbd{&} and ending with a
semicolon @kbd{;}.  The sequence may itself start with a number sign @kbd{#}
and be followed by digits, so forming a @dfn{numeric character reference},
or else be an alphabetic identifier, so forming a @dfn{character entity
reference}.

The HTML standards have been revised into different HTML levels over time,
and the list of allowable character entities differs between them.  The
later XML, meant to simplify many things, has an option
(@samp{standalone=yes}) which much restricts that list.  The @code{recode}
library is able to convert character references between their mnemonic form
and their numeric form, depending on the targeted HTML standard level.
It can also, of course, convert between HTML and various other charsets.

Here is a list of those HTML variants which @code{recode} supports.
Some notes have been provided by Francois Yergeau @email{yergeau@@alis.com}.

@table @code
@item XML-standalone
@tindex h0
@tindex XML-standalone
This charset is available in @code{recode} under the name
@code{XML-standalone}, with @code{h0} as an acceptable alias.  It is
documented in section 4.1 of @uref{http://www.w3.org/TR/REC-xml}.
It only knows @samp{&amp;}, @samp{&gt;}, @samp{&lt;}, @samp{&quot;}
and @samp{&apos;}.

@item HTML_1.1
@tindex HTML_1.1
@tindex h1
This charset is available in @code{recode} under the name @code{HTML_1.1},
with @code{h1} as an acceptable alias.  HTML 1.0 was never really documented.

@item HTML_2.0
@tindex HTML_2.0
@tindex RFC1866
@tindex 1866
@tindex h2
This charset is available in @code{recode} under the name @code{HTML_2.0},
and has @code{RFC1866}, @code{1866} and @code{h2} for aliases.  HTML 2.0
entities are listed in @w{RFC 1866}.  Basically, there is an entity for
each @emph{alphabetical} character in the right part of @w{ISO 8859-1}.
In addition, there are four entities for syntax-significant ASCII characters:
@samp{&amp;}, @samp{&gt;}, @samp{&lt;} and @samp{&quot;}.

@item HTML-i18n
@tindex HTML-i18n
@tindex RFC2070
@tindex 2070
This charset is available in @code{recode} under the name
@code{HTML-i18n}, and has @code{RFC2070} and @code{2070} for
aliases.  @w{RFC 2070} added entities to cover the whole right
part of @w{ISO 8859-1}.  The list is conveniently accessible at
@uref{http://www.alis.com:8085/ietf/html/html-latin1.sgml}.  In addition,
four i18n-related entities were added: @samp{&zwnj;} (@samp{&#8204;}),
@samp{&zwj;} (@samp{&#8205;}), @samp{&lrm;} (@samp{&#8206;}) and @samp{&rlm;}
(@samp{&#8207;}).

@item HTML_3.2
@tindex HTML_3.2
@tindex h3
This charset is available in @code{recode} under the name
@code{HTML_3.2}, with @code{h3} as an acceptable alias.
@uref{http://www.w3.org/TR/REC-html32.html, HTML 3.2} took up the full
@w{Latin-1} list but not the i18n-related entities from @w{RFC 2070}.

@item HTML_4.0
@tindex HTML_4.0
@tindex h4
@tindex h
This charset is available in @code{recode} under the name @code{HTML_4.0},
and has @code{h4} and @code{h} for aliases.  Beware that the particular
alias @code{h} is not @emph{tied} to HTML 4.0, but to the highest HTML
level supported by @code{recode}; so it might later represent HTML level
5 if this is ever created.  @uref{http://www.w3.org/TR/REC-html40/,
HTML 4.0} has the whole @w{Latin-1} list, a set of entities for
symbols, mathematical symbols, and Greek letters, and another set for
markup-significant and internationalization characters comprising the
four ASCII entities, the four i18n-related ones from @w{RFC 2070}, plus
some more.
See @uref{http://www.w3.org/TR/REC-html40/sgml/entities.html}.

@end table

Printable characters from @w{Latin-1} may be used directly in an HTML text.
However, partly because people have deficient keyboards, partly because
people want to transmit HTML texts over non 8-bit clean channels while not
using MIME, it is common (yet debatable) to use character entity references
even for @w{Latin-1} characters, when they fall outside ASCII (that is,
when they have the 8th bit set).

When you recode from another charset to @code{HTML}, beware that all
occurrences of double quotes, ampersands, and left or right angle brackets
are translated into special sequences.  However, in practice, people often
use ampersands and angle brackets in the other charset for introducing
HTML commands, compromising it: it is not pure HTML, nor is it pure
other charset.  These particular translations can be rather inconvenient;
they may be specifically inhibited through the command option @samp{-d}
(@pxref{Mixed}).

Codes not having a mnemonic entity are output by @code{recode} using the
@samp{&#@var{nnn};} notation, where @var{nnn} is a decimal representation
of the UCS code value.  When there is an entity name for a character, it
is always preferred over a numeric character reference.  ASCII printable
characters are always generated directly.  So is the newline.  While reading
HTML, @code{recode} supports numeric character references as alternate
writings, even when written as hexadecimal numbers, as in @samp{&#xfffd;}.
This is documented in:

@example
http://www.w3.org/TR/REC-html40/intro/sgmltut.html#h-3.2.3
@end example

When @code{recode} translates to HTML, the translation occurs according to
the HTML level as selected by the goal charset.  When translating @emph{from}
HTML, @code{recode} not only accepts the character entity references known at
that level, but also those of all other levels, as well as a few alternative
special sequences, to be forgiving to files using other HTML standards.

@cindex normalise an HTML file
@cindex HTML normalization
The @code{recode} program can be used to @emph{normalise} an HTML file using
oldish conventions.  For example, it accepts @samp{&AE;}, as this once was a
valid writing, somewhere.  However, it should always produce @samp{&AElig;}
instead of @samp{&AE;}.  Yet, this is not completely true.  If one does:

@example
recode h3..h3 < @var{input}
@end example

@noindent
the operation will be optimised into a mere copy, and you can get @samp{&AE;}
this way, if you had some in your input file.  But if you explicitly defeat
the optimisation, like this maybe:

@example
recode h3..u2,u2..h3 < @var{input}
@end example

@noindent
then @samp{&AE;} should be normalised into @samp{&AElig;} by the operation.
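
The output rules of this section (entity names preferred, decimal
references otherwise, printable ASCII and newline copied through) can be
imitated with a short Python sketch.  This is an illustration under stated
assumptions, not the @code{recode} implementation: it relies on the HTML 4
entity table shipped with Python instead of the tables of @code{recode},
and the names @code{to_html} and @code{normalise} are hypothetical.

```python
# Illustration only: a rough imitation of the output rules described in
# this section, using Python's HTML 4 entity table, not recode's tables.
import html
from html.entities import codepoint2name

MARKUP = '&<>"'  # characters always written as character entity references


def to_html(text: str) -> str:
    """Encode text, preferring entity names over numeric references."""
    pieces = []
    for char in text:
        code = ord(char)
        if char in MARKUP:
            pieces.append("&" + codepoint2name[code] + ";")
        elif char == "\n" or 32 <= code < 127:
            pieces.append(char)  # printable ASCII and newline pass through
        elif code in codepoint2name:
            pieces.append("&" + codepoint2name[code] + ";")
        else:
            pieces.append("&#" + str(code) + ";")  # decimal UCS code value
    return "".join(pieces)


def normalise(markup: str) -> str:
    """Decode every reference, then encode again in the preferred form."""
    return to_html(html.unescape(markup))
```

With these definitions, @code{normalise} turns both a decimal and a
hexadecimal reference to the same character into the single named form,
which mirrors the effect of forcing a real recoding step instead of a
mere copy.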
3840 3841@node LaTeX, Texinfo, HTML, Miscellaneous 3842@section La@TeX{} macro calls 3843 3844@tindex LaTeX@r{, a charset} 3845@tindex ltex 3846@cindex La@TeX{} files 3847@cindex @TeX{} files 3848This charset is available in @code{recode} under the name @code{LaTeX} 3849and has @code{ltex} as an alias. It is used for ASCII files coded to be 3850read by La@TeX{} or, in certain cases, by @TeX{}. 3851 3852Whenever you recode from another charset to @code{LaTeX}, beware that all 3853occurrences of backslashes @kbd{\} are translated into the string 3854@samp{\backslash@{@}}. However, in practice, people often use backslashes 3855in the other charset for introducing @TeX{} commands, compromising it: 3856it is not pure @TeX{}, nor it is pure other charset. This translation 3857of backslashes into @samp{\backslash@{@}} can be rather inconvenient, 3858it may be inhibited through the command option @samp{-d} (@pxref{Mixed}). 3859 3860@node Texinfo, Vietnamese, LaTeX, Miscellaneous 3861@section GNU project documentation files 3862 3863@tindex Texinfo@r{, a charset} 3864@tindex texi 3865@tindex ti 3866@cindex Texinfo files 3867This charset is available in @code{recode} under the name @code{Texinfo} 3868and has @code{texi} and @code{ti} for aliases. It is used by the GNU 3869project for its documentation. Texinfo files may be converted into Info 3870files by the @code{makeinfo} program and into nice printed manuals by 3871the @TeX{} system. 3872 3873Even if @code{recode} may transform other charsets to Texinfo, it may 3874not read Texinfo files yet. In these times, usages are also changing 3875between versions of Texinfo, and @code{recode} only partially succeeds 3876in correctly following these changes. So, for now, Texinfo support in 3877@code{recode} should be considered as work still in progress (!). 

@node Vietnamese, African, Texinfo, Miscellaneous
@section Vietnamese charsets

@cindex Vietnamese charsets
We are currently experimenting with the implementation, in @code{recode},
of a few character sets and transliterated forms to handle the Vietnamese
language.  They are briefly summarised here.

@table @code
@item TCVN
@tindex TCVN@r{, for Vietnamese}
@tindex VN1@r{, maybe not available}
@tindex VN2@r{, maybe not available}
@tindex VN3@r{, maybe not available}
The TCVN charset has an incomplete name.  It might be one of the three
charsets @code{VN1}, @code{VN2} or @code{VN3}.  Yet @code{VN2} might be a
second version of @code{VISCII}.  To be clarified.

@item VISCII
@tindex VISCII
This is an 8-bit character set which seems to be rather popular for
writing Vietnamese.

@item VPS
@tindex VPS
This is an 8-bit character set for Vietnamese.  Not much reference
material is available for it.

@item VIQR
@tindex VIQR
The VIQR convention is a 7-bit, @code{ASCII} transliteration for Vietnamese.

@item VNI
@tindex VNI
The VNI convention is an 8-bit, @code{Latin-1} transliteration for Vietnamese.
@end table

@tindex 1129@r{, not available}
@tindex CP1129@r{, not available}
@tindex 1258@r{, not available}
@tindex CP1258@r{, not available}
Still lacking for Vietnamese in @code{recode} are the charsets @code{CP1129}
and @code{CP1258}.

@node African, Others, Vietnamese, Miscellaneous
@section African charsets

@cindex African charsets
Some African character sets are available for a few languages, when these
are heavily used in countries where French is also currently spoken.
3928 3929@tindex AFRFUL-102-BPI_OCIL@r{, and aliases} 3930@tindex bambara 3931@tindex bra 3932@tindex ewondo 3933@tindex fulfude 3934@tindex AFRFUL-103-BPI_OCIL@r{, and aliases} 3935@tindex t-bambara 3936@tindex t-bra 3937@tindex t-ewondo 3938@tindex t-fulfude 3939One African charset is usable for Bambara, Ewondo and Fulfude, as well 3940as for French. This charset is available in @code{recode} under the name 3941@code{AFRFUL-102-BPI_OCIL}. Accepted aliases are @code{bambara}, @code{bra}, 3942@code{ewondo} and @code{fulfude}. Transliterated forms of the same are 3943available under the name @code{AFRFUL-103-BPI_OCIL}. Accepted aliases 3944are @code{t-bambara}, @code{t-bra}, @code{t-ewondo} and @code{t-fulfude}. 3945 3946@tindex AFRLIN-104-BPI_OCIL 3947@tindex lingala 3948@tindex lin 3949@tindex sango 3950@tindex wolof 3951@tindex AFRLIN-105-BPI_OCIL 3952@tindex t-lingala 3953@tindex t-lin 3954@tindex t-sango 3955@tindex t-wolof 3956Another African charset is usable for Lingala, Sango and Wolof, as well 3957as for French. This charset is available in @code{recode} under the 3958name @code{AFRLIN-104-BPI_OCIL}. Accepted aliases are @code{lingala}, 3959@code{lin}, @code{sango} and @code{wolof}. Transliterated forms of the same 3960are available under the name @code{AFRLIN-105-BPI_OCIL}. Accepted aliases 3961are @code{t-lingala}, @code{t-lin}, @code{t-sango} and @code{t-wolof}. 3962 3963@tindex AFRL1-101-BPI_OCIL 3964@tindex t-francais 3965@tindex t-fra 3966To ease exchange with @code{ISO-8859-1}, there is a charset conveying 3967transliterated forms for @w{Latin-1} in a way which is compatible with the other 3968African charsets in this series. This charset is available in @code{recode} 3969under the name @code{AFRL1-101-BPI_OCIL}. Accepted aliases are @code{t-fra} 3970and @code{t-francais}. 
3971 3972@node Others, Texte, African, Miscellaneous 3973@section Cyrillic and other charsets 3974 3975@cindex Cyrillic charsets 3976The following Cyrillic charsets are already available in @code{recode} 3977through @w{RFC 1345} tables: @code{CP1251} with aliases @code{1251}, @code{ 3978ms-cyrl} and @code{windows-1251}; @code{CSN_369103} with aliases 3979@code{ISO-IR-139} and @code{KOI8_L2}; @code{ECMA-cyrillic} with aliases 3980@code{ECMA-113}, @code{ECMA-113:1986} and @code{iso-ir-111}, @code{IBM880} 3981with aliases @code{880}, @code{CP880} and @code{EBCDIC-Cyrillic}; 3982@code{INIS-cyrillic} with alias @code{iso-ir-51}; @code{ISO-8859-5} with 3983aliases @code{cyrillic}, @code{ ISO-8859-5:1988} and @code{iso-ir-144}; 3984@code{KOI-7}; @code{KOI-8} with alias @code{GOST_19768-74}; @code{KOI8-R}; 3985@code{KOI8-RU} and finally @code{KOI8-U}. 3986 3987There seems to remain some confusion in Roman charsets for Cyrillic 3988languages, and because a few users requested it repeatedly, @code{recode} 3989now offers special services in that area. Consider these charsets as 3990experimental and debatable, as the extraneous tables describing them are 3991still a bit fuzzy or non-standard. Hopefully, in the long run, these 3992charsets will be covered in Keld Simonsen's works to the satisfaction of 3993everybody, and this section will merely disappear. 3994 3995@table @code 3996@item KEYBCS2 3997@tindex KEYBCS2 3998@tindex Kamenicky 3999This charset is available under the name @code{KEYBCS2}, with 4000@code{Kamenicky} as an accepted alias. 4001 4002@item CORK 4003@tindex CORK 4004@tindex T1 4005This charset is available under the name @code{CORK}, with @code{T1} 4006as an accepted alias. 4007 4008@item KOI-8_CS2 4009@tindex KOI-8_CS2 4010This charset is available under the name @code{KOI-8_CS2}. 
@end table

@node Texte, Mule, Others, Miscellaneous
@section Easy French conventions

@tindex Texte
@tindex txte
This charset is available in @code{recode} under the name @code{Texte}
and has @code{txte} for an alias.  It is a seven-bit code, identical
to @code{ASCII-BS}, save for French diacritics which are noted using a
slightly different convention.

At text entry time, these conventions provide a little speed up.  At read
time, they slightly improve the readability over a few alternate ways
of coding diacritics.  Of course, it would be better to have a specialised
keyboard to make direct eight-bit entries and fonts for immediately
displaying eight-bit ISO @w{Latin-1} characters.  But not everybody is so
fortunate.  In a few mailing environments, and sadly enough, it still
happens that the eighth bit is often willfully destroyed.

@cindex Easy French
Easy French has been in use in France for a while.  I only slightly
adapted it (the diaeresis option) to make it more comfortable to several
usages in Qu@'ebec originating from Universit@'e de Montr@'eal.  In fact,
the main problem for me was not necessarily to invent Easy French, but
to recognise the ``best'' convention to use (best not being defined
here), and to try to solve the main pitfalls associated with the selected
convention.  In short, we have:

@table @kbd
@item e'
for @kbd{e} (and some other vowels) with an acute accent,
@item e`
for @kbd{e} (and some other vowels) with a grave accent,
@item e^
for @kbd{e} (and some other vowels) with a circumflex accent,
@item e"
for @kbd{e} (and some other vowels) with a diaeresis,
@item c,
for @kbd{c} with a cedilla.
@end table

@noindent
There is no attempt at expressing the @kbd{ae} and @kbd{oe} diphthongs.
French also uses tildes over @kbd{n} and @kbd{a}, but seldom, and this
is not represented either.  In some countries, @kbd{:} is used instead
of @kbd{"} to mark diaeresis.  @code{recode} supports only one convention
per call, depending on the @samp{-c} option of the @code{recode} command.
French quotes (sometimes called ``angle quotes'') are noted the same way
English quotes are noted in @TeX{}, @emph{id est} by @kbd{``} and @kbd{''}.
No effort has been made to preserve Latin ligatures (@kbd{@ae{}}, @kbd{@oe{}})
which are representable in several other charsets.  So, these ligatures
may be lost through Easy French conventions.

The convention is prone to losing information, because the diacritic
meaning overloads some characters that already have other uses.  To
alleviate this, some knowledge of the French language is built into
the recognition routines.  So, the following subtleties are systematically
obeyed by the various recognisers.

@enumerate
@item
A comma which follows a @kbd{c} is interpreted as a cedilla only if it is
followed by one of the vowels @kbd{a}, @kbd{o} or @kbd{u}.

@item
A single quote which follows an @kbd{e} does not necessarily mean an acute
accent if it is followed by a single other one.  For example:

@table @kbd
@item e'
will give an @kbd{e} with an acute accent.
@item e''
will give a simple @kbd{e}, with a closing quotation mark.
@item e'''
will give an @kbd{e} with an acute accent, followed by a closing quotation
mark.
@end table

There is a problem induced by this convention if there are English
quotations within a French text.  In sentences like:

@example
There's a meeting at Archie's restaurant.
@end example

the single quotes will be mistaken twice for acute accents.  So English
contractions and suffix possessives could be mangled.
4099 4100@item 4101A double quote or colon, depending on @samp{-c} option, which follows a 4102vowel is interpreted as diaeresis only if it is followed by another letter. 4103But there are in French several words that @emph{end} with a diaeresis, 4104and the @code{recode} library is aware of them. There are words ending in 4105``igue'', either feminine words without a relative masculine (besaigu@"e 4106and cigu@"e), or feminine words with a relative masculine@footnote{There 4107are supposed to be seven words in this case. So, one is missing.} 4108(aigu@"e, ambigu@"e, contigu@"e, exigu@"e, subaigu@"e and suraigu@"e). 4109There are also words not ending in ``igue'', but instead, either ending by 4110``i''@footnote{Look at one of the following sentences (the second has to 4111be interpreted with the @samp{-c} option): 4112 4113@example 4114"Ai"e! Voici le proble`me que j'ai" 4115Ai:e! Voici le proble`me que j'ai: 4116@end example 4117 4118There is an ambiguity between an 4119@tex 4120a\"\i, 4121@end tex 4122@ifinfo 4123ai", 4124@end ifinfo 4125@c FIXME: why not use @dotless{} here? It works, AFAIK. 4126@ignore 4127a@"{@dotless{i}}, 4128@end ignore 4129the small animal, and the indicative future of @emph{avoir} (first person 4130singular), when followed by what could be a diaeresis mark. 
Fortunately,
the case is solved by the fact that an apostrophe always precedes the
verb and almost never the animal.}
@tex
(a\"\i, conga\"\i, go\"\i, ha\"\i ka\"\i, inou\"\i, sa\"\i, samura\"\i,
tha\"\i{} and toka\"\i),
@end tex
@ifinfo
(ai", congai", goi", hai"kai", inoui", sai", samurai", thai" and tokai"),
@end ifinfo
@ignore
(a@"{@dotless{i}}, conga@"{@dotless{i}}, go@"{@dotless{i}},
ha@"{@dotless{i}}ka@"{@dotless{i}}, inou@"{@dotless{i}}, sa@"{@dotless{i}},
samura@"{@dotless{i}}, tha@"{@dotless{i}} and toka@"{@dotless{i}}),
@end ignore
ending by ``e'' (cano@"e) or ending by ``u''@footnote{I did not pay
attention to proper nouns, but this one showed up as being fairly evident.}
(Esa@"u).

Just to complete this topic, note that it would be wrong to make a rule
for all words ending in ``igue'' as needing a diaeresis, as there are
counter-examples (becfigue, b@`esigue, bigue, bordigue, bourdigue, brigue,
contre-digue, digue, d'intrigue, fatigue, figue, garrigue, gigue, igue,
intrigue, ligue, prodigue, sarigue and zigue).
@end enumerate

@node Mule, , Texte, Miscellaneous
@section Mule as a multiplexed charset

@tindex Mule@r{, a charset}
@cindex multiplexed charsets
@cindex super-charsets
This version of @code{recode} barely starts supporting multiplexed or
super-charsets, that is, those encoding methods by which a single text
stream may contain a combination of more than one constituent charset.
The only multiplexed charset in @code{recode} is @code{Mule}, and even
then, it is only very partially implemented: the only correspondence
available is with @code{Latin-1}.  The author implemented this quickly,
only because he needed it for himself.  However, it is intended that
Mule support become more real in subsequent releases of @code{recode}.
4170 4171Multiplexed charsets are not to be confused with mixed charset texts 4172(@pxref{Mixed}). For mixed charset input, the rules allowing to distinguish 4173which charset is current, at any given place, are kind of informal, and 4174driven from the semantics of what the file contains. On the other side, 4175multiplexed charsets are @emph{designed} to be interpreted fairly precisely, 4176and quite independently of any informational context. 4177 4178@cindex MULE, in Emacs 4179The spelling @code{Mule} originally stands for @cite{@emph{mul}tilingual 4180@emph{e}nhancement to GNU Emacs}, it is the result of a collective 4181effort orchestrated by Handa Ken'ichi since 1993. When @code{Mule} got 4182rewritten in the main development stream of GNU Emacs 20, the FSF renamed 4183it @code{MULE}, meaning @cite{@emph{mul}tilingual @emph{e}nvironment 4184in GNU Emacs}. Even if the charset @code{Mule} is meant to stay 4185internal to GNU Emacs, it sometimes breaks loose in external files, 4186and as a consequence, a recoding tool is sometimes needed. Within Emacs, 4187@code{Mule} comes with @code{Leim}, which stands for @cite{@emph{l}ibraries 4188of @emph{e}macs @emph{i}nput @emph{m}ethods}. One of these libraries is 4189named @code{quail}@footnote{Usually, quail means quail egg in Japanese, 4190while egg alone is usually chicken egg. Both quail egg and chicken 4191egg are popular food in Japan. The @code{quail} input system has 4192been named because it is smaller that the previous @code{EGG} system. 4193As for @code{EGG}, it is the translation of @code{TAMAGO}. This word 4194comes from the Japanese sentence @cite{@emph{ta}kusan @emph{ma}tasete 4195@emph{go}mennasai}, meaning @cite{sorry to have let you wait so long}. 4196Of course, the publication of @code{EGG} has been delayed many times@dots{} 4197(Story by Takahashi Naoto)}. 
4198 4199@node Surfaces, Internals, Miscellaneous, Top 4200@chapter All about surfaces 4201@cindex surface, what it is 4202 4203@cindex trivial surface 4204The @dfn{trivial surface} consists of using a fixed number of bits 4205(often eight) for each character, the bits together hold the integer 4206value of the index for the character in its charset table. There are 4207many kinds of surfaces, beyond the trivial one, all having the purpose 4208of increasing selected qualities for the storage or transmission. 4209For example, surfaces might increase the resistance to channel limits 4210(@code{Base64}), the transmission speed (@code{gzip}), the information 4211privacy (@code{DES}), the conformance to operating system conventions 4212(@code{CR-LF}), the blocking into records (@code{VB}), and surely other 4213things as well@footnote{These are mere examples to explain the concept, 4214@code{recode} only has @code{Base64} and @code{CR-LF}, actually.}. 4215Many surfaces may be applied to a stream of characters from a charset, 4216the order of application of surfaces is important, and surfaces 4217should be removed in the reverse order of their application. 4218 4219Even if surfaces may generally be applied to various charsets, some 4220surfaces were specifically designed for a particular charset, and would 4221not make much sense if applied to other charsets. In such cases, these 4222conceptual surfaces have been implemented as @code{recode} charsets, 4223instead of as surfaces. This choice yields to cleaner syntax 4224and usage. @xref{Universal}. 4225 4226@cindex surfaces, implementation in @code{recode} 4227@tindex data@r{, a special charset} 4228@tindex tree@r{, a special charset} 4229Surfaces are implemented within @code{recode} as special charsets 4230which may only transform to or from the @code{data} or @code{tree} 4231special charsets. 
Clever users may use this knowledge for writing
surface names in requests exactly as if they were pure charsets, when
the only need is to change surfaces without any kind of recoding between
real charsets.  In such contexts, either @code{data} or @code{tree} may
also be used as if it were some kind of generic, anonymous charset: the
request @samp{data..@var{surface}} merely adds the given @var{surface},
while the request @samp{@var{surface}..data} removes it.

@cindex structural surfaces
@cindex surfaces, structural
@cindex surfaces, trees
The @code{recode} library distinguishes between mere data surfaces and
structural surfaces, also called tree surfaces for short.  Structural
surfaces might allow, in the long run, transformations between a few
specialised representations of structural information like MIME parts,
Perl or Python initialisers, LISP S-expressions, XML, Emacs outlines, etc.

We are still experimenting with surfaces in @code{recode}.  The concept opens
the doors to many avenues; it is not clear yet which ones are worth pursuing,
and which should be abandoned.  In particular, implementation of structural
surfaces is barely starting; there is not even a commitment that tree
surfaces will stay in @code{recode}, if they do prove to be more cumbersome
than useful.  This chapter presents all surfaces currently available.

@menu
* Permutations::                Permuting groups of bytes
* End lines::                   Representation for end of lines
* MIME::                        MIME contents encodings
* Dump::                        Interpreted character dumps
* Test::                        Artificial data for testing
@end menu

@node Permutations, End lines, Surfaces, Surfaces
@section Permuting groups of bytes
@cindex permutations of groups of bytes

@cindex byte order swapping
@cindex endianness, changing
A permutation is a surface transformation which reorders groups of
eight-bit bytes.
A @emph{21} permutation exchanges pairs of successive
bytes.  If the text contains an odd number of bytes, the last byte is
merely copied.  A @emph{4321} permutation inverts the order of quadruples
of bytes.  If the text does not contain a multiple of four bytes, the
remaining bytes are nevertheless permuted as @emph{321} if there are
three bytes, @emph{21} if there are two bytes, or merely copied otherwise.

@table @code
@item 21
@tindex 21-Permutation
@tindex swabytes
This surface is available in @code{recode} under the name
@code{21-Permutation} and has @code{swabytes} for an alias.

@item 4321
@tindex 4321-Permutation
This surface is available in @code{recode} under the name
@code{4321-Permutation}.
@end table

@node End lines, MIME, Permutations, Surfaces
@section Representation for end of lines
@cindex end of line format

The same charset might slightly differ, from one system to another, for
the single fact that end of lines are not represented identically on all
systems.  The representation for an end of line within @code{recode}
is the @code{ASCII} or @code{UCS} code with value 10, or @kbd{LF}.  Other
conventions for representing end of lines are available through surfaces.

@table @code
@item CR
@tindex CR@r{, a surface}
This convention is popular on Apple's Macintosh machines.  When this
surface is applied, each line is terminated by @kbd{CR}, which has
@code{ASCII} value 13.  Unless the library is operating in strict mode,
adding or removing the surface will in fact @emph{exchange} @kbd{CR} and
@kbd{LF}, for better reversibility.  However, in strict mode, the exchange
does not happen: any @kbd{CR} will be copied verbatim while applying
the surface, and any @kbd{LF} will be copied verbatim while removing it.

This surface is available in @code{recode} under the name @code{CR};
it does not have any aliases.  This is the implied surface for the Apple
Macintosh related charsets.

@item CR-LF
@tindex CR-LF@r{, a surface}
This convention is popular on Microsoft systems running on IBM PCs and
compatibles.  When this surface is applied, each line is terminated by
a sequence of two characters: one @kbd{CR} followed by one @kbd{LF},
in that order.

@cindex Ctrl-Z, discarding
For compatibility with oldish MS-DOS systems, removing a @code{CR-LF}
surface will discard the first encountered @kbd{C-z}, which has
@code{ASCII} value 26, and everything following it in the text.
Adding this surface will not, however, append a @kbd{C-z} to the result.

@tindex cl
This surface is available in @code{recode} under the name @code{CR-LF}
and has @code{cl} for an alias.  This is the implied surface for the IBM
or Microsoft related charsets or code pages.
@end table

Some other charsets might have their own representation for an end of
line, which is different from @kbd{LF}.  For example, this is the case
of various @code{EBCDIC} charsets, or @code{Icon-QNX}.  The recoding of
end of lines is intimately tied into such charsets; it is not available
separately as surfaces.

@node MIME, Dump, End lines, Surfaces
@section MIME contents encodings
@cindex MIME encodings

@cindex RFC 2045
@w{RFC 2045} defines two 7-bit surfaces, meant to prepare 8-bit messages for
transmission.  Base64 is especially usable for binary entities, while
Quoted-Printable is especially usable for text entities, in those cases
where the lower 128 characters of the underlying charset coincide with ASCII.
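
As an illustration of the difference between the two surfaces, the
following Python sketch encodes the same @w{Latin-1} bytes both ways.  It
uses the standard @code{base64} and @code{quopri} modules of Python, not
@code{recode}, and the helper names are hypothetical; it only shows what
the @w{RFC 2045} encodings look like.

```python
# Illustration only: the two RFC 2045 encodings, as produced by Python's
# standard library rather than by recode.
import base64
import quopri


def add_base64(data: bytes) -> bytes:
    """Encode arbitrary bytes so that only 7-bit characters remain."""
    return base64.b64encode(data)


def add_quoted_printable(data: bytes) -> bytes:
    """Escape 8-bit bytes while leaving plain ASCII text readable."""
    return quopri.encodestring(data)


if __name__ == "__main__":
    sample = "d\u00e9j\u00e0 vu".encode("latin-1")
    print(add_base64(sample))
    print(add_quoted_printable(sample))
```

Base64 makes the whole text unreadable but treats every byte alike, while
Quoted-Printable leaves the ASCII letters in place and escapes only the
accented ones, which is why it suits text that is mostly ASCII.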

@table @code
@tindex Base64
@tindex b64
@tindex 64
@item Base64
This surface is available in @code{recode} under the name @code{Base64},
with @code{b64} and @code{64} as acceptable aliases.

@item Quoted-Printable
@tindex Quoted-Printable
@tindex quote-printable
@tindex QP
This surface is available in @code{recode} under the name
@code{Quoted-Printable}, with @code{quote-printable} and @code{QP} as
acceptable aliases.
@end table

Note that @code{UTF-7}, which may also be considered as a MIME surface,
is provided as a genuine charset instead, as it necessarily relates to
@code{UCS-2} and nothing else. @xref{UTF-7}.

A little historical note, also showing the three levels of acceptance of
Internet standards. MIME changed from a ``Proposed Standard''
(@w{RFC 1341--1344}, 1992) to a ``Draft Standard'' (@w{RFC 1521--1523})
in 1993, and was @emph{recycled} as a ``Draft Standard'' in 1996-11.
It is not yet a ``Full Standard''.

@node Dump, Test, MIME, Surfaces
@section Interpreted character dumps

@cindex dumping characters
Dumps are surfaces meant to express, in ways which are a bit more readable,
the bit patterns used to represent characters. They allow the inspection
or debugging of character streams, and they may also assist somewhat in the
production of C source code which, once compiled, would hold in memory a
copy of the original coding. However, @code{recode} does not attempt, in
any way, to produce complete C source files in dumps. User hand editing
or @file{Makefile} trickery is still needed for adding missing lines.
Dumps may be given in decimal, hexadecimal and octal, and be based over
chunks of either one, two or four eight-bit bytes. Formatting has been
chosen to respect the C language syntax for number constants, with commas
and newlines inserted appropriately.
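
To make the formatting concrete, here is a small Python sketch of a
one-byte hexadecimal dump and of its removal. It is only an illustration
under stated assumptions: the function names @code{hex_dump} and
@code{undump}, the choice of eight values per line, and the exact spacing
are made up for the example, and are not claimed to match the output of
@code{recode}.

@example
```python
def hex_dump(data, per_line=8):
    # Express each byte as a C hexadecimal constant followed by a comma,
    # starting a new line after every per_line values.
    lines = []
    for start in range(0, len(data), per_line):
        chunk = data[start:start + per_line]
        lines.append(", ".join("0x%02X" % byte for byte in chunk) + ",")
    return "\n".join(lines)


def undump(text):
    # Read the constants back, ignoring commas and line breaks.
    tokens = text.replace(",", " ").split()
    return bytes(int(token, 16) for token in tokens)


dumped = hex_dump(b"Hello, recode!")
print(dumped)
print(undump(dumped))
```
@end example

The sketch emits only the constants; the surrounding C declaration would
still have to be added by hand, as explained above.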

However, when dumping two or four byte chunks, the last chunk may be
incomplete. This is observable through the use of a narrower expression
for that last chunk only. Such a shorter chunk would not be compiled
properly within a C initialiser, as all members of an array share a single
type, and so, have identical sizes.

@table @code
@item Octal-1
@tindex Octal-1
@tindex o1
This surface corresponds to an octal expression of each input byte.

It is available in @code{recode} under the name @code{Octal-1},
with @code{o1} and @code{o} as acceptable aliases.

@item Octal-2
@tindex Octal-2
@tindex o2
This surface corresponds to an octal expression of each pair of
input bytes, except for the last pair, which may be short.

It is available in @code{recode} under the name @code{Octal-2}
and has @code{o2} for an alias.

@item Octal-4
@tindex Octal-4
@tindex o4
This surface corresponds to an octal expression of each quadruple of
input bytes, except for the last quadruple, which may be short.

It is available in @code{recode} under the name @code{Octal-4}
and has @code{o4} for an alias.

@item Decimal-1
@tindex Decimal-1
@tindex d1
This surface corresponds to a decimal expression of each input byte.

It is available in @code{recode} under the name @code{Decimal-1},
with @code{d1} and @code{d} as acceptable aliases.

@item Decimal-2
@tindex Decimal-2
@tindex d2
This surface corresponds to a decimal expression of each pair of
input bytes, except for the last pair, which may be short.

It is available in @code{recode} under the name @code{Decimal-2}
and has @code{d2} for an alias.

@item Decimal-4
@tindex Decimal-4
@tindex d4
This surface corresponds to a decimal expression of each quadruple of
input bytes, except for the last quadruple, which may be short.

It is available in @code{recode} under the name @code{Decimal-4}
and has @code{d4} for an alias.

@item Hexadecimal-1
@tindex Hexadecimal-1
@tindex x1
This surface corresponds to a hexadecimal expression of each input byte.

It is available in @code{recode} under the name @code{Hexadecimal-1},
with @code{x1} and @code{x} as acceptable aliases.

@item Hexadecimal-2
@tindex Hexadecimal-2
@tindex x2
This surface corresponds to a hexadecimal expression of each pair of
input bytes, except for the last pair, which may be short.

It is available in @code{recode} under the name @code{Hexadecimal-2}
and has @code{x2} for an alias.

@item Hexadecimal-4
@tindex Hexadecimal-4
@tindex x4
This surface corresponds to a hexadecimal expression of each quadruple of
input bytes, except for the last quadruple, which may be short.

It is available in @code{recode} under the name @code{Hexadecimal-4}
and has @code{x4} for an alias.
@end table

When removing a dump surface, that is, when reading dump results back
into a sequence of bytes, the narrower expression for a short last chunk
is recognised, so dumping is a fully reversible operation. However, in
case you want to produce dumps by other means than through @code{recode},
beware that for decimal dumps, the library has to rely on the number of
spaces to establish the original byte size of the chunk.

Although the library might report reversibility errors, removing a dump
surface is a rather forgiving process: one may mix bases, group a variable
number of data per source line, or use shorter chunks in places other
than at the far end.
Also, source lines not beginning with a number are skipped. So,
@code{recode} should often be able to read a whole C header file, wrapping
the results of a previous dump, and regenerate the original byte string.

@node Test, , Dump, Surfaces
@section Artificial data for testing

A few pseudo-surfaces exist to generate debugging data out of thin air.
These surfaces are only meant for the expert @code{recode} user, and are
only useful in a few contexts, like for generating binary permutations
from the recoding or acting on them.

@cindex debugging surfaces
Debugging surfaces, @emph{when removed}, insert their generated data
at the beginning of the output stream, and copy all the input stream
after the generated data, unchanged. This strange removal constraint
comes from the fact that debugging surfaces are usually specified in the
@emph{before} position instead of the @emph{after} position within a request.
With debugging surfaces, one often recodes file @file{/dev/null} in filter
mode. Specifying many debugging surfaces at once has an accumulation
effect on the output, and since surfaces are removed from right to left,
each generating its data at the beginning of previous output, the net
effect is an @emph{impression} that debugging surfaces are generated from
left to right, each appending to the result of the previous. In any case,
any real input data gets appended after what was generated.

@table @code
@item test7
@tindex test7
When removed, this surface produces 128 single bytes, the first having
value 0, the second having value 1, and so forth until all 128 values have
been generated.

@item test8
@tindex test8
When removed, this surface produces 256 single bytes, the first having
value 0, the second having value 1, and so forth until all 256 values have
been generated.

@item test15
@tindex test15
When removed, this surface produces 64509 double bytes, the first having
value 0, the second having value 1, and so forth until all values have been
generated, but excluding risky @code{UCS-2} values, like all codes from
the surrogate @code{UCS-2} area (for @code{UTF-16}), the byte order mark,
and values known as invalid @code{UCS-2}.

@item test16
@tindex test16
When removed, this surface produces 65536 double bytes, the first having
value 0, the second having value 1, and so forth until all 65536 values
have been generated.
@end table

As an example, the command @samp{recode l5/test8..dump < /dev/null} is a
convoluted way to produce an output similar to @samp{recode -lf l5}. It says
to generate all 256 possible bytes and interpret them as @code{ISO-8859-9}
codes, while converting them to @code{UCS-2}. Resulting @code{UCS-2}
characters are dumped one per line, accompanied by their explanatory names.

@node Internals, Concept Index, Surfaces, Top
@chapter Internal aspects

@cindex @code{recode} internals
@cindex internals
The following explanations of the internals of @code{recode} should
help people who want to dive into @code{recode} sources for adding new
charsets. Adding new charsets does not require much knowledge about
the overall organisation of @code{recode}. You can instead concentrate
on your new charset, letting the remainder of the @code{recode}
mechanics take care of interconnecting it with all other charsets.

If you intend to play seriously at modifying @code{recode}, beware that
you may need some other GNU tools which were not required when you first
installed @code{recode}. If you modify or create any @file{.l} file,
then you need Flex, and some better @code{awk} like @code{mawk},
GNU @code{awk}, or @code{nawk}.
If you modify the documentation (and
you should!), you need @code{makeinfo}. If you are really audacious,
you may also want Perl for modifying tabular processing, then @code{m4},
Autoconf, Automake and @code{libtool} for adjusting configuration matters.

@menu
* Main flow:: Overall organisation
* New charsets:: Adding new charsets
* New surfaces:: Adding new surfaces
* Design:: Comments on the library design
@end menu

@node Main flow, New charsets, Internals, Internals
@section Overall organisation
@cindex @code{recode}, main flow of operation

The @code{recode} mechanics slowly evolved for many years, and it
would be tedious to explain all problems I met and mistakes I made all
along, yielding the current behaviour. Surely, one of the key choices
was to stop trying to do all conversions in memory, one line or one
buffer at a time. It has been fruitful to use the character stream
paradigm, and the elementary recoding steps now convert a whole stream
to another. Most of the control complexity in @code{recode} exists
so that each elementary recoding step stays simple, making it easier
to add new ones. The whole point of @code{recode}, as I see it, is
providing a comfortable nest for growing new charset conversions.

@cindex single step
The main @code{recode} driver constructs, while initialising all
conversion modules, a table giving all the conversion routines
available (@dfn{single step}s) and for each, the starting charset and
the ending charset. If we consider these charsets as being the nodes
of a directed graph, each single step may be considered as an oriented
arc from one node to the other.
A cost is attributed to each arc:
for example, a high penalty is given to single steps which are prone
to losing characters, a lower penalty is given to those which need
studying more than one input character for producing an output
character, etc.

Given a starting code and a goal code, @code{recode} computes the most
economical route through the elementary recodings, that is, the best
sequence of conversions that will transform the input charset into the
final charset. To speed up execution, @code{recode} looks for
subsequences of conversions which are simple enough to be merged, and
then dynamically creates new single steps to represent these mergings.

@cindex double step
A @dfn{double step} in @code{recode} is a special concept representing a
sequence of two single steps, the output of the first single step being the
special charset @code{UCS-2}, the input of the second single step being
also @code{UCS-2}. Special @code{recode} machinery dynamically produces
efficient, reversible, merge-able single steps out of these double steps.

@cindex recoding steps, statistics
@cindex average number of recoding steps
I gathered some statistics about how many internal recoding steps are
required between any two charsets chosen at random. The initial recoding
layout, before optimisation, always uses between 1 and 5 steps.
Optimisation could sometimes produce mere copies, which are counted as no
steps at all. In other cases, optimisation is unable to save any step.
The number of steps after optimisation is currently between 0 and 5 steps.
Of course, the @emph{expected} number of steps is affected by optimisation:
it drops from 2.8 to 1.8. This means that @code{recode} uses a theoretical
average of a bit less than two steps per recoding job. This looks good.
This was computed using reversible recodings.
In strict mode, optimisation might
be defeated somewhat. The number of steps runs between 1 and 6, both before
and after optimisation, and the expected number of steps decreases by a
lesser amount, going from 2.2 to 1.3. This is still manageable.

@node New charsets, New surfaces, Main flow, Internals
@section Adding new charsets
@cindex adding new charsets
@cindex new charsets, how to add

The main part of @code{recode} is written in C, as are most single
steps. A few single steps need to recognise sequences of multiple
characters; these are often better written in Flex. It is easy for a
programmer to add a new charset to @code{recode}. All it requires
is making a few functions kept in a single @file{.c} file,
adjusting @file{Makefile.am} and remaking @code{recode}.

One of the functions should convert from any previous charset to the
new one. Any previous charset will do, but try to select it so you will
not lose too much information while converting. The other function should
convert from the new charset to any older one. You do not have to select
the same old charset as the one you selected for the previous routine.
Once again, select any charset for which you will not lose too much
information while converting.

If, for any of these two functions, you have to read multiple bytes of the
old charset before recognising the character to produce, you might prefer
programming it in Flex in a separate @file{.l} file. Prototype your
C or Flex files after one of those which exist already, so as to keep the
sources uniform. Besides, at @code{make} time, all @file{.l} files are
automatically merged into a single big one by the script @file{mergelex.awk}.
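
As a sketch of the pair of conversions described above, including the case
where several bytes must be read before one character can be produced,
here is a small Python illustration. Everything in it is hypothetical:
the three two-character sequences, the names @code{PAIRS}, @code{to_latin1}
and @code{from_latin1}, and the use of a regular expression as a stand-in
for a Flex scanner. Real @code{recode} modules are written in C or Flex,
so this only shows the logic, not the module interface.

@example
```python
import re

# Hypothetical two-character sequences of the new charset, each paired
# with the Latin-1 character it stands for (assumptions for illustration).
PAIRS = [("e'", "\u00e9"), ("c,", "\u00e7"), ("o^", "\u00f4")]
TABLE = dict(PAIRS)

# One pattern matching any known sequence, playing the role of a scanner.
SCANNER = re.compile("|".join(re.escape(seq) for seq, _ in PAIRS))


def to_latin1(text):
    # New charset to older one: wherever a known sequence is recognised,
    # emit the single character it stands for; copy the rest unchanged.
    return SCANNER.sub(lambda match: TABLE[match.group(0)], text)


def from_latin1(text):
    # Older charset to new one: expand each special character back into
    # its two-character sequence.
    for seq, char in PAIRS:
        text = text.replace(char, seq)
    return text


print(to_latin1("le garc,on le'ger"))
print(from_latin1(to_latin1("le garc,on le'ger")))
```
@end example

Note that such a sequence-based conversion loses information whenever
ordinary text happens to contain one of the sequences, which is the kind
of loss to keep in mind when choosing the charsets to convert from and to.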

There are a few hidden rules about how to write new @code{recode}
modules, for allowing the automatic creation of @file{decsteps.h}
and @file{initsteps.h} at @code{make} time, or the proper merging of
all Flex files. Mimicry is a simple approach which relieves me of
explaining all these rules! Start with a module closely resembling
what you intend to do. Here is some advice for picking a model.
First decide if your new charset module is to be driven by algorithms
rather than by tables. For algorithmic recodings, see @file{iconqnx.c} for
C code, or @file{txtelat1.l} for Flex code. For table driven recodings,
see @file{ebcdic.c} for one-to-one style recodings, @file{lat1html.c}
for one-to-many style recodings, or @file{atarist.c} for double-step
style recodings. Just select an example from the style that best fits
your application.

Each of your source files should have its own initialisation function,
named @code{module_@var{charset}}, which is meant to be executed
@emph{quickly} once, prior to any recoding. It should declare the
name of your charsets and the single steps (or elementary recodings)
you provide, by calling @code{declare_step} one or more times.
Besides the charset names, @code{declare_step} expects a description
of the recoding quality (see @file{recodext.h}) and two functions you
also provide.

The first such function has the purpose of allocating structures,
pre-conditioning conversion tables, etc. It is also the way of further
modifying the @code{STEP} structure. This function is executed if and
only if the single step is retained in an actual recoding sequence.
If you do not need such delayed initialisation, merely use @code{NULL}
for the function argument.

The second function executes the elementary recoding on a whole file.
There are a few cases when you can spare writing this function:

@c FIXME: functions file_one_to_one and file_one_to_many don't exist!
@itemize @bullet
@item
@findex file_one_to_one
Some single steps do nothing else than a pure copy of the input onto the
output; in this case, you can use the predefined function
@code{file_one_to_one}, while having a delayed initialisation for
presetting the @code{STEP} field @code{one_to_one} to the predefined
value @code{one_to_same}.

@item
Some single steps are driven by a table which recodes one character into
another; if the recoding does nothing else, you can use the predefined
function @code{file_one_to_one}, while having a delayed initialisation
for presetting the @code{STEP} field @code{one_to_one} with your table.

@item
@findex file_one_to_many
Some single steps are driven by a table which recodes one character into
a string; if the recoding does nothing else, you can use the predefined
function @code{file_one_to_many}, while having a delayed initialisation
for presetting the @code{STEP} field @code{one_to_many} with your table.
@end itemize

If you have a recoding table handy in a suitable format but do not use
one of the predefined recoding functions, it is still a good idea to use
a delayed initialisation to save it anyway, because @code{recode} option
@samp{-h} will take advantage of this information when available.

Finally, edit @file{Makefile.am} to add the source file name of your routines
to the @code{C_STEPS} or @code{L_STEPS} macro definition, depending on
whether your routines are written in C or in Flex.

@node New surfaces, Design, New charsets, Internals
@section Adding new surfaces
@cindex adding new surfaces
@cindex new surfaces, how to add

Adding a new surface is technically quite similar to adding a new charset.
@xref{New charsets}.
A surface is provided as a set of two transformations:
one from the predefined special charset @code{data} or @code{tree} to the
new surface, meant to apply the surface, the other from the new surface
to the predefined special charset @code{data} or @code{tree}, meant to
remove the surface.

@findex declare_step
Internally in @code{recode}, function @code{declare_step} specifically
recognises when a charset is so related to @code{data} or @code{tree},
and then takes appropriate actions so that charset indeed gets installed
as a surface.

@node Design, , New surfaces, Internals
@section Comments on the library design

@itemize @bullet
@item Why a shared library?
@cindex shared library implementation

There are many different approaches to reduce system requirements to
handle all tables needed in the @code{recode} library. One of them is to
have the tables in an external format and only read them in on demand.
After having pondered this for a while, I finally decided against it,
mainly because it involves its own kind of installation complexity, and
it is not clear to me that it would be as interesting as I first imagined.

It looks more efficient to see all tables and algorithms already mapped
into virtual memory from the start of the execution, yet not loaded in
actual memory, than to go through many disk accesses for opening various
data files once the program is already started, as this would be needed
with other solutions. Using a shared library also has the indirect effect
of making various algorithms handily available, right in the same modules
providing the tables. This greatly alleviates the burden of maintenance.

Of course, I would like to later make an exception for only a few tables,
built locally by users for their own particular needs once @code{recode}
is installed. @code{recode} should just go and fetch them.
But I do not
perceive this as very urgent, yet useful enough to be worth implementing.

Currently, all tables needed for recoding are precompiled into binaries,
and all these binaries are then made into a shared library. As an initial
step, I turned @code{recode} into a main program and a non-shared library;
this allowed me to tidy up the API, get rid of all global variables, etc.
It required a surprising amount of program source massaging. But once
this was cleaned up enough, it was easy to use Gordon Matzigkeit's
@code{libtool} package, and take advantage of the Automake interface to
neatly turn the non-shared library into a shared one.

Sites linking with the @code{recode} library, whose system does not
support any form of shared libraries, might end up with bulky executables.
Surely, the @code{recode} library will have to be used statically, and
might not be very nicely usable on such systems. It seems that progress
has a price for those being slow at it.

There is a locality problem I did not address yet. Currently, the
@code{recode} library takes many cycles to initialise itself, calling
each module in turn for it to set up associated knowledge about charsets,
aliases, elementary steps, recoding weights, etc. @emph{Then}, the
recoding sequence is decided out of the command given. I would not be
surprised if initialisation was taking a perceivable fraction of a second
on slower machines. One thing to do, most probably not right in version
3.5, but the version after, would be to have @code{recode} pre-load all
tables and dump them at installation time. The result would then be
compiled and added to the library.
This would spare many initialisation cycles, but more
importantly, would avoid calling all library modules, scattered through the
virtual memory, and so, possibly causing many spurious page exceptions each
time the initialisation is requested, at least once per program execution.

@item Why not a central charset?

It would be simpler, and I would like it, if something like @w{ISO 10646}
was used as a turning template for all charsets in @code{recode}. Even if
I think it could help to a certain extent, I'm still not fully sure it
would be sufficient in all cases. Moreover, some people disagree about
using @w{ISO 10646} as the central charset, to the point I cannot totally
ignore them, and surely, @code{recode} is not a means for me to force my
own opinions on people. I would like @code{recode} to be practical
more than dogmatic, and reflect usage more than religions.

Currently, if you ask @code{recode} to go from @var{charset1} to
@var{charset2} chosen at random, it is highly probable that the best path
will be quickly found as:

@example
@var{charset1}..@code{UCS-2}..@var{charset2}
@end example

That is, it will almost always use the @code{UCS} as a trampoline between
charsets. However, @code{UCS-2} will immediately be optimised out,
and @var{charset1}..@var{charset2} will often be performed in a single
step through a permutation table generated on the fly for the circumstance
@footnote{If strict mapping is requested, another efficient device will
be used instead of a permutation.}.

In those few cases where @code{UCS-2} is not selected as a conceptual
intermediate, I plan to study if it could be made so. But I guess some cases
will remain where @code{UCS-2} is not a proper choice. Even if @code{UCS} is
often the good choice, I do not intend to forcefully restrain @code{recode}
around @code{UCS-2} (nor @code{UCS-4}) for now.
We might come to that
one day, but it will come out of the natural evolution of @code{recode}.
It will then reflect a fact, rather than a preset dogma.

@item Why not @code{iconv}?

@cindex @code{iconv}
The @code{iconv} routine and library allow for converting characters
from an input buffer to an output buffer, synchronously advancing both
buffer cursors. If the output buffer is not big enough to receive
all of the conversion, the routine returns with the input cursor set at
the position where the conversion could later be resumed, and the output
cursor set to indicate until where the output buffer has been filled.
Although this scheme is simple and nice, the @code{recode} library does
not offer it currently. Why not?

When long sequences of decodings, stepwise recodings, and re-encodings
are involved, as it happens in true life, synchronising the input buffer
back to where it should have stopped, when the output buffer becomes full,
is a difficult problem. Oh, we could make it simpler at the expense of
losing space or speed: by inserting markers between each input character
and counting them at the output end; by processing only one character at a
time through the whole sequence; by repeatedly attempting to recode various
subsets of the input buffer, binary searching on their length until the
output just fits. The overhead of such solutions looks fully prohibitive
to me, and the gain very minimal. I do not see a real advantage, nowadays,
in imposing a fixed length on an output buffer. It makes things so much
simpler and more efficient to just let the output buffer size float a bit.

Of course, if the above problem was solved, the @code{iconv} library
should be easily emulated, given that @code{recode} has similar knowledge
about charsets.
Whether this is solved or not, the @code{iconv}
program remains trivial (given similar knowledge about charsets).
I also presume that the @code{genxlt} program would be easy too, but
I do not have enough detailed specifications of it to be sure.

Many years ago, @code{recode} was using a similar scheme, and I found
it rather hard to manage for some cases. I rethought the overall structure
of @code{recode} for getting away from that scheme, and never regretted it.
I perceive @code{iconv} as an artificial solution which surely has some
elegances and virtues, but I do not find it really useful as it stands: one
always has to wrap @code{iconv} into something more refined, extending it
for real cases. From past experience, I think it is unduly hard to fully
implement this scheme. It would be awkward for us to do contortions for
the sole purpose of implementing exactly its specification, without real,
fairly sound reasons (other than the fact some people once thought it
was worth standardising). It is much better to immediately aim for the
refinement we need, without uselessly forcing us into the dubious detour
@code{iconv} represents.

Some may argue that if @code{recode} was using a comprehensive charset
as a turning template, as discussed in a previous point, this would make
@code{iconv} easier to implement. Some may be tempted to say that the
cases which are hard to handle are not really needed, nor interesting,
anyway. I feel and fear a bit some pressure wanting that @code{recode}
be split into the part that well fits the @code{iconv} model, and the part
that does not fit, considering this second part less important, with the
idea of dropping it one of these days, maybe. My guess is that users of
the @code{recode} library, whatever its form, would not like to have such
arbitrary limitations.
In the long run, we should not have to explain
to our users that some recodings may not be made available just because
they do not fit the simple model we had in mind when we did it. Instead,
we should try to stay open to the difficulties of real life. There are
still many complex needs for Asian people, say, that @code{recode}
does not currently address, while it should. Not only should the doors
stay open, but we should force them wider!
@end itemize

@node Concept Index, Option Index, Internals, Top
@unnumbered Concept Index

@printindex cp

@node Option Index, Library Index, Concept Index, Top
@unnumbered Option Index

This is an alphabetical list of all command-line options accepted by
@code{recode}.

@printindex op

@node Library Index, Charset and Surface Index, Option Index, Top
@unnumbered Library Index

This is an alphabetical index of important functions, data structures,
and variables in the @code{recode} library.

@printindex fn

@node Charset and Surface Index, , Library Index, Top
@unnumbered Charset and Surface Index

This is an alphabetical list of all the charsets and surfaces supported
by @code{recode}, and their aliases.

@printindex tp

@contents
@bye

@c Local Variables:
@c texinfo-column-for-description: 24
@c End: