1\input texinfo        @c -*-texinfo-*-          -*- coding: latin-1 -*-
2@c %**start of header
3@setfilename recode.info
4@settitle The @code{recode} reference manual
5
6@c An index for command-line options
7@defcodeindex op
8@c Put variable and function names together
9@syncodeindex vr fn
10@finalout
11@c %**end of header
12
13@include version.texi
14
15@dircategory Internationalization and character sets
16@direntry
17* recode: (recode).     Conversion between character sets and surfaces.
18@end direntry
19
20@ifinfo
21This file documents the @code{recode} command, which has the purpose of
22converting files between various character sets and surfaces.
23
24Copyright (C) 1990, 93, 94, 96, 97, 98, 99, 00 Free Software Foundation, Inc.
25
26Permission is granted to make and distribute verbatim copies of
27this manual provided the copyright notice and this permission notice
28are preserved on all copies.
29
30@ignore
31Permission is granted to process this file through TeX and print the
32results, provided the printed document carries copying permission
33notice identical to this one except for the removal of this paragraph
34(this paragraph not being relevant to the printed manual).
35
36@end ignore
37Permission is granted to copy and distribute modified versions of this
38manual under the conditions for verbatim copying, provided that the entire
39resulting derived work is distributed under the terms of a permission
40notice identical to this one.
41
42Permission is granted to copy and distribute translations of this manual
43into another language, under the above conditions for modified versions,
44except that this permission notice may be stated in a translation approved
45by the Foundation.
46@end ifinfo
47
48@titlepage
49@title Free recode, version @value{VERSION}
50@subtitle The character set converter
51@subtitle Edition @value{EDITION}, @value{UPDATED}
52@author Fran@,{c}ois Pinard
53
54@page
55@vskip 0pt plus 1filll
56Copyright @copyright{} 1993, 94, 97, 98, 99, 00 Free Software Foundation, Inc.
57
58Permission is granted to make and distribute verbatim copies of
59this manual provided the copyright notice and this permission notice
60are preserved on all copies.
61
62Permission is granted to copy and distribute modified versions of this
63manual under the conditions for verbatim copying, provided that the entire
64resulting derived work is distributed under the terms of a permission
65notice identical to this one.
66
67Permission is granted to copy and distribute translations of this manual
68into another language, under the above conditions for modified versions,
69except that this permission notice may be stated in a translation approved
70by the Foundation.
71@end titlepage
72
73@ifnottex
74@node Top, Tutorial, (dir), (dir)
75@top @code{recode}
76
77@c @item @b{@code{recode}} @value{hfillkludge} (UtilT, SrcCD)
78@c
79This recoding library converts files between various coded character
80sets and surface encodings.  When this cannot be achieved exactly, it
81may get rid of the offending characters or fall back on approximations.
82The library recognises or produces more than 300 different character sets
83and is able to convert files between almost any pair.  Most @w{RFC 1345}
84character sets, and all @code{libiconv} character sets, are supported.
85The @code{recode} program is a handy front-end to the library.
86
87The current @code{recode} release is @value{VERSION}.
88
89@menu
90* Tutorial::            Quick Tutorial
91* Introduction::        Terminology and purpose
92* Invoking recode::     How to use this program
93* Library::             A recoding library
94* Universal::           The universal charset
95* libiconv::            The @code{iconv} library
96* Tabular::             Tabular sources (@w{RFC 1345})
97* ASCII misc::          ASCII and some derivatives
98* IBM and MS::          Some IBM or Microsoft charsets
99* CDC::                 Charsets for CDC machines
100* Micros::              Other micro-computer charsets
101* Miscellaneous::       Various other charsets
102* Surfaces::            All about surfaces
103* Internals::           Internal aspects
104* Concept Index::       Concept Index
105* Option Index::        Option Index
106* Library Index::       Library Index
107* Charset and Surface Index::  Charset and Surface Index
108
109@detailmenu
110 --- The Detailed Node Listing ---
111
112Terminology and purpose
113
114* Charset overview::    Overview of charsets
115* Surface overview::    Overview of surfaces
116* Contributing::        Contributions and bug reports
117
118How to use this program
119
120* Synopsis::            Synopsis of @code{recode} call
121* Requests::            The @var{request} parameter
122* Listings::            Asking for various lists
123* Recoding::            Controlling how files are recoded
124* Reversibility::       Reversibility issues
125* Sequencing::          Selecting sequencing methods
126* Mixed::               Using mixed charset input
127* Emacs::               Using @code{recode} within Emacs
128* Debugging::           Debugging considerations
129
130A recoding library
131
132* Outer level::         Outer level functions
133* Request level::       Request level functions
134* Task level::          Task level functions
135* Charset level::       Charset level functions
136* Errors::              Handling errors
137
138The universal charset
139
140* UCS-2::               Universal Character Set, 2 bytes
141* UCS-4::               Universal Character Set, 4 bytes
142* UTF-7::               Universal Transformation Format, 7 bits
143* UTF-8::               Universal Transformation Format, 8 bits
144* UTF-16::              Universal Transformation Format, 16 bits
145* count-characters::    Frequency count of characters
146* dump-with-names::     Fully interpreted UCS dump
147
148ASCII and some derivatives
149
150* ASCII::               Usual ASCII
151* ISO 8859::            ASCII extended by Latin Alphabets
152* ASCII-BS::            ASCII 7-bits, @kbd{BS} to overstrike
153* flat::                ASCII without diacritics nor underline
154
155Some IBM or Microsoft charsets
156
157* EBCDIC::              EBCDIC codes
158* IBM-PC::              IBM's PC code
159* Icon-QNX::            Unisys' Icon code
160
161Charsets for CDC machines
162
163* Display Code::        Control Data's Display Code
164* CDC-NOS::             ASCII 6/12 from NOS
165* Bang-Bang::           ASCII ``bang bang''
166
167Other micro-computer charsets
168
169* Apple-Mac::           Apple's Macintosh code
170* AtariST::             Atari ST code
171
172Various other charsets
173
174* HTML::                World Wide Web representations
175* LaTeX::               LaTeX macro calls
176* Texinfo::             GNU project documentation files
177* Vietnamese::          Vietnamese charsets
178* African::             African charsets
179* Others::              Cyrillic and other charsets
180* Texte::               Easy French conventions
181* Mule::                Mule as a multiplexed charset
182
183All about surfaces
184
185* Permutations::        Permuting groups of bytes
186* End lines::           Representation for end of lines
187* MIME::                MIME contents encodings
188* Dump::                Interpreted character dumps
189* Test::                Artificial data for testing
190
191Internal aspects
192
193* Main flow::           Overall organisation
194* New charsets::        Adding new charsets
195* New surfaces::        Adding new surfaces
196* Design::              Comments on the library design
197
198@end detailmenu
199@end menu
200
201@end ifnottex
202
203@node Tutorial, Introduction, Top, Top
204@chapter Quick Tutorial
205
206@cindex @code{recode} use, a tutorial
207@cindex tutorial
So, really, you are just in a hurry to use @code{recode}, and do not
feel like studying this manual?  Even reading this paragraph slows you down?
We might have a problem, as you will have to do some guesswork, and might
not become very proficient unless you have very solid intuition@dots{}.
212
213Let me use here, as a quick tutorial, an actual reply of mine to a
214@code{recode} user, who writes:
215
216@quotation
217My situation is this---I occasionally get email with special characters
218in it.  Sometimes this mail is from a user using IBM software and sometimes
219it is a user using Mac software.  I myself am on a SPARC Solaris machine.
220@end quotation
221
222Your situation is similar to mine, except that I @emph{often} receive
223email needing recoding, that is, much more than @emph{occasionally}!
224The usual recodings I do are Mac to @w{Latin-1}, IBM page codes to @w{Latin-1},
225Easy-French to @w{Latin-1}, remove Quoted-Printable, remove Base64.  These are
226so frequent that I made myself a few two-keystroke Emacs commands to filter
227the Emacs region.  This is very convenient for me.  I also resort to many
228other email conversions, yet more rarely than the frequent cases above.
229
230@quotation
231It @emph{seems} like this should be doable using @code{recode}.  However,
232when I try something like @samp{grecode mac macfile.txt} I get nothing
233out---no error, no output, nothing.
234@end quotation
235
236Presuming you are using some recent version of @code{recode}, the command:
237
238@example
239recode mac macfile.txt
240@end example
241
242@noindent
is a request for recoding @file{macfile.txt} over itself, overwriting the
original, from the usual Macintosh character code and Macintosh end of lines,
to @w{Latin-1} and Unix end of lines.  This is overwrite mode, which is why
you saw no output: the result went back into the file.  If you want to use
@code{recode} as a filter, which is probably what you need, do this instead:
247
248@example
249recode mac
250@end example
251
252@noindent
and give your Macintosh file as standard input; you will then get the
@w{Latin-1} file on standard output.  The above command is an abbreviation
for any of:
255
256@example
257recode mac..
258recode mac..l1
259recode mac..Latin-1
260recode mac/CR..Latin-1/
261recode Macintosh..ISO_8859-1
262recode Macintosh/CR..ISO_8859-1/
263@end example
264
265That is, a @code{CR} surface, encoding newlines with ASCII @key{CR}, is
266first to be removed (this is a default surface for @samp{mac}), then the
267Macintosh charset is converted to @w{Latin-1} and no surface is added to the
268result (there is no default surface for @samp{l1}).  If you want @samp{mac}
269code converted, but you know that newlines are already coded the Unix way,
270just do:
271
272@example
273recode mac/
274@end example
275
276@noindent
277the slash then overriding the default surface with empty, that is, none.
278Here are other easy recipes:
279
280@example
281recode pc          to filter IBM-PC code and CR-LF (default) to Latin-1
282recode pc/         to filter IBM-PC code to Latin-1
283recode 850         to filter code page 850 and CR-LF (default) to Latin-1
284recode 850/        to filter code page 850 to Latin-1
285recode /qp         to remove quoted printable
286@end example
287
288The last one is indeed equivalent to any of:
289
290@example
291recode /qp..
292recode l1/qp..l1/
293recode ISO_8859-1/Quoted-Printable..ISO_8859-1/
294@end example
295
296Here are some reverse recipes:
297
298@example
299recode ..mac       to filter Latin-1 to Macintosh code and CR (default)
300recode ..mac/      to filter Latin-1 to Macintosh code
301recode ..pc        to filter Latin-1 to IBM-PC code and CR-LF (default)
302recode ..pc/       to filter Latin-1 to IBM-PC code
303recode ..850       to filter Latin-1 to code page 850 and CR-LF (default)
304recode ..850/      to filter Latin-1 to code page 850
305recode ../qp       to force quoted printable
306@end example
307
308In all the above calls, replace @samp{recode} by @samp{recode -f} if you
309want to proceed despite recoding errors.  If you do not use @samp{-f}
and there is an error, the recoding output will be interrupted after the
first error in filter mode, or the file will not be replaced by a recoded copy
312in overwrite mode.
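
As a small illustration of both modes with @samp{-f}, here is a sketch
that builds a throwaway input first; the file names @file{sample.mac}
and @file{sample.txt} are only assumptions made for this example:

@example
# Create a two-line sample using Macintosh (CR) end of lines.
printf 'one\rtwo\r' >sample.mac
# Filter mode: the input file is left untouched.
recode -f mac..l1 <sample.mac >sample.txt
# Overwrite mode: sample.mac itself is replaced by its recoded copy.
recode -f mac..l1 sample.mac
@end example

@noindent
After these commands, both files should hold the same recoded text.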
313
314You may use @samp{recode -l} to get a list of available charsets and
315surfaces, and @samp{recode --help} to get a quick summary of options.
The output of these commands is meant for those who have already read this
manual, so let me dare a suggestion: why not find a few more minutes in
your schedule to peek further down, right into the following chapters?
319
320@node Introduction, Invoking recode, Tutorial, Top
321@chapter Terminology and purpose
322
A few terms are used over and over in this manual; our wise reader will
learn their meaning right away.  Both ISO (International Organization for
Standardization) and IETF (Internet Engineering Task Force) have their
own terminology.  This document does not try to stick strictly to either
one, yet it does not want to add more confusion to the field.
On the other hand, it would not be efficient to use paraphrases all the time,
so @code{recode} coins a few short words, which are explained below.
330
331@cindex charset, what it is
332A @dfn{charset}, in the context of @code{recode}, is a particular association
333between computer codes on one side, and a repertoire of intended characters
334on the other side.  Codes are usually taken from a set of consecutive
335small integers, starting at 0.  Some characters have a graphical appearance
336(glyph) or displayable effect, others have special uses like, for example,
337to control devices or to interact with neighbouring codes to specify them
338more precisely.  So, a @emph{charset} is roughly one of those tables,
339giving a meaning to each of the codes from the set of allowable values.
340MIME also uses the term charset with approximately the same meaning.
It does @emph{not} exactly correspond to what ISO calls a @dfn{coded
character set}, that is, a set of characters with an encoding for them.
A coded character set does not necessarily use all available code positions,
344while a MIME charset usually tries to specify them all.  A MIME charset
345might be the union of a few disjoint coded character sets.
346
347@cindex surface, what it is
A @dfn{surface} is a term used in @code{recode} only, and is short for
surface transformation of a charset stream.  This is any kind of mapping,
usually reversible, which associates physical bits in some medium with
a stream of characters taken from one or more charsets (usually one).
A surface is a kind of varnish added over a charset so it fits in actual
bits and bytes.  How end of lines are exactly encoded is not really
pertinent to the charset, and so, there is a surface for end of lines.
@code{Base64} is also a surface, as we may encode any charset in it.
Other examples would be @code{DES} enciphering, or @code{gzip} compression
(even if @code{recode} does not offer them currently): these are ways to give
a real life to theoretical charsets.  The @dfn{trivial} surface consists
in putting characters into fixed width little chunks of bits, usually
eight such bits per character.  But things are not always that simple.
361
This @code{recode} library, and the program by that name, have the purpose
of converting files between various charsets and surfaces.  When this
cannot be done in exact ways, as is often the case, the program may
get rid of the offending characters or fall back on approximations.
This library recognises or produces around 175 such charsets under 500
names, and handles a dozen surfaces.  Since it can convert each charset to
almost any other one, many thousands of different conversions are possible.
369
370The @code{recode} program and library do not usually know how to split and
371sort out textual and non-textual information which may be mixed in a single
372input file.  For example, there is no surface which currently addresses the
373problem of how lines are blocked into physical records, when the blocking
374information is added as binary markers or counters within files.  So,
375@code{recode} should be given textual streams which are rather @emph{pure}.
376
This tool pays special attention to superimposition of diacritics for
some French representations.  This orientation is mostly historical; it
does not impair the usefulness, generality or extensibility of the program.
380@samp{recode} is both a French and English word.  For those who pay attention
381to those things, the proper pronunciation is French (that is, @samp{racud},
382with @samp{a} like in @samp{above}, and @samp{u} like in @samp{cut}).
383
The program @code{recode} has been written by Fran@,{c}ois Pinard.
Over time, it came to reuse work from other contributors, notably
that of Keld Simonsen and Bruno Haible.
387
388@menu
389* Charset overview::    Overview of charsets
390* Surface overview::    Overview of surfaces
391* Contributing::        Contributions and bug reports
392@end menu
393
394@node Charset overview, Surface overview, Introduction, Introduction
395@section Overview of charsets
396
397@cindex charsets, overview
Recoding is currently possible between many charsets, the bulk of which are
described by @w{RFC 1345} tables or available in the @code{iconv} library.
400@xref{Tabular}, and @pxref{libiconv}.  The @code{recode} library also
401handles some charsets in some specialised ways.  These are:
402
403@itemize @bullet
404@item
4056-bit charsets based on CDC display code: 6/12 code from NOS; bang-bang
406code from Universit@'e de Montr@'eal;
407
408@item
4097-bit ASCII: without any diacritics, or else: using backspace for
410overstriking; Unisys' Icon convention; @TeX{}/La@TeX{} coding; easy
411French conventions for electronic mail;
412
413@item
4148-bit extensions to ASCII: ISO @w{Latin-1}, Atari ST code, IBM's code for
415the PC, Apple's code for the Macintosh;
416
417@item
4188-bit non-ASCII codes: three flavours of EBCDIC;
419
420@item
42116-bit or 31-bit universal characters, and their transfer encodings.
422@end itemize
423
424The introduction of @w{RFC 1345} in @code{recode} has brought with it a few
425charsets having the functionality of older ones, but yet being different
in subtle ways.  The effects have not been fully investigated yet, so for
now, clashes are avoided by keeping the old and new charsets well separate.
428
429@cindex unavailable conversions
430@cindex conversions, unavailable
431@cindex impossible conversions
432@cindex unreachable charsets
433@cindex exceptions to available conversions
434@cindex pseudo-charsets
435@tindex flat@r{, not as before charset}
436@tindex count-characters@r{, not as before charset}
437@tindex dump-with-names@r{, not as before charset}
438@tindex data@r{, not with charsets}
439@tindex libiconv@r{, not in requests}
440Conversion is possible between almost any pair of charsets.  Here is a
441list of the exceptions.  One may not recode @emph{from} the @code{flat},
442@code{count-characters} or @code{dump-with-names} charsets, nor @emph{from}
443or @emph{to} the @code{data}, @code{tree} or @code{:libiconv:} charsets.
Also, if we except the @code{data} and @code{tree} pseudo-charsets, charsets
and surfaces live in disjoint recoding spaces: one cannot really transform
a surface into a charset or vice-versa, as surfaces are only meant to be
applied over charsets, or removed from them.
448
449@node Surface overview, Contributing, Charset overview, Introduction
450@section Overview of surfaces
451
452@cindex surfaces, overview
For various practical considerations, it sometimes happens that the codes
making up a text, written in a particular charset, cannot simply be put
out in a file one after another without creating problems or breaking
other things.  Sometimes, 8-bit codes cannot be written on a 7-bit medium,
variable length codes need some kind of envelope, newlines require special
treatment, etc.  We sometimes have to apply @dfn{surfaces} to a stream
of codes; surfaces are tricks of a sort, used to fit the charset into
those practical constraints.  Moreover, similar surfaces or tricks may
be useful for many unrelated charsets, and many surfaces can be used at
once over a single charset.
463
464@cindex pure charset
465@cindex charset, pure
466So, @code{recode} has machinery to describe a combination of a charset with
467surfaces used over it in a file.  We would use the expression @dfn{pure
468charset} for referring to a charset free of any surface, that is, the
469conceptual association between integer codes and character intents.
470
471It is not always clear if some transformation will yield a charset or a
472surface, especially for those transformations which are only meaningful
over a single charset.  The @code{recode} library is not overly picky about
identifying surfaces as such: when it is practical to consider a specialised
surface as if it were a charset, this is preferred, and done.
476
477@node Contributing,  , Surface overview, Introduction
478@section Contributions and bug reports
479
480@cindex contributing charsets
Even though I am the @code{recode} author and current maintainer, I am no
specialist in charset standards.  I only made @code{recode} over the
years to meet my own needs, but felt it was applicable to the needs
of others.  Some FSF people liked the program structure and suggested
making it more widely available.  I often rely on suggestions from
@code{recode} users to decide what is best to be done next.
487
Properly protecting @code{recode} against possible copyright fights is a
pain for me and for contributors, but we cannot avoid addressing the issue
in the long run.  Besides, the Free Software Foundation, which mandates
the GNU project, is very sensitive to this matter.  GNU standards suggest
that we stay cautious before looking at copyrighted code.  The safest and
simplest way for me is to gather ideas and reprogram them anew, even if
this might slow me down considerably.  For contributions going beyond a
few lines of code here and there, the FSF definitely requires employer
disclaimers and copyright assignments in writing.
497
498When you contribute something to @code{recode}, @emph{please} explain what
499it is about.  Do not take for granted that I know those charsets which
500are familiar to you.  Once again, I'm no expert, and you have to help me.
501Your explanations could well find their way into this documentation, too.
502Also, for contributing new charsets or new surfaces, as much as possible,
503please provide good, solid, verifiable references for the tables you
used@footnote{I'm not inclined to accept a charset you just invented,
and which nobody uses yet: convince your friends and community first!}.
506
Many users have contributed to @code{recode} already, and I am grateful to
them for their interest and involvement.  Some suggestions can be integrated
quickly while some others have to be delayed: I have to draw a line somewhere
when the time comes to make a new release, about what will go in it and what
will go in the next.
512
513@cindex bug reports, where to send
514@cindex reporting bugs
515Please send suggestions, documentation errors and bug reports to
516@email{recode-bugs@@iro.umontreal.ca} or, if you prefer, directly to
517@email{pinard@@iro.umontreal.ca}, Fran@,{c}ois Pinard.  Do not be afraid
518to report details, because this program is the mere aggregation of
519hundreds of details.
520
521@node Invoking recode, Library, Introduction, Top
522@chapter How to use this program
523
524With the synopsis of the @code{recode} call, we stress the difference
525between using this program as a file filter, or recoding many files
526at once.  The first parameter of any call states the recoding request,
527and this deserves a section on its own.  Options are then presented,
528but somewhat grouped according to the related functionalities they
529control.
530
531@menu
532* Synopsis::            Synopsis of @code{recode} call
533* Requests::            The @var{request} parameter
534* Listings::            Asking for various lists
535* Recoding::            Controlling how files are recoded
536* Reversibility::       Reversibility issues
537* Sequencing::          Selecting sequencing methods
538* Mixed::               Using mixed charset input
539* Emacs::               Using @code{recode} within Emacs
540* Debugging::           Debugging considerations
541@end menu
542
543@node Synopsis, Requests, Invoking recode, Invoking recode
544@section Synopsis of @code{recode} call
545
546@cindex @code{recode}, synopsis of invocation
547@cindex invocation of @code{recode}, synopsis
548The general format of the program call is one of:
549
550@example
551recode [@var{option}]@dots{} [@var{charset} | @var{request} [@var{file}]@dots{} ]
552@end example
553
554Some calls are used only to obtain lists produced by @code{recode} itself,
555without actually recoding any file.  They are recognised through the
556usage of listing options, and these options decide what meaning should
557be given to an optional @var{charset} parameter.  @xref{Listings}.
558
559In other calls, the first parameter (@var{request}) always explains which
560transformations are expected on the files.  There are many variations to
561the aspect of this parameter.  We will discuss more complex situations
562later (@pxref{Requests}), but for many simple cases, this parameter
merely looks like this@footnote{In previous versions of @code{recode}, a single
564colon @samp{:} was used instead of the two dots @samp{..} for separating
565charsets, but this was creating problems because colons are allowed in
566official charset names.  The old request syntax is still recognised for
567compatibility purposes, but is deprecated.}:
568
569@example
570@var{before}..@var{after}
571@end example
572
573@noindent
where @var{before} and @var{after} each give the name of a charset.  Each
@var{file} will be read assuming it is coded with charset @var{before}, and
will be recoded over itself so as to use the charset @var{after}.  If there
is no @var{file} on the @code{recode} command, the program instead acts
as a Unix filter and transforms standard input onto standard output.
579@cindex filter operation
580@cindex @code{recode}, operation as filter
581
582The capability of recoding many files at once is very convenient.
583For example, one could easily prepare a distribution from @w{Latin-1} to MSDOS,
584this way:
585
586@example
587mkdir package
588cp -p Makefile *.[ch] package
589recode Latin-1..MSDOS package/*
590zoo ah package.zoo package/*
591rm -rf package
592@end example
593
594@noindent
595(In this example, the non-mandatory @samp{-p} option to @code{cp} is for
596preserving timestamps, and the @code{zoo} program is an archiver from
597Rahul Dhesi which once was quite popular.)
598
599The filter operation is especially useful when the input files should
not be altered.  Let us take an example to illustrate this point.
601Suppose that someone has a file named @file{datum.txt}, which is almost
602a @TeX{} file, except that diacriticised characters are written using
603@w{Latin-1}.  To complete the recoding of the diacriticised characters
604@emph{only} and produce a file @file{datum.tex}, without destroying
605the original, one could do:
606
607@example
608cp -p datum.txt datum.tex
609recode -d l1..tex datum.tex
610@end example
611
612However, using @code{recode} as a filter will achieve the same goal more
613neatly:
614
615@example
616recode -d l1..tex <datum.txt >datum.tex
617@end example
618
619This example also shows that @code{l1} could be used instead of
620@code{Latin-1}; charset names often have such aliases.
621
622@node Requests, Listings, Synopsis, Invoking recode
623@section The @var{request} parameter
624
625In the case where the @var{request} is merely written as
626@var{before}..@var{after}, then @var{before} and @var{after} specify the
627start charset and the goal charset for the recoding.
628
629@cindex charset names, valid characters
630@cindex valid characters in charset names
For @code{recode}, charset names may contain any character, except a
comma, a forward slash, or two periods in a row.  But in practice, charset
names are currently limited to alphabetic letters (upper or lower case),
digits, hyphens, underscores, periods, colons or round parentheses.
635
636@cindex request, syntax
637@cindex @code{recode} request syntax
638The complete syntax for a valid @var{request} allows for unusual
639things, which might surprise at first.  (Do not pay too much attention
640to these facilities on first reading.)  For example, @var{request}
641may also contain intermediate charsets, like in the following example:
642
643@example
644@var{before}..@var{interim1}..@var{interim2}..@var{after}
645@end example
646
647@noindent
648@cindex intermediate charsets
649@cindex chaining of charsets in a request
650@cindex charsets, chaining in a request
651meaning that @code{recode} should internally produce the @var{interim1}
652charset from the start charset, then work out of this @var{interim1}
653charset to internally produce @var{interim2}, and from there towards the
654goal charset.  In fact, @code{recode} internally combines recipes and
655automatically uses interim charsets, when there is no direct recipe for
656transforming @var{before} into @var{after}.  But there might be many ways
657to do it.  When many routes are possible, the above @dfn{chaining} syntax
658may be used to more precisely force the program towards a particular route,
659which it might not have naturally selected otherwise.  On the other hand,
660because @code{recode} tries to choose good routes, chaining is only needed
661to achieve some rare, unusual effects.
662
663Moreover, many such requests (sub-requests, more precisely) may be
664separated with commas (but no spaces at all), indicating a sequence
665of recodings, where the output of one has to serve as the input of the
666following one.  For example, the two following requests are equivalent:
667
668@example
669@var{before}..@var{interim1}..@var{interim2}..@var{after}
670@var{before}..@var{interim1},@var{interim1}..@var{interim2},@var{interim2}..@var{after}
671@end example
672
673@noindent
674In this example, the charset input for any recoding sub-request is identical
675to the charset output by the preceding sub-request.  But it does not have
to be so in the general case.  One might wonder what it would mean
to declare the charset input for a recoding sub-request to be of a
different nature than the charset output by the preceding sub-request, when
recodings are chained in this way.  Such strange usage might have a
meaning and be useful for the @code{recode} expert, but it is quite
uncommon in practice.
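
To make the equivalence concrete, here is a sketch using charset aliases
met in the tutorial (@samp{mac}, @samp{l1} and @samp{pc}); the file names
are hypothetical, and the intermediate @samp{l1} is spelled out only for
the sake of illustration:

@example
# Create a two-line sample using Macintosh (CR) end of lines.
printf 'one\rtwo\r' >sample.mac
# Chained form, going through Latin-1 explicitly.
recode mac..l1..pc <sample.mac >chained.out
# The same route written as two comma-separated sub-requests.
recode mac..l1,l1..pc <sample.mac >sequenced.out
# Both outputs are expected to be identical.
cmp chained.out sequenced.out
@end example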
682
683@cindex surfaces, syntax
684More useful is the distinction between the concept of charset, and
685the concept of surfaces.  An encoded charset is represented by:
686
687@example
688@var{pure-charset}/@var{surface1}/@var{surface2}@dots{}
689@end example
690
691@noindent
692@cindex surfaces, commutativity
693@cindex commutativity of surfaces
using slashes to introduce surfaces, if any.  The order of application
of surfaces is usually important; they cannot be freely commuted.  In the
696given example, @var{surface1} is first applied over the @var{pure-charset},
697then @var{surface2} is applied over the result.  Given this request:
698
699@example
700@var{before}/@var{surface1}/@var{surface2}..@var{after}/@var{surface3}
701@end example
702
703@noindent
704the @code{recode} program will understand that the input files should
705have @var{surface2} removed first (because it was applied last), then
706@var{surface1} should be removed.  The next step will be to translate the
707codes from charset @var{before} to charset @var{after}, prior to applying
708@var{surface3} over the result.
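
For instance, assuming the surface names @code{CR-LF} and @code{Base64}
mentioned elsewhere in this manual are accepted as written, a round trip
through two stacked surfaces could be sketched as follows (the file names
are hypothetical):

@example
printf 'one\ntwo\n' >plain.txt
# Apply CR-LF first, then Base64 over the result.
recode l1/..l1/CR-LF/Base64 <plain.txt >encoded.txt
# Remove them in reverse order: Base64 first, then CR-LF.
recode l1/CR-LF/Base64..l1/ <encoded.txt >decoded.txt
# The decoded text should match the original.
cmp plain.txt decoded.txt
@end example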
709
710@cindex implied surfaces
711@cindex surfaces, implied
712@tindex IBM-PC charset, and CR-LF surface
713Some charsets have one or more @emph{implied} surfaces.  In this case, the
714implied surfaces are automatically handled merely by naming the charset,
715without any explicit surface to qualify it.  Let's take an example to
716illustrate this feature.  The request @samp{pc..l1} will indeed decode MS-DOS
717end of lines prior to converting IBM-PC codes to @w{Latin-1}, because @samp{pc}
718is the name of a charset@footnote{More precisely, @code{pc} is an alias for
719the charset @code{IBM-PC}.} which has @code{CR-LF} for its usual surface.
720The request @samp{pc/..l1} will @emph{not} decode end of lines, since
721the slash introduces surfaces, and even if the surface list is empty, it
722effectively defeats the automatic removal of surfaces for this charset.
723So, empty surfaces are useful, indeed!
724
725@cindex aliases
726@cindex alternate names for charsets and surfaces
727@cindex charsets, aliases
728@cindex surfaces, aliases
729Both charsets and surfaces may have predefined alternate names, or aliases.
730However, and this is rather important to understand, implied surfaces
are attached to individual aliases rather than to genuine charsets.
732Consequently, the official charset name and all of its aliases do not
733necessarily share the same implied surfaces.  The charset and all its
734aliases may each have its own different set of implied surfaces.
735
736@cindex abbreviated names for charsets and surfaces
737@cindex names of charsets and surfaces, abbreviation
738Charset names, surface names, or their aliases may always be abbreviated
739to any unambiguous prefix.  Internally in @code{recode}, disambiguating
740tables are kept separate for charset names and surface names.
741
742@cindex letter case, in charset and surface names
743While recognising a charset name or a surface name (or aliases thereof),
744@code{recode} ignores all characters besides letters and digits, so for
745example, the hyphens and underlines being part of an official charset
746name may safely be omitted (no need to un-confuse them!).  There is also
747no distinction between upper and lower case for charset or surface names.
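
The matching rules of the last two paragraphs can be sketched in Python as
follows.  The helper names @code{squeeze} and @code{resolve} and the short
list of names are hypothetical, and preferring an exact match over a prefix
match is an assumption of this sketch rather than documented behaviour.

@example
def squeeze(name):
    """Keep only letters and digits, folded to lower case."""
    return "".join(ch for ch in name.lower() if ch.isalnum())

def resolve(abbreviation, known_names):
    """Return the one name matching the abbreviation, else None."""
    key = squeeze(abbreviation)
    exact = [name for name in known_names if squeeze(name) == key]
    if exact:
        return exact[0]
    matches = [name for name in known_names
               if squeeze(name).startswith(key)]
    # A prefix is only accepted when it is unambiguous.
    return matches[0] if len(matches) == 1 else None

CHARSETS = ["ISO-8859-1", "ISO-8859-2", "IBM-PC", "IBM850"]
print(resolve("iso_8859_1", CHARSETS))
print(resolve("ibmp", CHARSETS))
print(resolve("ibm", CHARSETS))
@end example

@noindent
The first two calls find @code{ISO-8859-1} and @code{IBM-PC}; the third
yields @code{None}, since @samp{ibm} is a prefix of two names in the list.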
748
749One of the @var{before} or @var{after} keywords may be omitted.  If the
750double dot separator is omitted too, then the charset is interpreted as
751the @var{before} charset.@footnote{Both @var{before} and @var{after} may
752be omitted, in which case the double dot separator is mandatory.  This is
753not very useful, as the recoding reduces to a mere copy in that case.}
754
755@cindex default charset
756@cindex charset, default
757@vindex DEFAULT_CHARSET
758When a charset name is omitted or left empty, the value of the
759@code{DEFAULT_CHARSET} variable in the environment is used instead.  If this
760variable is not defined, the @code{recode} library uses the current locale's
encoding.  On POSIX compliant systems, this depends on the first non-empty
value among the environment variables @code{LC_ALL}, @code{LC_CTYPE} and
@code{LANG}, and can be determined through the command @samp{locale charmap}.
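
The lookup order just described can be sketched in Python.  The function
name @code{default_charset_source} is made up for this illustration, and the
sketch only reports which variable decides; it does not compute the charset
itself, which is the job of @samp{locale charmap}.

@example
def default_charset_source(environ):
    """Name the environment variable deciding the default charset."""
    if "DEFAULT_CHARSET" in environ:
        return "DEFAULT_CHARSET"
    # Otherwise the locale decides, through the first non-empty value.
    for variable in ("LC_ALL", "LC_CTYPE", "LANG"):
        if environ.get(variable):
            return variable
    return None

print(default_charset_source(dict(LC_ALL="", LANG="fr_FR.ISO-8859-1")))
print(default_charset_source(dict(DEFAULT_CHARSET="l1", LANG="C")))
@end example

@noindent
The first call reports @code{LANG}, because @code{LC_ALL} is empty and
@code{LC_CTYPE} is absent; the second reports @code{DEFAULT_CHARSET}.
Passing @code{os.environ} instead of a hand-made dictionary inspects the
real environment.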
764
765If the charset name is omitted but followed by surfaces, the surfaces
766then qualify the usual or default charset.  For example, the request
@samp{../x} is sufficient for applying a hexadecimal surface to the input
768text@footnote{MS-DOS is one of those systems for which the default charset
769has implied surfaces, @code{CR-LF} here.  Such surfaces are automatically
770removed or applied whenever the default charset is read or written,
771exactly as it would go for any other charset.  In the example above, on
772such systems, the hexadecimal surface would then @emph{replace} the implied
surfaces.  To @emph{add} a hexadecimal surface without removing any,
774one should write the request as @samp{/../x}.}.
775
776The allowable values for @var{before} or @var{after} charsets, and various
777surfaces, are described in the remainder of this document.
778
779@node Listings, Recoding, Requests, Invoking recode
780@section Asking for various lists
781
Many options control listing output generated by @code{recode} itself;
783they are not meant to accompany actual file recodings.  These options are:
784
785@table @samp
786
787@item --version
788@opindex --version
789@cindex @code{recode} version, printing
790The program merely prints its version numbers on standard output, and
791exits without doing anything else.
792
793@item --help
794@opindex --help
795@cindex help page, printing
796The program merely prints a page of help on standard output, and exits
797without doing any recoding.
798
799@item -C
800@itemx --copyright
801@opindex -C
802@opindex --copyright
803@cindex copyright conditions, printing
804Given this option, all other parameters and options are ignored.  The
program briefly prints the copyright and copying conditions.  See the
file @file{COPYING} in the distribution for the full statement of the
copyright and copying conditions.
808
809@item -h[@var{language}/][@var{name}]
810@itemx --header[=[@var{language}/][@var{name}]]
811@opindex -h
812@opindex --header
813@cindex source file generation
814@cindex programming language support
815@cindex languages, programming
816@cindex supported programming languages
817Instead of recoding files, @code{recode} writes a @var{language} source
818file on standard output and exits.  This source is meant to be included
819in a regular program written in the same programming @var{language}:
820its purpose is to declare and initialise an array, named @var{name},
821which represents the requested recoding.  The only acceptable values for
@var{language} are @samp{c} or @samp{perl}, and may be abbreviated.
823If @var{language} is not specified, @samp{c} is assumed.  If @var{name}
824is not specified, then it defaults to @samp{@var{before}_@var{after}}.
825Strings @var{before} and @var{after} are cleaned before being used according
826to the syntax of @var{language}.
827
Even if @code{recode} tries its best, this option does not always succeed in
producing the requested source table.  It does succeed, however, provided
the recoding can be internally represented by only one step after the
optimisation phase, and if this merged step conveys a one-to-one or a
one-to-many explicit table.  Also, when attempting to produce source tables,
@code{recode} relaxes its checking a tiny bit: it ignores the algorithmic
part of some tabular recodings, and it also avoids the processing of implied
surfaces.  But this is all fairly technical.  Better try and see!
836
Beware that other options might affect the produced source tables:
@samp{-d}, @samp{-g} and, particularly, @samp{-s}.
839
840@item -k @var{pairs}
841@itemx --known=@var{pairs}
842@opindex -k
843@opindex --known=
844@cindex unknown charsets
845@cindex guessing charsets
846@cindex charsets, guessing
This particular option is meant to help identify an unknown charset,
using as hints some already identified characters of the charset.  Some
examples will help introduce the idea.
850
851Let's presume here that @code{recode} is run in an ISO-8859-1 locale, and
852that @code{DEFAULT_CHARSET} is unset in the environment.
853Suppose you have guessed that code 130 (decimal) of the unknown charset
854represents a lower case @samp{e} with an acute accent.  That is to say
855that this code should map to code 233 (decimal) in the usual charset.
856By executing:
857
858@example
859recode -k 130:233
860@end example
861
862@noindent
863you should obtain a listing similar to:
864
865@example
866AtariST atarist
867CWI cphu cwi cwi2
868IBM437 437 cp437 ibm437
869IBM850 850 cp850 ibm850
870IBM851 851 cp851 ibm851
871IBM852 852 cp852 ibm852
872IBM857 857 cp857 ibm857
873IBM860 860 cp860 ibm860
874IBM861 861 cp861 cpis ibm861
875IBM863 863 cp863 ibm863
876IBM865 865 cp865 ibm865
877@end example
878
879You can give more than one clue at once, to restrict the list further.
880Suppose you have @emph{also} guessed that code 211 of the unknown
881charset represents an upper case @samp{E} with diaeresis, that is, code
882203 in the usual charset.  By requesting:
883
884@example
885recode -k 130:233,211:203
886@end example
887
888@noindent
889you should obtain:
890
891@example
892IBM850 850 cp850 ibm850
893IBM852 852 cp852 ibm852
894IBM857 857 cp857 ibm857
895@end example
896
897The usual charset may be overridden by specifying one non-option argument.
898For example, to request the list of charsets for which code 130 maps to
899code 142 for the Macintosh, you may ask:
900
901@example
902recode -k 130:142 mac
903@end example
904
905@noindent
906and get:
907
908@example
909AtariST atarist
910CWI cphu cwi cwi2
911IBM437 437 cp437 ibm437
912IBM850 850 cp850 ibm850
913IBM851 851 cp851 ibm851
914IBM852 852 cp852 ibm852
915IBM857 857 cp857 ibm857
916IBM860 860 cp860 ibm860
917IBM861 861 cp861 cpis ibm861
918IBM863 863 cp863 ibm863
919IBM865 865 cp865 ibm865
920@end example
921
922@noindent
923which, of course, is identical to the result of the first example, since
924the code 142 for the Macintosh is a small @samp{e} with acute.
925
926More formally, option @samp{-k} lists all possible @emph{before}
927charsets for the @emph{after} charset given as the sole non-option
928argument to @code{recode}, but subject to restrictions given in
929@var{pairs}.  If there is no non-option argument, the @emph{after}
930charset is taken to be the default charset for this @code{recode}.
931
932The restrictions are given as a comma separated list of pairs, each pair
933consisting of two numbers separated by a colon.  The numbers are taken
934as decimal when the initial digit is between @samp{1} and @samp{9};
@samp{0x} starts a hexadecimal number, or else @samp{0} starts an
936octal number.  The first number is a code in any @emph{before} charset,
937while the second number is a code in the specified @emph{after} charset.
938If the first number would not be transformed into the second number by
939recoding from some @emph{before} charset to the @emph{after} charset,
940then this @emph{before} charset is rejected.  A @emph{before} charset is
941listed only if it is not rejected by any pair.  The program will only test
942those @emph{before} charsets having a tabular style internal description
(@pxref{Tabular}); the selected @emph{after} charset should have one too.
944
945The produced list is in fact a subset of the list produced by the
946option @samp{-l}.  As for option @samp{-l}, the non-option argument
is interpreted as a charset name, possibly abbreviated to any unambiguous
prefix.
949
950@item -l[@var{format}]
951@itemx --list[=@var{format}]
952@opindex -l
953@opindex --list
954@cindex listing charsets
955@cindex information about charsets
956This option asks for information about all charsets, or about one
957particular charset.  No file will be recoded.
958
If there are no non-option arguments, @code{recode} ignores the @var{format}
value of the option and writes a sorted list of charset names on standard
output, one per line.  When a charset name has aliases or synonyms,
962they follow the true charset name on its line, sorted from left to right.
963Each charset or alias is followed by its implied surfaces, if any.  This list
964is over two hundred lines.  It is best used with @samp{grep -i}, as in:
965
966@example
967recode -l | grep -i greek
968@end example
969
970There might be one non-option argument, in which case it is interpreted
as a charset name, possibly abbreviated to any unambiguous prefix.
This particular usage of the @samp{-l} option is obeyed @emph{only} for
charsets having a tabular style internal description (@pxref{Tabular}).
Even if most charsets have this property, some do not, and the option
@samp{-l} cannot be used to detail these particular charsets.  To find out
whether a particular charset can be listed this way, merely try
and see if this works.  The @var{format} value of the option is a keyword
978from the following list.  Keywords may be abbreviated by dropping suffix
979letters, and even reduced to the first letter only:
980
981@table @samp
982@item decimal
983This format asks for the production on standard output of a concise
984tabular display of the charset, in which character code values are
985expressed in decimal.
986
987@item octal
988This format uses octal instead of decimal in the concise tabular display
989of the charset.
990
991@item hexadecimal
992This format uses hexadecimal instead of decimal in the concise tabular
993display of the charset.
994
995@item full
996This format requests an extensive display of the charset on standard output,
997using one line per character showing its decimal, hexadecimal, octal and
998@code{UCS-2} code values, and also a descriptive comment which should be
999the 10646 name for the character.
1000
1001@vindex LANGUAGE@r{, when listing charsets}
1002@vindex LANG@r{, when listing charsets}
1003@cindex French description of charsets
1004The descriptive comment is given in English and ASCII, yet if the English
1005description is not available but a French one is, then the French description
1006is given instead, using @w{Latin-1}.  However, if the @code{LANGUAGE}
1007or @code{LANG} environment variable begins with the letters @samp{fr},
1008then listing preference goes to French when both descriptions are available.
1009@end table
1010
1011When option @samp{-l} is used together with a @var{charset} argument,
1012the @var{format} defaults to @code{decimal}.
1013
1014@item -T
1015@itemx --find-subsets
1016@opindex -T
1017@opindex --find-subsets
1018@cindex identifying subsets in charsets
1019@cindex subsets in charsets
1020This option is a maintainer tool for evaluating the redundancy of those
charsets, in @code{recode}, which are internally represented by a @code{UCS-2}
1022data table.  After the listing has been produced, the program exits
1023without doing any recoding.  The output is meant to be sorted, like
1024this: @w{@samp{recode -T | sort}}.  The option triggers @code{recode} into
1025comparing all pairs of charsets, seeking those which are subsets of others.
1026The concept and results are better explained through a few examples.
1027Consider these three sample lines from @samp{-T} output:
1028
1029@example
1030[  0] IBM891 == IBM903
1031[  1] IBM1004 < CP1252
1032[ 12] INVARIANT < CSA_Z243.4-1985-1
1033@end example
1034
1035@noindent
1036The first line means that @code{IBM891} and @code{IBM903} are completely
1037identical as far as @code{recode} is concerned, so one is fully redundant
1038to the other.  The second line says that @code{IBM1004} is wholly
1039contained within @code{CP1252}, yet there is a single character which is
1040in @code{CP1252} without being in @code{IBM1004}.  The third line says
1041that @code{INVARIANT} is wholly contained within @code{CSA_Z243.4-1985-1},
1042but twelve characters are in @code{CSA_Z243.4-1985-1} without being in
@code{INVARIANT}.  The whole output could most probably be reduced and
made more significant through a transitivity study.
1045@end table
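
Looking back at option @samp{-k}, its pair syntax and its rejection rule can
be sketched in Python.  This is a reading of the description given for that
option, not code from @code{recode}: the function names and the three tiny
tables are made up, and real tables cover whole charsets.

@example
def parse_code(text):
    """Parse a code: 0x is hexadecimal, a leading 0 octal."""
    if text.startswith("0x"):
        return int(text[2:], 16)
    if text.startswith("0") and len(text) > 1:
        return int(text[1:], 8)
    return int(text, 10)

def parse_pairs(pairs):
    """Turn '130:233,0xD3:0313' into a list of code tuples."""
    result = []
    for pair in pairs.split(","):
        left, right = pair.split(":")
        result.append((parse_code(left), parse_code(right)))
    return result

def candidates(pairs, tables):
    """Keep the before charsets rejected by none of the pairs."""
    wanted = parse_pairs(pairs)
    return sorted(name for name, table in tables.items()
                  if all(table.get(b) == a for b, a in wanted))

# Made-up partial tables, from a before code to an after code.
TABLES = dict(
    toy_a=dict([(130, 233), (211, 203)]),
    toy_b=dict([(130, 233), (211, 212)]),
    toy_c=dict([(130, 130)]),
)
print(candidates("130:233", TABLES))
print(candidates("130:233,0xD3:0313", TABLES))
@end example

@noindent
The first call keeps @code{toy_a} and @code{toy_b}.  The second call writes
211 and 203 in hexadecimal and octal, and keeps only @code{toy_a}.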
1046
1047@node Recoding, Reversibility, Listings, Invoking recode
1048@section Controlling how files are recoded
1049
The following options give the user some fine-grained control over the
recoding operations themselves.
1052
1053@table @samp
1054
1055@item -c
1056@itemx --colons
1057@opindex -c
1058@opindex --colons
1059@cindex diaeresis
With @code{Texte} Easy French conventions, use the colon @kbd{:}
1061instead of the double-quote @kbd{"} for marking diaeresis.
1062@xref{Texte}.
1063
1064@item -g
1065@itemx --graphics
1066@opindex -g
1067@opindex --graphics
1068@cindex IBM graphics characters
1069@cindex box-drawing characters
1070This option is only meaningful while getting @emph{out} of the
1071@code{IBM-PC} charset.  In this charset, characters 176 to 223 are used
1072for constructing rulers and boxes, using simple or double horizontal or
1073vertical lines.  This option forces the automatic selection of ASCII
characters for approximating these rulers and boxes, at the cost of making
1075the transformation irreversible.  Option @samp{-g} implies @samp{-f}.
1076
1077@item -t
1078@itemx --touch
1079@opindex -t
1080@opindex --touch
1081@cindex time stamps of files
1082@cindex file time stamps
1083The @emph{touch} option is meaningful only when files are recoded over
1084themselves.  Without it, the time-stamps associated with files are
1085preserved, to reflect the fact that changing the code of a file does not
1086really alter its informational contents.  When the user wants the
1087recoded files to be time-stamped at the recoding time, this option
1088inhibits the automatic protection of the time-stamps.
1089
1090@item -v
1091@itemx --verbose
1092@opindex -v
1093@opindex --verbose
1094@cindex verbose operation
1095@cindex details about recoding
1096@cindex recoding details
1097@cindex quality of recoding
1098Before doing any recoding, the program will first print on the @code{stderr}
1099stream the list of all intermediate charsets planned for recoding, starting
1100with the @var{before} charset and ending with the @var{after} charset.
It also prints an indication of the recoding quality, as one of the words
1102@samp{reversible}, @samp{one to one}, @samp{one to many}, @samp{many to
1103one} or @samp{many to many}.
1104
1105This information will appear once or twice.  It is shown a second time
1106only when the optimisation and step merging phase succeeds in replacing
1107many single steps by a new one.
1108
1109This option also has a second effect.  The program will print on
1110@code{stderr} one message per recoded @var{file}, so as to keep the user
1111informed of the progress of its command.
1112
An easy way to know beforehand the sequence or quality of a recoding is
to use a command such as:
1115
1116@example
1117recode -v @var{before}..@var{after} < /dev/null
1118@end example
1119
1120@noindent
1121using the fact that, in @code{recode}, an empty input file produces
1122an empty output file.
1123
1124@item -x @var{charset}
1125@itemx --ignore=@var{charset}
1126@opindex -x
1127@opindex --ignore
1128@cindex ignore charsets
1129@cindex recoding path, rejection
1130This option tells the program to ignore any recoding path through the
1131specified @var{charset}, so disabling any single step using this charset
1132as a start or end point.  This may be used when the user wants to force
1133@code{recode} into using an alternate recoding path (yet using chained
1134requests offers a finer control, @pxref{Requests}).
1135
1136@var{charset} may be abbreviated to any unambiguous prefix.
1137@end table
1138
1139@node Reversibility, Sequencing, Recoding, Invoking recode
1140@section Reversibility issues
1141
1142The following options are somewhat related to reversibility issues:
1143
1144@table @samp
1145@item -f
1146@itemx --force
1147@opindex -f
1148@opindex --force
1149@cindex force recoding
1150@cindex irreversible recoding
With this option, irreversible or otherwise erroneous recodings are run
to completion, and @code{recode} does not exit with a non-zero status
merely because of irreversibility.  @xref{Reversibility}.
1154
1155Without this option, @code{recode} tries to protect you against recoding
1156a file irreversibly over itself@footnote{There are still some cases of
1157ambiguous output which are rather difficult to detect, and for which
1158the protection is not active.}.  Whenever an irreversible recoding is
1159met, or any other recoding error, @code{recode} produces a warning on
1160standard error.  The current input file does not get replaced by its
1161recoded version, and @code{recode} then proceeds with the recoding of
1162the next file.
1163
1164When the program is merely used as a filter, standard output will have
1165received a partially recoded copy of standard input, up to the first
1166error point.  After all recodings have been done or attempted, and if
1167some recoding has been aborted, @code{recode} exits with a non-zero status.
1168
1169In releases of @code{recode} prior to version 3.5, this option was always
1170selected, so it was rather meaningless.  Nevertheless, users were invited
1171to start using @samp{-f} right away in scripts calling @code{recode}
1172whenever convenient, in preparation for the current behaviour.
1173
1174@item -q
1175@itemx --quiet
1176@itemx --silent
1177@opindex -q
1178@opindex --quiet
1179@opindex --silent
1180@cindex suppressing diagnostic messages
1181@cindex error messages, suppressing
1182@cindex silent operation
1183This option has the sole purpose of inhibiting warning messages about
irreversible recodings, and other such diagnostics.  It has no other
effect; in particular, it does @emph{not} prevent recodings from being
aborted, nor @code{recode} from returning a non-zero exit status, when
irreversible recodings are met.

This option is set automatically for the child processes, when
@code{recode} splits itself into many collaborating copies.  Doing so, the
diagnostic is issued only once by the parent.  See option @samp{-p}.
1192
1193@item -s
1194@itemx --strict
1195@opindex -s
1196@opindex --strict
1197@cindex strict operation
1198@cindex map filling, disable
1199@cindex disable map filling
1200By using this option, the user requests that @code{recode} be very strict
1201while recoding a file, merely losing in the transformation any character
1202which is not explicitly mapped from a charset to another.  Such a loss is
1203not reversible and so, will bring @code{recode} to fail, unless the option
1204@samp{-f} is also given as a kind of counter-measure.
1205
Using @samp{-s} without @samp{-f} might render the @code{recode} program
very susceptible to the slightest file abnormalities.  Despite the fact
that it might be irritating to some users, such paranoia is sometimes
wanted and useful.
1210@end table
1211
1212@cindex reversibility of recoding
1213Even if @code{recode} tries hard to keep the recodings reversible,
1214you should not develop an unconditional confidence in its ability to
1215do so.  You @emph{ought} to keep only reasonable expectations about
1216reverse recodings.  In particular, consider:
1217
1218@itemize @bullet
1219@item
1220Most transformations are fully reversible for all inputs, but lose this
1221property whenever @samp{-s} is specified.
1222
1223@item
1224A few transformations are not meant to be reversible, by design.
1225
1226@item
1227Reversibility sometimes depends on actual file contents and cannot
1228be ascertained beforehand, without reading the file.
1229
1230@item
1231Reversibility is never absolute across successive versions of this
1232program.  Even correcting a small bug in a mapping could induce slight
1233discrepancies later.
1234
1235@item
1236Reversibility is easily lost by merging.  This is best explained through
1237an example.  If you reversibly recode a file from charset @var{A} to
1238charset @var{B}, then you reversibly recode the result from charset
1239@var{B} to charset @var{C}, you cannot expect to recover the original
1240file by merely recoding from charset @var{C} directly to charset @var{A}.
1241You will instead have to recode from charset @var{C} back to charset
1242@var{B}, and only then from charset @var{B} to charset @var{A}.
1243
1244@item
1245Faulty files create a particular problem.  Consider an example, recoding
1246from @code{IBM-PC} to @code{Latin-1}.  End of lines are represented as
1247@samp{\r\n} in @code{IBM-PC} and as @samp{\n} in @code{Latin-1}.  There
1248is no way by which a faulty @code{IBM-PC} file containing a @samp{\n}
not preceded by @samp{\r} can be translated into a @code{Latin-1} file, and
1250then back.
1251
1252@item
1253There is another difficulty arising from code equivalences.  For
1254example, in a @code{LaTeX} charset file, the string @samp{\^\i@{@}}
1255could be recoded back and forth through another charset and become
1256@samp{\^@{\i@}}.  Even if the resulting file is equivalent to the
1257original one, it is not identical.
1258@end itemize
1259
1260@cindex map filling
1261Unless option @samp{-s} is used, @code{recode} automatically tries to
1262fill mappings with invented correspondences, often making them fully
1263reversible.  This filling is not made at random.  The algorithm tries to
1264stick to the identity mapping and, when this is not possible, it prefers
1265generating many small permutation cycles, each involving only a few
1266codes.
1267
1268For example, here is how @code{IBM-PC} code 186 gets translated to
1269@kbd{control-U} in @code{Latin-1}.  @kbd{Control-U} is 21.  Code 21 is the
1270@code{IBM-PC} section sign, which is 167 in @code{Latin-1}.  @code{recode}
1271cannot reciprocate 167 to 21, because 167 is the masculine ordinal indicator
1272within @code{IBM-PC}, which is 186 in @code{Latin-1}.  Code 186 within
1273@code{IBM-PC} has no @code{Latin-1} equivalent; by assigning it back to 21,
1274@code{recode} closes this short permutation loop.
1275
1276As a consequence of this map filling, @code{recode} may sometimes produce
@emph{funny} characters.  They may look annoying, but they are nevertheless
helpful when you change your mind and want to revert to the prior
1279recoding.  If you cannot stand these, use option @samp{-s}, which asks
1280for a very strict recoding.
1281
1282This map filling sometimes has a few surprising consequences, which
1283some users wrongly interpreted as bugs.  Here are two examples.
1284
1285@enumerate
1286@item
In some cases, @code{recode} seems to copy a file without recoding it.
But in fact, the file does get recoded.  Consider this request:
1289
1290@example
1291recode l1..us < File-Latin1 > File-ASCII
1292cmp File-Latin1 File-ASCII
1293@end example
1294
1295@noindent
1296then @code{cmp} will not report any difference.  This is quite normal.
@w{@code{Latin-1}} gets correctly recoded to ASCII for the characters both
charsets have in common (which are the first 128 characters, in this case).
The remaining last 128 @w{@code{Latin-1}} characters have no ASCII
correspondent.  Instead of losing them, @code{recode} elects to map them to
unspecified characters of ASCII, so making the recoding reversible.  The
simplest way of achieving this is merely to keep those last 128 characters
unchanged.  The overall effect is copying the file verbatim.
1305
1306If you feel this behaviour is too generous and if you do not wish to
1307care about reversibility, simply use option @samp{-s}.  By doing so,
@code{recode} will strictly map only those @w{@code{Latin-1}} characters
which have an ASCII equivalent, and will merely drop those which do not.  Then,
1311there is more chance that you will observe a difference between the
1312input and the output file.
1313
1314@item
1315Recoding the wrong way could sometimes give the false impression that
1316recoding has @emph{almost} been done properly.  Consider the requests:
1317
1318@example
1319recode 437..l1 < File-Latin1 > Temp1
1320recode 437..l1 < Temp1 > Temp2
1321@end example
1322
1323@noindent
1324so declaring wrongly @file{File-Latin1} to be an IBM-PC file, and
recoding to @code{Latin-1}.  This is surely ill-defined and not meaningful.
1326Yet, if you repeat this step a second time, you might notice that
1327many (not all) characters in @file{Temp2} are identical to those in
1328@file{File-Latin1}.  Sometimes, people try to discover how @code{recode}
1329works by experimenting a little at random, rather than reading and
1330understanding the documentation; results such as this are surely confusing,
1331as they provide those people with a false feeling that they understood
1332something.
1333
Reversible codings have the property that, if applied several times
in the same direction, they will eventually bring any character back
to its original value.  Since @code{recode} seeks small permutation
cycles when creating reversible codings, besides characters unchanged
by the recoding, most permutation cycles will be of length 2, and
fewer of length 3, etc.  So, it is only to be expected that applying the
recoding twice in the same direction will recover most characters,
but will fail to recover those participating in permutation cycles of
length 3.  On the other hand, recoding six times in the same direction
would recover all characters in cycles of length 1, 2, 3 or 6.
1344@end enumerate
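
The effect of repeating a reversible recoding can be checked on a toy
mapping, sketched here in Python.  The three-code cycle is the 21, 167, 186
loop described earlier in this section; the two-code cycle between 200 and
201 is invented for the illustration, as is the function name.

@example
def apply_times(mapping, code, times):
    """Recode one code several times in the same direction."""
    for _ in range(times):
        # Codes absent from the mapping are left unchanged.
        code = mapping.get(code, code)
    return code

# Toy reversible mapping: one cycle of length 3, one of length 2.
MAPPING = dict([(21, 167), (167, 186), (186, 21),
                (200, 201), (201, 200)])

for code in (65, 200, 21):
    print(code,
          apply_times(MAPPING, code, 2),
          apply_times(MAPPING, code, 6))
@end example

@noindent
Each output line shows a code, then its value after two passes, then its
value after six passes.  Codes 65 and 200 are already back after two passes,
whereas code 21 is at 186 after two passes and only returns to 21 after
six (in fact, after any multiple of three).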
1345
1346@node Sequencing, Mixed, Reversibility, Invoking recode
1347@section Selecting sequencing methods
1348
1349@cindex sequencing
1350This program uses a few techniques when it is discovered that many
1351passes are needed to comply with the @var{request}.  For example,
1352suppose that four elementary steps were selected at recoding path
1353optimisation time.  Then @code{recode} will split itself into four
1354different interconnected tasks, logically equivalent to:
1355
1356@example
1357@var{step1} <@var{input} | @var{step2} | @var{step3} | @var{step4} >@var{output}
1358@end example
1359
1360The splitting into subtasks is often done using Unix pipes.
1361But the splitting may also be completely avoided, and rather
simulated by using memory buffers, or intermediate files.  The various
@samp{--sequence=@var{strategy}} options give you control over the flow
1364methods, by replacing @var{strategy} with @samp{memory}, @samp{pipe}
1365or @samp{files}.  So, these options may be used to override the default
1366behaviour, which is also explained below.
1367
1368@table @samp
1369@item --sequence=memory
1370@opindex --sequence
1371@cindex memory sequencing
1372When the recoding requires a combination of two or more elementary
1373recoding steps, this option forces many passes over the data, using
1374in-memory buffers to hold all intermediary results.
1375@c This should be the default behaviour when
1376@c files to be recoded are @emph{small} enough.
1377
1378@item -i
1379@itemx --sequence=files
1380@opindex -i
1381@cindex file sequencing
1382When the recoding requires a combination of two or more elementary
1383recoding steps, this option forces many passes over the data, using
1384intermediate files between passes.  This is the default behaviour when
1385files are recoded over themselves.  If this option is selected in filter
1386mode, that is, when the program reads standard input and writes standard
1387output, it might take longer for programs further down the pipe chain to
1388start receiving some recoded data.
1389
1390@item -p
1391@itemx --sequence=pipe
1392@opindex -p
1393@cindex pipe sequencing
1394When the recoding requires a combination of two or more elementary
1395recoding steps, this option forces the program to fork itself into a few
1396copies interconnected with pipes, using the @code{pipe(2)} system call.
1397All copies of the program operate in parallel.  This is the default
1398behaviour in filter mode.  If this option is used when files are recoded
1399over themselves, this should also save disk space because some temporary
1400files might not be needed, at the cost of more system overhead.
1401
1402If, at installation time, the @code{pipe(2)} call is said to be
1403unavailable, selecting option @samp{-p} is equivalent to selecting
1404option @samp{-i}.  (This happens, for example, on MS-DOS systems.)
1405@end table
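
The difference between whole-text passes and a chain of steps can be
sketched in Python.  This is a simplified model under strong assumptions:
every step is a plain byte-for-byte table, the function names are made up,
and the streamed variant runs in a single process, whereas @samp{-p} uses
separate processes working in parallel.

@example
def make_step(pairs):
    """Build a byte-to-byte recoding step from (old, new) pairs."""
    table = bytearray(range(256))
    for old, new in pairs:
        table[old] = new
    return bytes(table)

def run_in_memory(steps, data):
    """One full pass per step, holding the whole text in memory."""
    for table in steps:
        data = data.translate(table)
    return data

def run_streamed(steps, chunks):
    """Push each chunk through every step, as a pipe chain would."""
    for chunk in chunks:
        for table in steps:
            chunk = chunk.translate(table)
        yield chunk

# Two toy steps: a becomes b, then b becomes c.
STEPS = [make_step([(97, 98)]), make_step([(98, 99)])]
print(run_in_memory(STEPS, b"abba"))
print(b"".join(run_streamed(STEPS, [b"ab", b"ba"])))
@end example

@noindent
Both calls produce the same bytes.  The streamed variant can hand over its
first chunk before the rest of the input has been read, which is why a pipe
lets programs further down the chain start earlier than intermediate files
do.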
1406
1407@node Mixed, Emacs, Sequencing, Invoking recode
1408@section Using mixed charset input
1409
1410In real life and practice, textual files are often made up of many charsets
1411at once.  Some parts of the file encode one charset, while other parts
1412encode another charset, and so forth.  Usually, a file does not toggle
1413between more than two or three charsets.  The means to distinguish
1414which charsets are encoded at various places is not always available.
1415The @code{recode} program is able to handle only a few simple cases
1416of mixed input.
1417
1418The default @code{recode} behaviour is to expect pure charset files, to
1419be recoded as other pure charset files.  However, the following options
1420allow for a few precise kinds of mixed charset files.
1421
1422@ignore
1423Some notes on transliteration and substitution.
1424
1425Transliteration is still much study, discussion and work to come, but
1426when generic transliteration will be added in @code{recode}, it will be
1427added @emph{through} the @code{recode} library.
1428
1429However, I agree that it might be *convenient* that the `latin1..fi'
1430conversion works by letting all ASCII characters through, but then, the
1431result would be a mix of ASCII and `fi', it would not be pure `fi' anymore.
1432It would be convenient because, in practice, people might write programs in
1433ASCII, keeping comments or strings directly in `fi', all in the same file.
1434The original files are indeed mixed, and people sometimes expect that
1435`recode' will do mixed conversions.
1436
1437A conversion does not become *right* because it is altered to be more
1438convenient.  And recode is not *wrong* because it does not offer some
1439conveniences people would like to have.  As long as `recode' main job is
1440producing `fi', than '[' is just not representable in `fi', and recode is
1441rather right in not letting `[' through.  It has to do something special
1442about it.  The character might be thrown away, transliterated or replaced
1443by a substitute, or mapped to some other code for reversibility purposes.
1444
1445Transliteration or substitution are currently not implemented in `recode',
1446yet for the last few years, I've been saving documentation about these
1447phenomena.  The transliteration which you are asking for, here, is that the
1448'[' character in @w{Latin-1}, for example, be transliterated to A-umlaut in
1449`fi', which is a bit non-meaningful.  Remember, there is no `[' in `fi'.
1450@end ignore
1451
1452@table @samp
1453@item -d
1454@itemx --diacritics
1455@opindex -d
1456@opindex --diacritics
1457@cindex convert a subset of characters
1458@cindex partial conversion
1459While converting to or from one of @code{HTML} or @code{LaTeX}
1460charset, limit conversion to some subset of all characters.
1461For @code{HTML}, limit conversion to the subset of all non-ASCII
1462characters.  For @code{LaTeX}, limit conversion to the subset of all
1463non-English letters.  This is particularly useful, for example, when
1464people create what would be valid @code{HTML}, @TeX{} or La@TeX{}
1465files, if only they were using provided sequences for applying
1466diacritics instead of using the diacriticised characters directly
1467from the underlying character set.
1468
1469While converting to @code{HTML} or @code{LaTeX} charset, this option
assumes that characters not in the said subset are properly coded
or protected already; @code{recode} then transmits them literally.
1472While converting the other way, this option prevents translating back
1473coded or protected versions of characters not in the said subset.
1474@xref{HTML}.  @xref{LaTeX}.
1475
1476@ignore
1477@item -M
1478@itemx --message
1479@opindex -M
1480@opindex --message
1481Option @samp{-M} would be for messages, it would ideally process @w{RFC
14821522} inserts
1483in ASCII headers, converting them to the goal code, rewriting some MIME
1484header line too, and stopping its special work at the first empty line.
1485A special combination of both capabilities would be for the recoding of
1486PO files, in which the header, and @code{msgid} and @code{msgstr} strings, might
1487all use different charsets.  Recoding some PO files currently looks like
1488a nightmare, which I would like @code{recode} to repair.
1489@end ignore
1490
1491@item -S[@var{language}]
1492@itemx --source[=@var{language}]
1493@opindex -S
1494@opindex --source
1495@cindex convert strings and comments
1496@cindex string and comments conversion
1497The bulk of the input file is expected to be written in @code{ASCII},
1498except for parts, like comments and string constants, which are written
1499using another charset than @code{ASCII}.  When @var{language} is @samp{c},
1500the recoding will proceed only with the contents of comments or strings,
1501while everything else will be copied without recoding.  When @var{language}
1502is @samp{po}, the recoding will proceed only within translator comments
1503(those having whitespace immediately following the initial @samp{#})
1504and with the contents of @code{msgstr} strings.
1505
1506For the above things to work, the non-@code{ASCII} encoding of the comment
1507or string should be such that an @code{ASCII} scan will successfully find
1508where the comment or string ends.
1509
1510Even if @code{ASCII} is the usual charset for writing programs, some
1511compilers are able to directly read other charsets, like @code{UTF-8}, say.
1512There is currently no provision in @code{recode} for reading mixed charset
1513sources which are not based on @code{ASCII}.  It is probable that the need
1514for mixed recoding is not as pressing in such cases.
1515
1516For example, after one does:
1517
1518@example
1519recode -Spo pc/..u8 < @var{input}.po > @var{output}.po
1520@end example
1521
1522@noindent
1523file @file{@var{output}.po} holds a copy of @file{@var{input}.po} in which
1524@emph{only} translator comments and the contents of @code{msgstr} strings
1525have been recoded from the @code{IBM-PC} charset to pure @code{UTF-8},
1526without attempting conversion of end-of-lines.  Machine generated comments
1527and original @code{msgid} strings are not to be touched by this recoding.
1528
1529If @var{language} is not specified, @samp{c} is assumed.
1530@end table
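
The requirement that an @code{ASCII} scan finds where comments and
strings end may be pictured with the following C sketch.  It is only an
illustration under simplifying assumptions, not the scanner used by
@code{recode}: the names @code{recode_c_fragments} and
@code{ibmpc_e_acute_to_latin1} are hypothetical, character constants and
@samp{//} comments are not handled, and the recoding step is reduced to
a single table entry.

```c
#include <stddef.h>

typedef unsigned char (*byte_step) (unsigned char);

/* Hypothetical one-entry step: IBM-PC e-acute (0x82) becomes Latin-1
   e-acute (0xE9); every other byte is passed through unchanged.  */
unsigned char
ibmpc_e_acute_to_latin1 (unsigned char c)
{
  return c == 0x82 ? 0xE9 : c;
}

/* Copy the NUL-terminated INPUT to OUTPUT, which must have room for as
   many bytes as INPUT including its NUL, applying STEP only to bytes
   found inside C comments and string constants.  */
void
recode_c_fragments (const char *input, char *output, byte_step step)
{
  enum { CODE, COMMENT, STRING } state = CODE;
  size_t i = 0, o = 0;

  while (input[i] != '\0')
    {
      char c = input[i];
      char next = input[i + 1];

      if (state == CODE)
        {
          if (c == '/' && next == '*')
            {
              output[o++] = input[i++];   /* the '*' is copied below */
              state = COMMENT;
            }
          else if (c == '"')
            state = STRING;
          output[o++] = input[i++];
        }
      else if (state == COMMENT)
        {
          if (c == '*' && next == '/')
            {
              output[o++] = input[i++];
              output[o++] = input[i++];
              state = CODE;
            }
          else
            output[o++] = (char) step ((unsigned char) input[i++]);
        }
      else                                /* inside a string constant */
        {
          if (c == '\\' && next != '\0')
            {
              /* Escape sequences are copied without recoding.  */
              output[o++] = input[i++];
              output[o++] = input[i++];
            }
          else if (c == '"')
            {
              output[o++] = input[i++];
              state = CODE;
            }
          else
            output[o++] = (char) step ((unsigned char) input[i++]);
        }
    }
  output[o] = '\0';
}
```

The delimiters themselves are all @code{ASCII}, which is why the scan
works as long as the other charset never produces bytes that look like
a closing quote or a closing comment mark.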
1531
1532@node Emacs, Debugging, Mixed, Invoking recode
1533@section Using @code{recode} within Emacs
1534
The fact that @code{recode} is a filter makes it quite easy to use from
1536within GNU Emacs.  For example, recoding the whole buffer from
1537the @code{IBM-PC} charset to current charset (@w{@code{Latin-1}} on
1538Unix) is easily done with:
1539
1540@example
1541C-x h C-u M-| recode ibmpc RET
1542@end example
1543
1544@noindent
1545@samp{C-x h} selects the whole buffer, and @samp{C-u M-|} filters and
1546replaces the current region through the given shell command.  Here is
1547another example, binding the keys @w{@samp{C-c T}} to the recoding of
1548the current region from Easy French to @w{@code{Latin-1}} (on Unix) and the key
1549@w{@samp{C-u C-c T}} from @w{@code{Latin-1}} (on Unix) to Easy French:
1550
1551@example
1552(global-set-key "\C-cT" 'recode-texte)
1553
1554(defun recode-texte (flag)
1555  (interactive "P")
1556  (shell-command-on-region
1557   (region-beginning) (region-end)
1558   (concat "recode " (if flag "..txte" "txte")) t)
1559  (exchange-point-and-mark))
1560@end example
1561
1562@node Debugging,  , Emacs, Invoking recode
1563@section Debugging considerations
1564
1565It is our experience that when @code{recode} does not provide satisfying
1566results, either @code{recode} was not called properly, correct results
1567raised some doubts nevertheless, or files to recode were somewhat mangled.
1568Genuine bugs are surely possible.
1569
1570Unless you already are a @code{recode} expert, it might be a good idea to
1571quickly revisit the tutorial (@pxref{Tutorial}) or the prior sections in this
1572chapter, to make sure that you properly formatted your recoding request.
1573In the case you intended to use @code{recode} as a filter, make sure that you
1574did not forget to redirect your standard input (through using the @kbd{<}
symbol in the shell, say).  Some @code{recode} false mysteries are also
easily explained (@pxref{Reversibility}).
1577
1578For the other cases, some investigation is needed.  To illustrate how to
1579proceed, let's presume that you want to recode the @file{nicepage} file,
1580coded @code{UTF-8}, into @code{HTML}.  The problem is that the command
1581@samp{recode u8..h nicepage} yields:
1582
1583@example
1584recode: Invalid input in step `UTF-8..ISO-10646-UCS-2'
1585@end example
1586
One good trick is to use @code{recode} in filter mode instead of in file
replacement mode (@pxref{Synopsis}).  Another good trick is to use the
1589@samp{-v} option asking for a verbose description of the recoding steps.
1590We could rewrite our recoding call as @samp{recode -v u8..h <nicepage},
1591to get something like:
1592
1593@example
1594Request: UTF-8..:libiconv:..ISO-10646-UCS-2..HTML_4.0
1595Shrunk to: UTF-8..ISO-10646-UCS-2..HTML_4.0
1596[@dots{}@var{some output}@dots{}]
1597recode: Invalid input in step `UTF-8..ISO-10646-UCS-2'
1598@end example
1599
This might help you to better understand what the diagnostic means.  The
recoding request is achieved in two steps: the first recodes @code{UTF-8}
into @code{UCS-2}, the second recodes @code{UCS-2} into @code{HTML}.
The problem occurs within the first of these two steps, and since the
input of this step is the input file given to @code{recode}, it is
this overall input file which seems to be invalid.  Also, when used in
filter mode, @code{recode} processes as much input as possible before the
error occurs and sends the result of this processing to standard output.
Since the standard output has not been redirected to a file, it is merely
displayed on the user screen.  By inspecting near the end of the resulting
@code{HTML} output, that is, what was recoded a bit before the recoding
was interrupted, you may infer where the error lies in the real
@code{UTF-8} input file.
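
When the output is too long to inspect by eye, a small independent
checker may help to locate the offending byte.  The following C sketch
is such a checker, written under stated assumptions: the function name
@code{first_invalid_utf8} is hypothetical, the code is not part of
@code{recode}, and it only checks the lead byte and the continuation
bytes of each sequence, so the position it reports need not match
exactly where @code{recode} itself gives up.

```c
#include <stddef.h>

/* Return the offset of the first byte which does not start a
   structurally valid UTF-8 sequence, or -1 when none is found.
   Overlong three-byte and four-byte forms and encoded surrogates
   are not rejected in this sketch.  */
long
first_invalid_utf8 (const unsigned char *data, size_t length)
{
  size_t i = 0;

  while (i < length)
    {
      unsigned char lead = data[i];
      size_t extra;

      if (lead < 0x80)
        extra = 0;                      /* plain ASCII */
      else if (lead >= 0xC2 && lead <= 0xDF)
        extra = 1;
      else if (lead >= 0xE0 && lead <= 0xEF)
        extra = 2;
      else if (lead >= 0xF0 && lead <= 0xF4)
        extra = 3;
      else
        return (long) i;                /* stray continuation or bad lead */

      if (extra > length - i - 1)
        return (long) i;                /* sequence cut short by end of data */

      for (size_t k = 1; k <= extra; k++)
        if ((data[i + k] & 0xC0) != 0x80)
          return (long) i;              /* missing continuation byte */

      i += extra + 1;
    }
  return -1;
}
```

A typical culprit is a stray @w{@code{Latin-1}} byte, such as 0xE9 for
an e-acute, sitting in a file which is otherwise @code{UTF-8}.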
1613
1614If you have the proper tools to examine the intermediate recoding data,
1615you might also prefer to reduce the problem to a single step to better
1616study it.  This is what I usually do.  For example, the last @code{recode}
1617call above is more or less equivalent to:
1618
1619@example
1620recode -v UTF-8..ISO_10646-UCS-2 <nicepage >temporary
1621recode -v ISO_10646-UCS-2..HTML_4.0 <temporary
1622rm temporary
1623@end example
1624
1625If you know that the problem is within the first step, you might prefer to
1626concentrate on using the first @code{recode} line.  If you know that the
1627problem is within the second step, you might execute the first @code{recode}
1628line once and for all, and then play with the second @code{recode} call,
1629repeatedly using the @file{temporary} file created once by the first call.
1630
Note that the @samp{-f} switch may be used to force the production of
@code{HTML} output despite invalid input; this might be satisfying enough
for you, and easier than repairing the input file.  That depends on how
strict you would like to be about the precision of the recoding process.
1635
If you later see that your HTML file begins with @samp{&lt;html&gt;} when
you expected @samp{<html>}, then @code{recode} might have done a bit more
than you wanted.  In this case, your input file was half-@code{UTF-8},
half-@code{HTML} already, that is, a mixed file (@pxref{Mixed}).  There is a
special @code{-d} switch for this case.  So, you might end up calling
@samp{recode -fd u8..h nicepage}.  Until you are quite sure that you accept
overwriting your input file whatever happens, I recommend that you stick with
filter mode.
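
The difference between a full conversion and a diacritics-only
conversion may be pictured with this C sketch.  It rests on several
assumptions made for brevity: the function @code{latin1_to_html} is
hypothetical, the input is taken to be @w{@code{Latin-1}} rather than
@code{UTF-8}, and numeric character references are produced where
@code{recode} may well choose other forms.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Convert the NUL-terminated Latin-1 INPUT to HTML in OUTPUT, which
   holds CAPACITY bytes.  When DIACRITICS_ONLY is true, the ASCII
   characters '<', '>' and '&' are assumed to be markup already and
   are copied literally.  Return false when OUTPUT is too small.  */
bool
latin1_to_html (const char *input, bool diacritics_only,
                char *output, size_t capacity)
{
  size_t used = 0;

  if (capacity == 0)
    return false;

  for (; *input != '\0'; input++)
    {
      unsigned char c = (unsigned char) *input;
      char piece[16];

      if (c >= 0x80)
        /* A Latin-1 byte value is also its Unicode code point.  */
        snprintf (piece, sizeof piece, "&#%u;", (unsigned) c);
      else if (!diacritics_only && c == '<')
        strcpy (piece, "&lt;");
      else if (!diacritics_only && c == '>')
        strcpy (piece, "&gt;");
      else if (!diacritics_only && c == '&')
        strcpy (piece, "&amp;");
      else
        {
          piece[0] = (char) c;
          piece[1] = '\0';
        }

      size_t n = strlen (piece);
      if (used + n + 1 > capacity)
        return false;
      memcpy (output + used, piece, n);
      used += n;
    }
  output[used] = '\0';
  return true;
}
```

Called without the restriction, the sketch protects existing tags and
so spoils a file which was already half-@code{HTML}; called with it,
the tags survive and only the non-@code{ASCII} characters are coded.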
1644
If, after such experiments, you seriously think that the @code{recode}
program does not behave properly, there might be a genuine bug in the
program itself, in which case I invite you to contribute a bug report
(@pxref{Contributing}).
1649
1650@node Library, Universal, Invoking recode, Top
1651@chapter A recoding library
1652
1653@cindex recoding library
1654The program named @code{recode} is just an application of its recoding
1655library.  The recoding library is available separately for other C
1656programs.  A good way to acquire some familiarity with the recoding
1657library is to get acquainted with the @code{recode} program itself.
1658
1659To use the recoding library once it is installed, a C program needs to
1660have a line:
1661
1662@example
1663#include <recode.h>
1664@end example
1665
1666@noindent
1667near its beginning, and the user should have @samp{-lrecode} on the
1668linking call, so modules from the recoding library are found.
1669
1670The library is still under development.  As it stands, it contains four
1671identifiable sets of routines: the outer level functions, the request
1672level functions, the task level functions and the charset level functions.
These are discussed in separate sections.
1674
1675For effectively using the recoding library in most applications, it should
1676be rarely needed to study anything beyond the main initialisation function
1677at outer level, and then, various functions at request level.
1678
1679@menu
1680* Outer level::         Outer level functions
1681* Request level::       Request level functions
1682* Task level::          Task level functions
1683* Charset level::       Charset level functions
1684* Errors::              Handling errors
1685@end menu
1686
1687@node Outer level, Request level, Library, Library
1688@section Outer level functions
1689
1690@cindex outer level functions
1691The outer level functions mainly prepare the whole recoding library for
1692use, or do actions which are unrelated to specific recodings.  Here is
an example of a program which does not really do anything useful.
1694
1695@example
1696@group
#include <stdbool.h>
#include <stdlib.h>
#include <recode.h>
1699
1700const char *program_name;
1701
1702int
1703main (int argc, char *const *argv)
1704@{
1705  program_name = argv[0];
1706  RECODE_OUTER outer = recode_new_outer (true);
1707
1708  recode_delete_outer (outer);
1709  exit (0);
1710@}
1711@end group
1712@end example
1713
1714@vindex RECODE_OUTER structure
1715The header file @code{<recode.h>} declares an opaque @code{RECODE_OUTER}
1716structure, which the programmer should use for allocating a variable in
1717his program (let's assume the programmer is a male, here, no prejudice
1718intended).  This @samp{outer} variable is given as a first argument to
1719all outer level functions.
1720
1721@cindex @code{stdbool.h} header
1722@cindex @code{bool} data type
The @code{<recode.h>} header file uses the Boolean type set up by the
system header file @code{<stdbool.h>}.  But this header file is still
fairly new in C standards, and likely does not exist everywhere.  If your
system does not offer this system header file yet, the proper compilation
of the @code{<recode.h>} file could be guaranteed through the replacement
of the @code{<stdbool.h>} inclusion line by:
1729
1730@example
1731typedef enum @{false = 0, true = 1@} bool;
1732@end example
1733
1734People wanting wider portability, or Autoconf lovers, might arrange their
1735@file{configure.in} for being able to write something more general, like:
1736
1737@example
1738@group
1739#if STDC_HEADERS
1740# include <stdlib.h>
1741#endif
1742
1743/* Some systems do not define EXIT_*, even with STDC_HEADERS.  */
1744#ifndef EXIT_SUCCESS
1745# define EXIT_SUCCESS 0
1746#endif
1747#ifndef EXIT_FAILURE
1748# define EXIT_FAILURE 1
1749#endif
1750/* The following test is to work around the gross typo in systems like Sony
1751   NEWS-OS Release 4.0C, whereby EXIT_FAILURE is defined to 0, not 1.  */
1752#if !EXIT_FAILURE
1753# undef EXIT_FAILURE
1754# define EXIT_FAILURE 1
1755#endif
1756
1757#if HAVE_STDBOOL_H
1758# include <stdbool.h>
1759#else
1760typedef enum @{false = 0, true = 1@} bool;
1761#endif
1762
1763#include <recode.h>
1764
1765const char *program_name;
1766
1767int
1768main (int argc, char *const *argv)
1769@{
1770  program_name = argv[0];
1771  RECODE_OUTER outer = recode_new_outer (true);
1772
  recode_delete_outer (outer);
1774  exit (EXIT_SUCCESS);
1775@}
1776@end group
1777@end example
1778
1779@noindent
1780but we will not insist on such details in the examples to come.
1781
1782@itemize @bullet
1783@item Initialisation functions
1784@cindex initialisation functions, outer
1785
1786@example
1787RECODE_OUTER recode_new_outer (@var{auto_abort});
1788bool recode_delete_outer (@var{outer});
1789@end example
1790
1791@findex recode_new_outer
1792@findex recode_delete_outer
The recoding library absolutely needs to be initialised before being used,
and @code{recode_new_outer} has to be called once, first.  The function
returns the @var{outer} value it has initialised, and accepts a Boolean
argument telling whether or not the library should automatically issue
diagnostics on standard error and abort the whole program on errors.  When
@var{auto_abort} is @code{true}, the library later conveniently issues
diagnostics itself, and aborts the calling program on errors.  This is
merely a convenience, because if this parameter were @code{false}, the
calling program would always have to take care of checking the return
value of all other calls to the recoding library functions, and when any
error is detected, issue a diagnostic and abort processing itself.
1804
1805Regardless of the setting of @var{auto_abort}, all recoding library
1806functions return a success status.  Most functions are geared for returning
1807@code{false} for an error, and @code{true} if everything went fine.
1808Functions returning structures or strings return @code{NULL} instead
1809of the result, when the result cannot be produced.  If @var{auto_abort}
1810is selected, functions either return @code{true}, or do not return at all.
1811
1812As in the example above, @code{recode_new_outer} is called only once in
1813most cases.  Calling @code{recode_new_outer} implies some overhead, so
1814calling it more than once should preferably be avoided.
1815
1816The termination function @code{recode_delete_outer} reclaims the memory
1817allocated by @code{recode_new_outer} for a given @var{outer} variable.
1818Calling @code{recode_delete_outer} prior to program termination is more
aesthetic than useful, as all memory resources are automatically reclaimed
1820when the program ends.  You may spare this terminating call if you prefer.
1821
1822@item The @code{program_name} declaration
1823
1824@cindex @code{program_name} variable
1825As we just explained, the user may set the @code{recode} library so that,
in case of error, it issues the diagnostic itself and aborts the
whole processing.  This capability may be quite convenient.  When this
feature is used, the aborting routine includes the name of the running
program in the diagnostic.  On the other hand, when this feature is not
used, the library merely returns error codes, giving the library user fuller
control over all this.  This behaviour is more like what usual libraries
do: they return codes and never abort.  However, I would rather not force
library users to necessarily check all return codes themselves, by leaving
no other choice.  In most simple applications, letting the library diagnose
and abort is much easier, and quite welcome.  It is precisely because
both possibilities exist that the @code{program_name} variable is needed: it
may be used by the library @emph{when} the user sets it to diagnose itself.
1838@end itemize
1839
1840@node Request level, Task level, Outer level, Library
1841@section Request level functions
1842
1843@cindex request level functions
1844The request level functions are meant to cover most recoding needs
1845programmers may have; they should provide all usual functionality.
1846Their API is almost stable by now.  To get started with request level
functions, here is a full example of a program whose sole job is to filter
1848@code{ibmpc} code on its standard input into @code{latin1} code on its
1849standard output.
1850
1851@example
1852@group
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <recode.h>
1856
1857const char *program_name;
1858
1859int
1860main (int argc, char *const *argv)
1861@{
1862  program_name = argv[0];
1863  RECODE_OUTER outer = recode_new_outer (true);
1864  RECODE_REQUEST request = recode_new_request (outer);
1865  bool success;
1866
1867  recode_scan_request (request, "ibmpc..latin1");
1868
1869  success = recode_file_to_file (request, stdin, stdout);
1870
1871  recode_delete_request (request);
1872  recode_delete_outer (outer);
1873
1874  exit (success ? 0 : 1);
1875@}
1876@end group
1877@end example
1878
1879@vindex RECODE_REQUEST structure
1880The header file @code{<recode.h>} declares a @code{RECODE_REQUEST} structure,
1881which the programmer should use for allocating a variable in his program.
1882This @var{request} variable is given as a first argument to all request
1883level functions, and in most cases, may be considered as opaque.
1884
1885@itemize @bullet
1886@item Initialisation functions
1887@cindex initialisation functions, request
1888
1889@example
1890RECODE_REQUEST recode_new_request (@var{outer});
1891bool recode_delete_request (@var{request});
1892@end example
1893
1894@findex recode_new_request
1895@findex recode_delete_request
No @var{request} variable may be used in other request level
functions of the recoding library before having been initialised by
@code{recode_new_request}.  There may be many such @var{request}
1899variables, in which case, they are independent of one another and
1900they all need to be initialised separately.  To avoid memory leaks, a
1901@var{request} variable should not be initialised a second time without
1902calling @code{recode_delete_request} to ``un-initialise'' it.
1903
As with @code{recode_delete_outer}, calling @code{recode_delete_request}
1905prior to program termination, in the example above, may be left out.
1906
1907@item Fields of @code{struct recode_request}
1908@vindex recode_request structure
1909
1910Here are the fields of a @code{struct recode_request} which may be
1911meaningfully changed, once a @var{request} has been initialised by
1912@code{recode_new_request}, but before it gets used.  It is not very frequent,
1913in practice, that these fields need to be changed.  To access the fields,
1914you need to include @file{recodext.h} @emph{instead} of @file{recode.h},
1915in which case there also is a greater chance that you need to recompile
1916your programs if a new version of the recoding library gets installed.
1917
1918@table @code
1919@item verbose_flag
1920@vindex verbose_flag
1921This field is initially @code{false}.  When set to @code{true}, the
1922library will echo to stderr the sequence of elementary recoding steps
1923needed to achieve the requested recoding.
1924
1925@item diaeresis_char
1926@vindex diaeresis_char
1927This field is initially the ASCII value of a double quote @kbd{"},
1928but it may also be the ASCII value of a colon @kbd{:}.  In @code{texte}
1929charset, some countries use double quotes to mark diaeresis, while other
1930countries prefer colons.  This field contains the diaeresis character
1931for the @code{texte} charset.
1932
1933@item make_header_flag
1934@vindex make_header_flag
1935This field is initially @code{false}.  When set to @code{true}, it
1936indicates that the program is merely trying to produce a recoding table in
1937source form rather than completing any actual recoding.  In such a case,
1938the optimisation of step sequence can be attempted much more aggressively.
1939If the step sequence cannot be reduced to a single step, table production
1940will fail.
1941
1942@item diacritics_only
1943@vindex diacritics_only
This field is initially @code{false}.  For the @code{HTML} and @code{LaTeX}
charsets, it is often convenient to recode the diacriticized characters
1946only, while just not recoding other HTML code using ampersands or angular
1947brackets, or La@TeX{} code using backslashes.  Set the field to @code{true}
1948for getting this behaviour.  In the other charset, one can edit text as
1949well as HTML or La@TeX{} directives.
1950
1951@item ascii_graphics
1952@vindex ascii_graphics
This field is initially @code{false}, and relates to characters 176 to
223 in the @code{ibmpc} charset, which are used to draw boxes.  When set
to @code{true}, while getting out of @code{ibmpc}, ASCII characters are
selected so as to graphically approximate these boxes.
1957@end table
1958
1959@item Study of request strings
1960
1961@example
1962bool recode_scan_request (@var{request}, "@var{string}");
1963@end example
1964
1965@findex recode_scan_request
1966The main role of a @var{request} variable is to describe a set of
1967recoding transformations.  Function @code{recode_scan_request} studies
1968the given @var{string}, and stores an internal representation of it into
1969@var{request}.  Note that @var{string} may be a full-fledged @code{recode}
1970request, possibly including surfaces specifications, intermediary
1971charsets, sequences, aliases or abbreviations (@pxref{Requests}).
1972
1973The internal representation automatically receives some pre-conditioning
1974and optimisation, so the @var{request} may then later be used many times
to achieve many actual recodings.  It would not be efficient to call
@code{recode_scan_request} many times with the same @var{string}; it is
better to have many @var{request} variables instead.
1978
1979@item Actual recoding jobs
1980
1981Once the @var{request} variable holds the description of a recoding
1982transformation, a few functions use it for achieving an actual recoding.
Either input or output of a recoding may be a string, an in-memory buffer,
1984or a file.
1985
1986Functions with names like
1987@code{recode_@var{input-type}_to_@var{output-type}} request an actual
recoding, and are described below.  It is easy to remember which arguments
each function accepts, once a few simple principles are grasped for each
possible @var{type}.  However, one of the recoding functions escapes these
principles and is discussed separately, first.
1992
1993@example
char *recode_string (@var{request}, @var{string});
1995@end example
1996
1997@findex recode_string
1998The function @code{recode_string} recodes @var{string} according
1999to @var{request}, and directly returns the resulting recoded string
2000freshly allocated, or @code{NULL} if the recoding could not succeed for
2001some reason.  When this function is used, it is the responsibility of
2002the programmer to ensure that the memory used by the returned string is
2003later reclaimed.
2004
2005@findex recode_string_to_buffer
2006@findex recode_string_to_file
2007@findex recode_buffer_to_buffer
2008@findex recode_buffer_to_file
2009@findex recode_file_to_buffer
2010@findex recode_file_to_file
2011@example
bool recode_string_to_buffer (@var{request},
  @var{input_string},
  &@var{output_buffer}, &@var{output_length}, &@var{output_allocated});
bool recode_string_to_file (@var{request},
  @var{input_string},
  @var{output_file});
2018bool recode_buffer_to_buffer (@var{request},
2019  @var{input_buffer}, @var{input_length},
2020  &@var{output_buffer}, &@var{output_length}, &@var{output_allocated});
2021bool recode_buffer_to_file (@var{request},
2022  @var{input_buffer}, @var{input_length},
2023  @var{output_file});
2024bool recode_file_to_buffer (@var{request},
2025  @var{input_file},
2026  &@var{output_buffer}, &@var{output_length}, &@var{output_allocated});
2027bool recode_file_to_file (@var{request},
2028  @var{input_file},
2029  @var{output_file});
2030@end example
2031
2032All these functions return a @code{bool} result, @code{false} meaning that
2033the recoding was not successful, often because of reversibility issues.
The name of each function clearly indicates which type it reads and which
2035type it produces.  Let's discuss these three types in turn.
2036
2037@table @asis
2038@item string
2039
2040A string is merely an in-memory buffer which is terminated by a @code{NUL}
2041character (using as many bytes as needed), instead of being described
2042by a byte length.  For input, a pointer to the buffer is given through
2043one argument.
2044
It is notable that there are no @code{to_string} functions.  Only one
2046function recodes into a string, and it is @code{recode_string}, which
2047has already been discussed separately, above.
2048
2049@item buffer
2050
2051A buffer is a sequence of bytes held in computer memory.  For input, two
2052arguments provide a pointer to the start of the buffer and its byte size.
2053Note that for charsets using many bytes per character, the size is given
2054in bytes, not in characters.
2055
2056For output, three arguments provide the address of three variables, which
2057will receive the buffer pointer, the used buffer size in bytes, and the
2058allocated buffer size in bytes.  If at the time of the call, the buffer
2059pointer is @code{NULL}, then the allocated buffer size should also be zero,
2060and the buffer will be allocated afresh by the recoding functions.  However,
2061if the buffer pointer is not @code{NULL}, it should be already allocated,
2062the allocated buffer size then gives its size.  If the allocated size
2063gets exceeded while the recoding goes, the buffer will be automatically
2064reallocated bigger, probably elsewhere, and the allocated buffer size will
2065be adjusted accordingly.
2066
2067The second variable, giving the in-memory buffer size, will receive the
2068exact byte size which was needed for the recoding.  A @code{NUL} character
2069is guaranteed at the end of the produced buffer, but is not counted in the
2070byte size of the recoding.  Beyond that @code{NUL}, there might be some
2071extra space after the recoded data, extending to the allocated buffer size.
2072
2073@item file
2074
2075@findex recode_filter_open@r{, not available}
2076@findex recode_filter_close@r{, not available}
2077A file is a sequence of bytes held outside computer memory, but
2078buffered through it.  For input, one argument provides a pointer to a
2079file already opened for read.  The file is then read and recoded from its
2080current position until the end of the file, effectively swallowing it in
2081memory if the destination of the recoding is a buffer.  For reading a file
2082filtered through the recoding library, but only a little bit at a time, one
2083should rather use @code{recode_filter_open} and @code{recode_filter_close}
2084(these two functions are not yet available).
2085
2086For output, one argument provides a pointer to a file already opened
2087for write.  The result of the recoding is written to that file starting
2088at its current position.
2089@end table
2090@end itemize
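
The output buffer conventions described above may be summarised by the
following C sketch.  The helper @code{store_output} is hypothetical and
is not part of the recoding library; it only mimics how the three
output variables are meant to evolve, under the assumption that the
buffer was obtained from @code{malloc} or is @code{NULL}.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: store DATA_LENGTH bytes of DATA into *BUFFER,
   following the conventions of the output buffer arguments.  */
bool
store_output (const char *data, size_t data_length,
              char **buffer, size_t *used, size_t *allocated)
{
  size_t needed = data_length + 1;      /* room for the trailing NUL */

  if (*buffer == NULL || *allocated < needed)
    {
      /* realloc (NULL, n) acts like malloc (n).  The buffer may move,
         which is why its address is passed rather than its value.  */
      char *bigger = realloc (*buffer, needed);
      if (bigger == NULL)
        return false;
      *buffer = bigger;
      *allocated = needed;
    }

  memcpy (*buffer, data, data_length);
  (*buffer)[data_length] = '\0';
  *used = data_length;                  /* the NUL is not counted */
  return true;
}
```

A caller would start with a @code{NULL} pointer and two zero sizes,
reuse the same three variables for later calls so that the buffer is
recycled when big enough, and release it with @code{free} at the end.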
2091
2092@findex recode_format_table
2093The following special function is still subject to change:
2094
2095@example
2096void recode_format_table (@var{request}, @var{language}, "@var{name}");
2097@end example
2098
2099@noindent
2100and is not documented anymore for now.
2101
2102@node Task level, Charset level, Request level, Library
2103@section Task level functions
2104@cindex task level functions
2105
The task level functions are used internally by the request level
functions; they allow more explicit control over files and memory
2108buffers holding input and output to recoding processes.  The interface
2109specification of task level functions is still subject to change a bit.
2110
2111To get started with task level functions, here is a full example of a
program whose sole job is to filter @code{ibmpc} code on its standard input
2113into @code{latin1} code on its standard output.  That is, this program has
2114the same goal as the one from the previous section, but does its things
2115a bit differently.
2116
2117@example
2118@group
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <recodext.h>
2122
2123const char *program_name;
2124
2125int
2126main (int argc, char *const *argv)
2127@{
2128  program_name = argv[0];
2129  RECODE_OUTER outer = recode_new_outer (false);
2130  RECODE_REQUEST request = recode_new_request (outer);
2131  RECODE_TASK task;
2132  bool success;
2133
2134  recode_scan_request (request, "ibmpc..latin1");
2135
2136  task = recode_new_task (request);
  task->input.name = "";
  task->output.name = "";
2139  success = recode_perform_task (task);
2140
2141  recode_delete_task (task);
2142  recode_delete_request (request);
2143  recode_delete_outer (outer);
2144
2145  exit (success ? 0 : 1);
2146@}
2147@end group
2148@end example
2149
2150@vindex RECODE_TASK structure
2151The header file @code{<recode.h>} declares a @code{RECODE_TASK}
2152structure, which the programmer should use for allocating a variable in
2153his program.  This @code{task} variable is given as a first argument to
2154all task level functions.  The programmer ought to change and possibly
2155consult a few fields in this structure, using special functions.
2156
2157@itemize @bullet
2158@item Initialisation functions
2159@cindex initialisation functions, task
2160
2161@findex recode_new_task
2162@findex recode_delete_task
2163@example
2164RECODE_TASK recode_new_task (@var{request});
2165bool recode_delete_task (@var{task});
2166@end example
2167
2168No @var{task} variable may be used in other task level functions
2169of the recoding library without having first been initialised with
2170@code{recode_new_task}.  There may be many such @var{task} variables,
2171in which case, they are independent of one another and they all need to be
2172initialised separately.  To avoid memory leaks, a @var{task} variable should
2173not be initialised a second time without calling @code{recode_delete_task} to
2174``un-initialise'' it.  This function also accepts a @var{request} argument
2175and associates the request to the task.  In fact, a task is essentially
2176a set of recoding transformations with the specification for its current
2177input and its current output.
2178
The @var{request} variable may be scanned before or after the call to
@code{recode_new_task}; it does not matter so far.  Immediately after
initialisation, before further changes, the @var{task} variable associates
@var{request} with empty in-memory buffers for both input and output.
2183The output buffer will later get allocated automatically on the fly,
2184as needed, by various task processors.
2185
2186Even if a call to @code{recode_delete_task} is not strictly mandatory
2187before ending the program, it is cleaner to always include it.  Moreover,
2188in some future version of the recoding library, it might become required.
2189
@item Fields of @code{struct recode_task}
@vindex recode_task structure

Here are the fields of a @code{struct recode_task} which may be meaningfully
2194changed, once a @var{task} has been initialised by @code{recode_new_task}.
2195In fact, fields are expected to change.  Once again, to access the fields,
2196you need to include @file{recodext.h} @emph{instead} of @file{recode.h},
2197in which case there also is a greater chance that you need to recompile
2198your programs if a new version of the recoding library gets installed.
2199
2200@table @code
2201@item request
2202
2203The field @code{request} points to the current recoding request, but may
2204be changed as needed between recoding calls, for example when there is
2205a need to achieve the construction of a resulting text made up of many
2206pieces, each being recoded differently.
2207
2208@item input.name
2209@itemx input.file
2210
If @code{input.name} is not @code{NULL} at start of a recoding, this is
a request that a file by that name be first opened for reading and later
automatically closed once the whole file has been read.  If the file name is
not @code{NULL} but an empty string, it means that standard input is to
be used.  The opened file pointer is then held in @code{input.file}.

If @code{input.name} is @code{NULL} and @code{input.file} is not, then
@code{input.file} should point to a file already opened for read, which
is meant to be recoded.
2220
2221@item input.buffer
2222@itemx input.cursor
2223@itemx input.limit
2224
2225When both @code{input.name} and @code{input.file} are @code{NULL}, three
2226pointers describe an in-memory buffer containing the text to be recoded.
2227The buffer extends from @code{input.buffer} to @code{input.limit},
2228yet the text to be recoded only extends from @code{input.cursor} to
2229@code{input.limit}.  In most situations, @code{input.cursor} starts with
2230the value that @code{input.buffer} has.  (Its value will internally advance
2231as the recoding goes, until it reaches the value of @code{input.limit}.)
2232
2233@item output.name
2234@itemx output.file
2235
2236If @code{output.name} is not @code{NULL} at start of a recoding, this
2237is a request that a file by that name be opened for write and later
2238automatically closed after the recoding is done.  If the file name is
2239not @code{NULL} but an empty string, it means that standard output is to
be used.  The opened file pointer is then held in @code{output.file}.
2241If several passes with intermediate files are needed to produce the
2242recoding, the @code{output.name} file is opened only for the final pass.
2243
2244If @code{output.name} is @code{NULL} and @code{output.file} is not, then
2245@code{output.file} should point to a file already opened for write, which
2246will receive the result of the recoding.
2247
2248@item output.buffer
2249@itemx output.cursor
2250@itemx output.limit
2251
2252When both @code{output.name} and @code{output.file} are @code{NULL}, three
2253pointers describe an in-memory buffer meant to receive the text, once it
2254is recoded.  The buffer is already allocated from @code{output.buffer}
2255to @code{output.limit}.  In most situations, @code{output.cursor} starts
2256with the value that @code{output.buffer} has.  Once the recoding is done,
2257@code{output.cursor} will point at the next free byte in the buffer,
2258just after the recoded text, so another recoding could be called without
2259changing any of these three pointers, for appending new information to it.
2260The number of recoded bytes in the buffer is the difference between
2261@code{output.cursor} and @code{output.buffer}.
2262
2263Each time @code{output.cursor} reaches @code{output.limit}, the buffer
2264is reallocated bigger, possibly at a different location in memory, always
2265held up-to-date in @code{output.buffer}.  It is still possible to call a
2266task level function with no output buffer at all to start with, in which
2267case all three fields should have @code{NULL} as a value.  This is the
2268situation immediately after a call to @code{recode_new_task}.
2269
2270@item strategy
2271@vindex strategy
2272@vindex RECODE_STRATEGY_UNDECIDED
2273This field, which is of type @code{enum recode_sequence_strategy}, tells
2274how various recoding steps (passes) will be interconnected.  Its initial
2275value is @code{RECODE_STRATEGY_UNDECIDED}, which is a constant defined in
2276the header file @file{<recodext.h>}.  Other possible values are:
2277
2278@table @code
2279@item RECODE_SEQUENCE_IN_MEMORY
2280@vindex RECODE_SEQUENCE_IN_MEMORY
2281Keep intermediate recodings in memory.
2282@item RECODE_SEQUENCE_WITH_FILES
2283@vindex RECODE_SEQUENCE_WITH_FILES
2284Do not fork, use intermediate files.
2285@item RECODE_SEQUENCE_WITH_PIPE
2286@vindex RECODE_SEQUENCE_WITH_PIPE
2287Fork processes connected with @code{pipe(2)}.
2288@end table
2289
2290@c FIXME
2291The best for now is to leave this field alone, and let the recoding
2292library decide its strategy, as many combinations have not been tested yet.
2293
2294@item byte_order_mark
2295@vindex byte_order_mark
2296This field, which is preset to @code{true}, indicates that a byte order
2297mark is to be expected at the beginning of any canonical @code{UCS-2}
2298or @code{UTF-16} text, and that such a byte order mark should be also
2299produced for these charsets.
2300
2301@item fail_level
2302@vindex fail_level
This field, which is of type @code{enum recode_error} (@pxref{Errors}),
sets the error level at which task level functions should report a failure.
If an error being detected is equal to or greater than @code{fail_level},
the function will eventually return @code{false} instead of @code{true}.
The preset value for this field is @code{RECODE_NOT_CANONICAL}, which means
that if not reset to another value, the library will report failure on
@emph{any} error.
2310
2311@item abort_level
2312@vindex abort_level
2313@vindex RECODE_MAXIMUM_ERROR
This field, which is of type @code{enum recode_error} (@pxref{Errors}), sets
the error level at which task level functions should immediately interrupt
their processing.  If an error being detected is equal to or greater than
@code{abort_level}, the function returns immediately, but the returned
value (@code{true} or @code{false}) is still decided from the setting
of @code{fail_level}, not @code{abort_level}.  The preset value for this
field is @code{RECODE_MAXIMUM_ERROR}, which means that if not reset to
another value, the library will never interrupt a recoding task.
2322
2323@item error_so_far
2324@vindex error_so_far
2325This field, which is of type @code{enum recode_error} (@pxref{Errors}),
2326maintains the maximum error level met so far while the recoding task
2327was proceeding.  The preset value is @code{RECODE_NO_ERROR}.
2328@end table
2329
2330@item Task execution
2331@cindex task execution
2332
2333@findex recode_perform_task
2334@findex recode_filter_open
2335@findex recode_filter_close
2336@example
2337recode_perform_task (@var{task});
2338recode_filter_open (@var{task}, @var{file});
2339recode_filter_close (@var{task});
2340@end example
2341
The function @code{recode_perform_task} reads as much input as possible,
and recodes all of it on the prescribed output, given a properly initialised
@var{task}.
2345
2346Functions @code{recode_filter_open} and @code{recode_filter_close} are
2347only planned for now.  They are meant to read input in piecemeal ways.
2348Even if functionality already exists informally in the library, it has
2349not been made available yet through such interface functions.
2350@end itemize
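
@noindent
The interplay of @code{output.buffer}, @code{output.cursor} and
@code{output.limit} described above can be pictured with a small
stand-alone model.  The sketch below does not call the recoding library
at all: @code{demo_buffer} and @code{demo_put_byte} are hypothetical names
introduced only for illustration, and the doubling growth policy is an
assumption, not necessarily what the library does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the three output pointers of a task.  */
struct demo_buffer
{
  char *buffer;                 /* start of the allocated area, or NULL */
  char *cursor;                 /* next free byte */
  char *limit;                  /* one past the end of the allocated area */
};

/* Append one byte, growing the area when cursor has reached limit.
   Returns false only if memory could not be obtained.  */
static bool
demo_put_byte (struct demo_buffer *out, char byte)
{
  /* buffer and limit are NULL together, or neither of them is.  */
  assert ((out->buffer == NULL) == (out->limit == NULL));

  if (out->cursor == out->limit)
    {
      size_t used = out->buffer ? (size_t) (out->cursor - out->buffer) : 0;
      size_t size = used ? 2 * used : 16;       /* assumed growth policy */
      char *moved = realloc (out->buffer, size);

      if (!moved)
        return false;
      /* The area may have moved: refresh all three pointers.  */
      out->buffer = moved;
      out->cursor = moved + used;
      out->limit = moved + size;
    }
  *out->cursor++ = byte;
  return true;
}
```

Starting from three @code{NULL} pointers and calling @code{demo_put_byte}
repeatedly leaves the number of stored bytes in the difference between
@code{cursor} and @code{buffer}, the same arithmetic as for the task fields.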
2351
2352@node Charset level, Errors, Task level, Library
2353@section Charset level functions
2354@cindex charset level functions
2355
2356@cindex internal functions
2357Many functions are internal to the recoding library.  Some of them
2358have been made external and available, for the @code{recode} program
2359had to retain all its previous functionality while being transformed
2360into a mere application of the recoding library.  These functions are
not really documented here for the time being, as we hope that many of
them will vanish over time.  Once this set of routines stabilises,
it would be convenient to document them as an API for handling charset
names and contents.
2365
2366@findex find_charset
2367@findex list_all_charsets
2368@findex list_concise_charset
2369@findex list_full_charset
2370@example
2371RECODE_CHARSET find_charset (@var{name}, @var{cleaning-type});
2372bool list_all_charsets (@var{charset});
2373bool list_concise_charset (@var{charset}, @var{list-format});
2374bool list_full_charset (@var{charset});
2375@end example
2376
2377@node Errors,  , Charset level, Library
2378@section Handling errors
2379@cindex error handling
2380@cindex handling errors
2381
2382@cindex error messages
2383The @code{recode} program, while using the @code{recode} library, needs to
2384control whether recoding problems are reported or not, and then reflect
2385these in the exit status.  The program should also instruct the library
2386whether the recoding should be abruptly interrupted when an error is
2387met (so sparing processing when it is known in advance that a wrong
2388result would be discarded anyway), or if it should proceed nevertheless.
2389Here is how the library groups errors into levels, listed here in order
2390of increasing severity.
2391
2392@table @code
2393@item RECODE_NO_ERROR
2394@vindex RECODE_NO_ERROR
2395
2396No error was met on previous library calls.
2397
2398@item RECODE_NOT_CANONICAL
2399@vindex RECODE_NOT_CANONICAL
2400@cindex non canonical input, error message
2401
2402The input text was using one of the many alternative codings for some
2403phenomenon, but not the one @code{recode} would have canonically generated.
2404So, if the reverse recoding is later attempted, it would produce a text
2405having the same @emph{meaning} as the original text, yet not being byte
2406identical.
2407
For example, a @code{Base64} block in which end-of-lines appear elsewhere
than at every 76 characters is not canonical.  An e-circumflex in @TeX{}
which is coded as @samp{\^@{e@}} instead of @samp{\^e} is not canonical.
2411
2412@item RECODE_AMBIGUOUS_OUTPUT
2413@vindex RECODE_AMBIGUOUS_OUTPUT
2414@cindex ambiguous output, error message
2415
2416It has been discovered that if the reverse recoding was attempted on
2417the text output by this recoding, we would not obtain the original text,
2418only because an ambiguity was generated by accident in the output text.
2419This ambiguity would then cause the wrong interpretation to be taken.
2420
2421Here are a few examples.  If the @code{Latin-1} sequence @samp{e^}
2422is converted to Easy French and back, the result will be interpreted
2423as e-circumflex and so, will not reflect the intent of the original two
2424characters.  Recoding an @code{IBM-PC} text to @code{Latin-1} and back,
2425where the input text contained an isolated @kbd{LF}, will have a spurious
2426@kbd{CR} inserted before the @kbd{LF}.
2427
2428Currently, there are many cases in the library where the production of
2429ambiguous output is not properly detected, as it is sometimes a difficult
2430problem to accomplish this detection, or to do it speedily.
2431
2432@item RECODE_UNTRANSLATABLE
2433@vindex RECODE_UNTRANSLATABLE
2434@cindex untranslatable input, error message
2435
One or more input characters could not be recoded, because there is just
no representation for them in the output charset.
2438
2439Here are a few examples.  Non-strict mode often allows @code{recode} to
2440compute on-the-fly mappings for unrepresentable characters, but strict
2441mode prohibits such attribution of reversible translations: so strict
2442mode might often trigger such an error.  Most @code{UCS-2} codes used to
2443represent Asian characters cannot be expressed in various Latin charsets.
2444
2445@item RECODE_INVALID_INPUT
2446@vindex RECODE_INVALID_INPUT
2447@cindex invalid input, error message
2448
2449The input text does not comply with the coding it is declared to hold.  So,
2450there is no way by which a reverse recoding would reproduce this text,
2451because @code{recode} should never produce invalid output.
2452
Here are a few examples.  In strict mode, @code{ASCII} text is not allowed
to contain characters with the eighth bit set.  @code{UTF-8} encodings
2455ought to be minimal@footnote{The minimality of an @code{UTF-8} encoding
2456is guaranteed on output, but currently, it is not checked on input.}.
2457
2458@item RECODE_SYSTEM_ERROR
2459@vindex RECODE_SYSTEM_ERROR
2460@cindex system detected problem, error message
2461
2462The underlying system reported an error while the recoding was going on,
2463likely an input/output error.
2464(This error symbol is currently unused in the library.)
2465
2466@item RECODE_USER_ERROR
2467@vindex RECODE_USER_ERROR
2468@cindex misuse of recoding library, error message
2469
2470The programmer or user requested something the recoding library is unable
2471to provide, or used the API wrongly.
2472(This error symbol is currently unused in the library.)
2473
2474@item RECODE_INTERNAL_ERROR
2475@vindex RECODE_INTERNAL_ERROR
2476@cindex internal recoding bug, error message
2477
2478Something really wrong, which should normally never happen, was detected
2479within the recoding library.  This might be due to genuine bugs in the
2480library, or maybe due to un-initialised or overwritten arguments to
2481the API.
2482(This error symbol is currently unused in the library.)
2483
2484@item RECODE_MAXIMUM_ERROR
2485@vindex RECODE_MAXIMUM_ERROR
2486
This error code should never be returned; it is only used internally as
a sentinel for the list of all possible error codes.
2489@end table
2490
2491@cindex error level threshold
2492@cindex threshold for error reporting
2493One should be able to set the error level threshold for returning failure
2494at end of recoding, and also the threshold for immediate interruption.
If many errors occur while the recoding proceeds, which are not severe
2496enough to interrupt the recoding, then the most severe error is retained,
2497while others are forgotten@footnote{Another approach would have been
2498to define the level symbols as masks instead, and to give masks to
2499threshold setting routines, and to retain all errors---yet I never
2500met myself such a need in practice, and so I fear it would be overkill.
2501On the other hand, it might be interesting to maintain counters about
2502how many times each kind of error occurred.}.  So, in case of an error,
2503the possible actions currently are:
2504
2505@itemize @bullet
2506@item do nothing and let go, returning success at end of recoding,
2507@item just let go for now, but return failure at end of recoding,
2508@item interrupt recoding right away and return failure now.
2509@end itemize
2510
2511@noindent
2512@xref{Task level}, and particularly the description of the fields
2513@code{fail_level}, @code{abort_level} and @code{error_so_far}, for more
2514information about how errors are handled.
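
The combined effect of the two thresholds can be summarised by a small
stand-alone model.  This is only a sketch under stated assumptions: the
@code{DEMO_} symbols and the @code{demo_run} function are hypothetical,
they merely mirror the ordering of levels listed above and are not part
of the recoding library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Local mirror of the severity ordering listed above; these DEMO_
   names are hypothetical and are not the library's own symbols.  */
enum demo_error
{
  DEMO_NO_ERROR,
  DEMO_NOT_CANONICAL,
  DEMO_AMBIGUOUS_OUTPUT,
  DEMO_UNTRANSLATABLE,
  DEMO_INVALID_INPUT,
  DEMO_SYSTEM_ERROR,
  DEMO_USER_ERROR,
  DEMO_INTERNAL_ERROR,
  DEMO_MAXIMUM_ERROR
};

struct demo_task
{
  enum demo_error fail_level;
  enum demo_error abort_level;
  enum demo_error error_so_far;
};

/* Feed a sequence of detected errors to the task.  Stores in *seen how
   many of them were examined before stopping, and returns success.  */
static bool
demo_run (struct demo_task *task, const enum demo_error *errors,
          size_t count, size_t *seen)
{
  size_t index;

  /* A threshold of DEMO_NO_ERROR would make every run fail.  */
  assert (task->fail_level > DEMO_NO_ERROR);
  task->error_so_far = DEMO_NO_ERROR;
  for (index = 0; index < count; index++)
    {
      /* Only the most severe error is retained.  */
      if (errors[index] > task->error_so_far)
        task->error_so_far = errors[index];
      /* Interrupt right away once the abort threshold is reached.  */
      if (errors[index] >= task->abort_level)
        {
          index++;
          break;
        }
    }
  *seen = index;
  /* Success or failure is decided from fail_level alone.  */
  return task->error_so_far < task->fail_level;
}
```

With the preset thresholds of the model (fail at the lowest level, never
abort), any error yields a failure after all input has been examined;
raising @code{fail_level} lets milder errors pass, and lowering
@code{abort_level} stops the run early.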
2515
2516@ignore
2517@c FIXME: Take a look at these matters, indeed.
2518
2519A last topic around errors is where the error related fields are kept.
2520To work nicely with threads, my feeling is that the main API levels (based
2521on either of @code{struct recode_outer}, @code{struct recode_request}
2522or @code{struct recode_task}) should each have their error thresholds,
2523values, and last explicit message strings.  Thresholds would be inherited
2524by requests from outers, and by tasks from requests.  Error values and
2525strings would be automatically propagated out from tasks to requests,
2526for these request level routines which internally set up and use recoding
2527tasks.
2528
2529One simple way to avoid locking while sparing the initialisation of many
2530identical requests, a programmer could prepare the common request before
2531splitting threads, and merely @emph{copy} the @code{struct recode_request}
2532so each thread has its own copy---either using a mere assignment or
2533@code{memcpy}.  The same could be said for @code{struct recode_outer}
2534or @code{struct recode_task} blocks, yet it makes less sense to me to do
2535so in practice.
2536@end ignore
2537
2538@node Universal, libiconv, Library, Top
2539@chapter The universal charset
2540
2541@cindex ISO 10646
Standard @w{ISO 10646} defines a universal character set, intended to encompass
in the long run all languages written on this planet.  It is based on
wide characters, and offers possibilities for two billion characters
(@math{2^31}).
2546
2547@tindex UCS
2548This charset was to become available in @code{recode} under the name
2549@code{UCS}, with many external surfaces for it.  But in the current
2550version, only surfaces of @code{UCS} are offered, each presented as a
2551genuine charset rather than a surface.  Such surfaces are only meaningful
2552for the @code{UCS} charset, so it is not that useful to draw a line
2553between the surfaces and the only charset to which they may apply.
2554
2555@tindex UTF-1
@code{UCS} stands for Universal Character Set.  @code{UCS-2} and
@code{UCS-4} are fixed length encodings, using two or four bytes per
character respectively.  @code{UTF} stands for @code{UCS} Transformation
Format; the @code{UTF} encodings are variable length encodings dedicated
to @code{UCS}.
@code{UTF-1} was based on @w{ISO 2022} and did not succeed@footnote{It is not
probable that @code{recode} will ever support @code{UTF-1}.}.  @code{UTF-2}
replaced it; it has been called @code{UTF-FSS} (File System Safe) in
Unicode or Plan9 context, but is better known today as @code{UTF-8}.
To complete the picture, there is @code{UTF-16}, based on 16-bit units,
and @code{UTF-7}, which is meant for transmissions limited to 7-bit bytes.
2566Most often, one might see @code{UTF-8} used for external storage, and
2567@code{UCS-2} used for internal storage.
2568
2569@c FIXME: the manual never explains what the U+NNNN notation means!
2570When @code{recode} is producing any representation of @code{UCS},
2571it uses the replacement character @code{U+FFFD} for any @emph{valid}
character which is not representable in the goal charset@footnote{This
is when the goal charset allows for 16 bits.  For shorter charsets,
the @samp{--strict} (@samp{-s}) option decides what happens: either the
character is dropped, or a reversible mapping is produced on the fly.}.
This happens, for example, when @code{UCS-2} is not able to represent a
wide @code{UCS-4} character or, for a similar reason, a @code{UTF-8}
sequence using more than three bytes.  The replacement character is
2579meant to represent an existing character.  So, it is never produced to
2580represent an invalid sequence or ill-formed character in the input text.
2581In such cases, @code{recode} just gets rid of the noise, while taking note
2582of the error in its usual ways.
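
As a minimal sketch of the first case above, narrowing a valid
@code{UCS-4} value to @code{UCS-2} could look like the following.  The
function name @code{demo_ucs4_to_ucs2} is hypothetical and this is not
the library's own code; invalid input, which @code{recode} handles
differently, is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_REPLACEMENT 0xFFFD /* the replacement character U+FFFD */

/* Narrow a valid UCS-4 code to UCS-2, substituting the replacement
   character for codes which do not fit in 16 bits.  */
static uint16_t
demo_ucs4_to_ucs2 (uint32_t code)
{
  assert (code <= 0x7FFFFFFF);  /* UCS codes fit in 31 bits */
  return code <= 0xFFFF ? (uint16_t) code : DEMO_REPLACEMENT;
}
```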
2583
2584Even if @code{UTF-8} is an encoding, really, it is the encoding of a single
2585character set, and nothing else.  It is useful to distinguish between an
2586encoding (a @emph{surface} within @code{recode}) and a charset, but only
2587when the surface may be applied to several charsets.  Specifying a charset
2588is a bit simpler than specifying a surface in a @code{recode} request.
There would not be a practical advantage in imposing a more complex syntax
on @code{recode} users, when it is simple to assimilate @code{UTF-8} to
2591a charset.  Similar considerations apply for @code{UCS-2}, @code{UCS-4},
2592@code{UTF-16} and @code{UTF-7}.  These are all considered to be charsets.
2593
2594@menu
2595* UCS-2::               Universal Character Set, 2 bytes
2596* UCS-4::               Universal Character Set, 4 bytes
2597* UTF-7::               Universal Transformation Format, 7 bits
2598* UTF-8::               Universal Transformation Format, 8 bits
2599* UTF-16::              Universal Transformation Format, 16 bits
2600* count-characters::    Frequency count of characters
2601* dump-with-names::     Fully interpreted UCS dump
2602@end menu
2603
2604@node UCS-2, UCS-4, Universal, Universal
2605@section Universal Character Set, 2 bytes
2606
2607@tindex UCS-2
2608@cindex Unicode
2609One surface of @code{UCS} is usable for the subset defined by its first
2610sixty thousand characters (in fact, @math{31 * 2^11} codes), and uses
2611exactly two bytes per character.  It is a mere dump of the internal
2612memory representation which is @emph{natural} for this subset and as such,
2613conveys with it endianness problems.
2614
2615@cindex byte order mark
A non-empty @code{UCS-2} file normally begins with a so-called @dfn{byte
2617order mark}, having value @code{0xFEFF}.  The value @code{0xFFFE} is not an
2618@code{UCS} character, so if this value is seen at the beginning of a file,
2619@code{recode} reacts by swapping all pairs of bytes.  The library also
2620properly reacts to other occurrences of @code{0xFEFF} or @code{0xFFFE}
2621elsewhere than at the beginning, because concatenation of @code{UCS-2}
2622files should stay a simple matter, but it might trigger a diagnostic
2623about non canonical input.
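
The reaction to byte order marks described above can be sketched as
follows.  This is an illustration only, not the library's implementation:
@code{demo_read_ucs2} is a hypothetical name, the input is assumed to hold
an even number of bytes, and the diagnostic about non canonical input is
not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode COUNT bytes of UCS-2 from IN into OUT, which must have room
   for COUNT / 2 values.  Returns the number of values stored.  */
static size_t
demo_read_ucs2 (const unsigned char *in, size_t count, uint16_t *out)
{
  int swap = 0;                 /* high order byte first by default */
  size_t stored = 0;
  size_t index;

  assert (count % 2 == 0);
  for (index = 0; index < count; index += 2)
    {
      uint16_t value = swap
        ? (uint16_t) (in[index + 1] << 8 | in[index])
        : (uint16_t) (in[index] << 8 | in[index + 1]);

      if (value == 0xFFFE)
        swap = !swap;           /* reversed mark: flip the byte order */
      else if (value != 0xFEFF)
        out[stored++] = value;  /* ordinary character */
      /* A mark seen in the current order (0xFEFF) is merely dropped.  */
    }
  return stored;
}
```

Because the mark is honoured wherever it appears, two files written with
opposite byte orders can be concatenated and still be read correctly.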
2624
2625By default, when producing an @code{UCS-2} file, @code{recode} always
2626outputs the high order byte before the low order byte.  But this could be
2627easily overridden through the @code{21-Permutation} surface
2628(@pxref{Permutations}).  For example, the command:
2629
2630@example
2631recode u8..u2/21 < @var{input} > @var{output}
2632@end example
2633
2634@noindent
2635asks for an @code{UTF-8} to @code{UCS-2} conversion, with swapped byte
2636output.
2637
2638@tindex ISO-10646-UCS-2, and aliases
2639@tindex BMP
2640@tindex rune
2641@tindex u2
2642Use @code{UCS-2} as a genuine charset.  This charset is available in
2643@code{recode} under the name @code{ISO-10646-UCS-2}.  Accepted aliases
2644are @code{UCS-2}, @code{BMP}, @code{rune} and @code{u2}.
2645
2646@tindex combined-UCS-2
2647@cindex combining characters
The @code{recode} library is able to combine some sequences of
@code{UCS-2} codes into single code characters, to represent a few diacriticized
characters, ligatures or diphthongs which have been included to ease
2651mapping with other existing charsets.  It is also able to explode
2652such single code characters into the corresponding sequence of codes.
2653The request syntax for triggering such operations is rudimentary and
2654temporary.  The @code{combined-UCS-2} pseudo character set is a special
2655form of @code{UCS-2} in which known combinings have been replaced by the
2656simpler code.  Using @code{combined-UCS-2} instead of @code{UCS-2} in an
2657@emph{after} position of a request forces a combining step, while using
2658@code{combined-UCS-2} instead of @code{UCS-2} in a @emph{before} position
2659of a request forces an exploding step.  For the time being, one has to
2660resort to advanced request syntax to achieve other effects.  For example:
2661
2662@example
2663recode u8..co,u2..u8 < @var{input} > @var{output}
2664@end example
2665
2666@noindent
2667copies an @code{UTF-8} @var{input} over @var{output}, still to be in
2668@code{UTF-8}, yet merging combining characters into single codes whenever
2669possible.
2670
2671@node UCS-4, UTF-7, UCS-2, Universal
2672@section Universal Character Set, 4 bytes
2673
2674@tindex UCS-4
2675Another surface of @code{UCS} uses exactly four bytes per character, and is
2676a mere dump of the internal memory representation which is @emph{natural}
2677for the whole charset and as such, conveys with it endianness problems.
2678
2679@tindex ISO-10646-UCS-4, and aliases
2680@tindex ISO_10646
2681@tindex 10646
2682@tindex u4
2683Use it as a genuine charset.  This charset is available in @code{recode}
2684under the name @code{ISO-10646-UCS-4}.  Accepted aliases are @code{UCS},
2685@code{UCS-4}, @code{ISO_10646}, @code{10646} and @code{u4}.
2686
2687@node UTF-7, UTF-8, UCS-4, Universal
2688@section Universal Transformation Format, 7 bits
2689
2690@tindex UTF-7
2691@code{UTF-7} comes from IETF rather than ISO, and is described by @w{RFC
26922152}, in the MIME series.  The @code{UTF-7} encoding is meant to fit
2693@code{UCS-2} over channels limited to seven bits per byte.  It proceeds
2694from a mix between the spirit of @code{Quoted-Printable} and methods of
2695@code{Base64}, adapted to Unicode contexts.
2696
2697@tindex UNICODE-1-1-UTF-7, and aliases
2698@tindex TF-7
2699@tindex u7
2700This charset is available in @code{recode} under the name
2701@code{UNICODE-1-1-UTF-7}.  Accepted aliases are @code{UTF-7}, @code{TF-7}
2702and @code{u7}.
2703
2704@node UTF-8, UTF-16, UTF-7, Universal
2705@section Universal Transformation Format, 8 bits
2706
2707@tindex UTF-8
2708Even if @code{UTF-8} does not originally come from IETF, there is now
2709@w{RFC 2279} to describe it.  In letters sent on 1995-01-21 and 1995-04-20,
2710Markus Kuhn writes:
2711
2712@quotation
2713@code{UTF-8} is an @code{ASCII} compatible multi-byte encoding of the @w{ISO
271410646} universal character set (@code{UCS}).  @code{UCS} is a 31-bit superset
2715of all other character set standards.  The first 256 characters of @code{UCS}
2716are identical to those of @w{ISO 8859-1} (@w{Latin-1}).  The @code{UCS-2}
2717encoding of UCS is a sequence of bigendian 16-bit words, the @code{UCS-4}
2718encoding is a sequence of bigendian 32-bit words.  The @code{UCS-2} subset
2719of @w{ISO 10646} is also known as ``Unicode''.  As both @code{UCS-2}
2720and @code{UCS-4} require heavy modifications to traditional @code{ASCII}
2721oriented system designs (e.g. Unix), the @code{UTF-8} encoding has been
2722designed for these applications.
2723
2724In @code{UTF-8}, only @code{ASCII} characters are encoded using bytes
2725below 128.  All other non-ASCII characters are encoded as multi-byte
2726sequences consisting only of bytes in the range 128-253.  This avoids
2727critical bytes like @kbd{NUL} and @kbd{/} in @code{UTF-8} strings, which
2728makes the @code{UTF-8} encoding suitable for being handled by the standard
2729C string library and being used in Unix file names.  Other properties
2730include the preserved lexical sorting order and that @code{UTF-8} allows
2731easy self-synchronisation of software receiving @code{UTF-8} strings.
2732@end quotation
2733
@code{UTF-8} is the most common external surface of @code{UCS}.  Each
character uses from one to six bytes, and the encoding is able to represent
all @math{2^31} characters of the @code{UCS}.  It is implemented as a
charset, with the following properties:
2738
2739@itemize @bullet
2740@item
2741Strict 7-bit @code{ASCII} is completely invariant under @code{UTF-8},
2742and those are the only one-byte characters.  @code{UCS} values and
2743@code{ASCII} values coincide.  No multi-byte characters ever contain bytes
less than 128.  @code{NUL} @emph{is} @code{NUL}.  A multi-byte character
always starts with a byte of 192 or more, and is always followed by a
number of bytes between 128 and 191.  That means that you may read at
random on disk or memory, and easily discover the start of the current,
next or previous character.  You can count, skip or extract characters
with this knowledge alone.
2750
2751@item
If you read the first byte of a multi-byte character in binary, it contains
many @samp{1} bits in succession starting with the most significant one
(from the left), at least two.  The length of this @samp{1} sequence equals
the byte size of the character.  All succeeding bytes start with @samp{10}.
2756This is a lot of redundancy, making it fairly easy to guess that a file
2757is valid @code{UTF-8}, or to safely state that it is not.
2758
2759@item
If you remove all leading @samp{1} bits of the
first byte of a multi-byte character, and the initial @samp{10} bits of
all remaining bytes (so keeping 6 bits per byte for those), the remaining
bits concatenated are the UCS value.
2764@end itemize
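
@noindent
The properties above translate almost directly into code.  Here is a
minimal sketch, assuming well-formed input; @code{demo_utf8_size} and
@code{demo_utf8_decode} are hypothetical helpers written for this
illustration, not functions of the recoding library, and they make no
check for minimality.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte size of the character announced by LEAD: 1 for ASCII, the
   number of leading 1 bits otherwise, 0 if LEAD cannot start one.  */
static int
demo_utf8_size (unsigned char lead)
{
  int ones = 0;

  while (ones < 8 && (lead & (0x80 >> ones)))
    ones++;
  if (ones == 0)
    return 1;
  return ones >= 2 && ones <= 6 ? ones : 0;
}

/* Decode the character starting at TEXT, assuming a well-formed
   sequence; no check for minimality is made.  */
static uint32_t
demo_utf8_decode (const unsigned char *text)
{
  int size = demo_utf8_size (text[0]);
  uint32_t value;
  int index;

  assert (size != 0);
  if (size == 1)
    return text[0];
  /* Drop the leading 1 bits of the first byte.  */
  value = text[0] & (0xFF >> size);
  for (index = 1; index < size; index++)
    {
      assert ((text[index] & 0xC0) == 0x80);    /* must start with 10 */
      value = value << 6 | (text[index] & 0x3F);
    }
  return value;
}
```

For instance, the two bytes 0xC3 0xA9 announce a two-byte character and
decode to the value 0xE9.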
2765
2766@noindent
2767These properties also have a few nice consequences:
2768
2769@itemize @bullet
2770@item
2771Conversion to/from values is algorithmically simple, and reasonably speedy.
2772
2773@item
A sequence of @var{N} bytes can hold characters needing up to 1 + 5@var{N}
bits in their @code{UCS} representation when @var{N} is between 2 and 6;
a single byte holds 7 bits.  So, @code{UTF-8} is most economical when
mapping ASCII (1 byte),
followed by @code{UCS-2} (1 to 3 bytes) and @code{UCS-4} (1 to 6 bytes).
2778
2779@item
2780The lexicographic sorting order of @code{UCS} strings is preserved.
2781
2782@item
2783Bytes with value 254 or 255 never appear, and because of that, these are
2784sometimes used when escape mechanisms are needed.
2785@end itemize
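
@noindent
To illustrate how simple the conversion is, here is a sketch of an
encoder which picks the byte count from the number of significant bits:
7 bits for one byte, then 11, 16, 21, 26 and 31 bits for two to six bytes.
The name @code{demo_utf8_encode} is hypothetical; this is not the code
used by the recoding library.

```c
#include <assert.h>
#include <stdint.h>

/* Store the UTF-8 form of CODE into OUT (room for 6 bytes) and
   return the number of bytes used, always the smallest possible.  */
static int
demo_utf8_encode (uint32_t code, unsigned char *out)
{
  int size;
  int index;

  assert (code <= 0x7FFFFFFF);  /* UCS codes fit in 31 bits */
  if (code < 0x80)
    {
      out[0] = (unsigned char) code;    /* 7 bits fit in one byte */
      return 1;
    }
  /* Find the smallest size, from 2 to 6, offering enough bits.  */
  for (size = 2; size < 6 && (code >> (1 + 5 * size)) != 0; size++)
    ;
  /* Fill continuation bytes from the last one, 6 bits at a time.  */
  for (index = size - 1; index > 0; index--)
    {
      out[index] = (unsigned char) (0x80 | (code & 0x3F));
      code >>= 6;
    }
  /* The first byte starts with SIZE bits set to 1.  */
  out[0] = (unsigned char) (((0xFF00 >> size) & 0xFF) | code);
  return size;
}
```

Since the smallest suitable size is always selected, the output of this
sketch is minimal by construction.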
2786
In some cases, when little processing is done on a lot of strings, one may
choose for efficiency reasons to handle @code{UTF-8} strings directly even
if variable length, as it is easy to find the start of characters.  Character
insertion or replacement might require moving the remainder of the string
in either direction.  In most cases, it is faster and easier to convert
from @code{UTF-8} to @code{UCS-2} or @code{UCS-4} prior to processing.
2793
2794@tindex UTF-8, aliases
2795@tindex UTF-FSS
2796@tindex FSS_UTF
2797@tindex TF-8
2798@tindex u8
2799This charset is available in @code{recode} under the name @code{UTF-8}.
2800Accepted aliases are @code{UTF-2}, @code{UTF-FSS}, @code{FSS_UTF},
2801@code{TF-8} and @code{u8}.
2802
2803@node UTF-16, count-characters, UTF-8, Universal
2804@section Universal Transformation Format, 16 bits
2805
2806@tindex UTF-16, and aliases
2807Another external surface of @code{UCS} is also variable length, each
2808character using either two or four bytes.  It is usable for the subset
2809defined by the first million characters (@math{17 * 2^16}) of @code{UCS}.
2810
2811Martin J. D@"urst writes (to @uref{comp.std.internat}, on 1995-03-28):
2812
2813@quotation
2814@code{UTF-16} is another method that reserves two times 1024 codepoints in
2815Unicode and uses them to index around one million additional characters.
2816@code{UTF-16} is a little bit like former multibyte codes, but quite
2817not so, as both the first and the second 16-bit code clearly show what
2818they are.  The idea is that one million codepoints should be enough for
2819all the rare Chinese ideograms and historical scripts that do not fit
2820into the Base Multilingual Plane of @w{ISO 10646} (with just about 63,000
2821positions available, now that 2,000 are gone).
2822@end quotation
2823
2824@tindex Unicode, an alias for UTF-16
2825@tindex TF-16
2826@tindex u6
2827This charset is available in @code{recode} under the name @code{UTF-16}.
2828Accepted aliases are @code{Unicode}, @code{TF-16} and @code{u6}.
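
As a rough illustration of how the two reserved blocks of 1024 codepoints
index the additional characters, here is a Python sketch.  The block
starting values @code{0xD800} and @code{0xDC00} come from the Unicode
standard rather than from this manual, and the function names are invented
for the example.

```python
def to_utf16_units(code_point):
    """Return the list of 16-bit units representing one UCS value."""
    if code_point < 0x10000:
        if 0xD800 <= code_point <= 0xDFFF:
            raise ValueError("value reserved for building pairs")
        return [code_point]
    if code_point > 0x10FFFF:
        raise ValueError("beyond the 17 * 2^16 values reachable by UTF-16")
    offset = code_point - 0x10000            # 20 bits left to encode
    high = 0xD800 + (offset >> 10)           # first block: upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)          # second block: lower 10 bits
    return [high, low]


def from_utf16_units(units):
    """Rebuild the UCS value from one unit or from a pair of units."""
    if len(units) == 1:
        return units[0]
    high, low = units
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

Since the two blocks do not overlap, a unit taken in isolation tells
whether it starts a pair, ends a pair, or stands alone.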
2829
2830@node count-characters, dump-with-names, UTF-16, Universal
2831@section Frequency count of characters
2832
2833@tindex count-characters
2834@cindex counting characters
2835A device may be used to obtain a list of characters in a file, and how many
2836times each character appears.  Each count is followed by the @code{UCS-2}
2837value of the character and, when known, the @w{RFC 1345} mnemonic for that
2838character.
2839
2840This charset is available in @code{recode} under the name
2841@code{count-characters}.
2842
2843This @code{count} feature has been implemented as a charset.  This may
2844change in some later version, as it would sometimes be convenient to count
2845original bytes, instead of their @code{UCS-2} equivalent.
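
For readers wanting to experiment outside @code{recode}, a frequency count
of this kind may be approximated in Python as below.  The column layout is
an assumption made for the example, and no @w{RFC 1345} mnemonic is
printed; the actual output of @code{count-characters} may differ.

```python
from collections import Counter


def count_characters(text):
    """Return one line per distinct character: its count, then its value."""
    counts = Counter(text)
    lines = []
    for char in sorted(counts):              # sorted by character value
        lines.append("%7d  %04X" % (counts[char], ord(char)))
    return lines
```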
2846
2847@node dump-with-names,  , count-characters, Universal
2848@section Fully interpreted UCS dump
2849
2850@tindex dump-with-names
2851@cindex dumping characters, with description
2852@cindex character streams, description
2853@cindex description of individual characters
Another device may be used to get fully interpreted dumps of a @code{UCS-2}
2855stream of characters, with one @code{UCS-2} character displayed on a full
2856output line.  Each line receives the @w{RFC 1345} mnemonic for the character
2857if it exists, the @code{UCS-2} value of the character, and a descriptive
2858comment for that character.  As each input character produces its own
2859output line, beware that the output file from this conversion may be much,
2860much bigger than the input file.
2861
2862This charset is available in @code{recode} under the name
2863@code{dump-with-names}.
2864
2865This @code{dump-with-names} feature has been implemented as a charset rather
2866than a surface.  This is surely debatable.  The current implementation
2867allows for dumping charsets other than @code{UCS-2}.  For example, the
2868command @w{@samp{recode l2..full < @var{input}}} implies a necessary
2869conversion from @code{Latin-2} to @code{UCS-2}, as @code{dump-with-names}
2870is only connected out from @code{UCS-2}.  In such cases, @code{recode}
2871does not display the original @code{Latin-2} codes in the dump, only the
2872corresponding @code{UCS-2} values.  To give a simpler example, the command
2873
2874@example
2875echo 'Hello, world!' | recode us..dump
2876@end example
2877
2878@noindent
2879produces the following output:
2880
2881@example
2882UCS2   Mne   Description
2883
28840048   H     latin capital letter h
28850065   e     latin small letter e
2886006C   l     latin small letter l
2887006C   l     latin small letter l
2888006F   o     latin small letter o
2889002C   ,     comma
28900020   SP    space
28910077   w     latin small letter w
2892006F   o     latin small letter o
28930072   r     latin small letter r
2894006C   l     latin small letter l
28950064   d     latin small letter d
28960021   !     exclamation mark
2897000A   LF    line feed (lf)
2898@end example
2899
2900The descriptive comment is given in English and @code{ASCII},
2901yet if the English description is not available but a French one is, then
2902the French description is given instead, using @code{Latin-1}.  However,
2903if the @code{LANGUAGE} or @code{LANG} environment variable begins with
2904the letters @samp{fr}, then listing preference goes to French when both
2905descriptions are available.
2906
Here is another example.  To get the long description of the code 237 in
the @w{Latin-5} table, one may use the following command.
2909
2910@example
2911echo -n 237 | recode l5/d..dump
2912@end example
2913
2914@noindent
2915If your @code{echo} does not grok @samp{-n}, use @samp{echo 237\c} instead.
2916Here is how to see what Unicode @code{U+03C6} means, while getting rid of
2917the title lines.
2918
2919@example
echo -n 0x03C6 | recode u2/x2..dump | tail -n +3
2921@end example
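
A similar listing can be approximated with the Python standard library,
which may help when @code{recode} is not at hand.  This sketch assumes
that the names from the @code{unicodedata} module are close enough to the
descriptions printed by @code{recode}; it prints no mnemonic column, and
control characters such as line feed come out with an empty description.

```python
import unicodedata


def dump_with_names(text):
    """Return one line per character: UCS value, then a lower-case name."""
    lines = []
    for char in text:
        name = unicodedata.name(char, "")    # empty when no name is known
        lines.append("%04X   %s" % (ord(char), name.lower()))
    return lines
```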
2922
2923@node libiconv, Tabular, Universal, Top
2924@chapter The @code{iconv} library
2925
2926@cindex @code{iconv} library
2927@cindex library, @code{iconv}
2928@cindex @code{libiconv}
2929@cindex interface, with @code{iconv} library
2930@cindex Haible, Bruno
2931The @code{recode} library itself contains most code and tables from the
2932portable @code{iconv} library, written by Bruno Haible.  In fact, many
2933capabilities of the @code{recode} library are duplicated because of this
2934merging, as the older @code{recode} and @code{iconv} libraries share many
2935charsets.  We discuss, here, the issues related to this duplication, and
2936other peculiarities specific to the @code{iconv} library.  The plan is to
2937remove duplications and better merge specificities, as @code{recode} evolves.
2938
2939As implemented, if a recoding request can be satisfied by the @code{recode}
2940library both with and without its @code{iconv} library part, it is likely
that the @code{iconv} library will be used.  To sort out whether the
@code{iconv} library is indeed used or not, just use the @samp{-v} or
@samp{--verbose} option (@pxref{Recoding}).
2944
2945@tindex libiconv
2946The @code{:libiconv:} charset represents a conceptual pivot charset
2947within the @code{iconv} part of the @code{recode} library (in fact,
2948this pivot exists, but is not directly reachable).  This charset has a
2949mere @code{:} (a colon) for an alias.  It is not allowed to recode from
2950or to this charset directly.  But when this charset is selected as an
2951intermediate, usually by automatic means, then the @code{iconv} part
2952of the @code{recode} library is called to handle the transformations.
2953By using an @samp{--ignore=:libiconv:} option on the @code{recode} call
2954or equivalently, but more simply, @samp{-x:}, @code{recode} is instructed
2955to fully avoid this charset as an intermediate, with the consequence that
2956the @code{iconv} part of the library is defeated.  Consider these two calls:
2957
2958@example
2959recode l1..1250 < @var{input} > @var{output}
2960recode -x: l1..1250 < @var{input} > @var{output}
2961@end example
2962
2963@noindent
2964Both should transform @var{input} from @code{ISO-8859-1} to @code{CP1250}
2965on @var{output}.  The first call uses the @code{iconv} part of the library,
2966while the second call avoids it.  Whatever the path used, the results should
2967normally be identical.  However, there might be observable differences.
2968Most of them might result from reversibility issues, as the @code{iconv}
2969engine, which the @code{recode} library directly uses for the time being,
does not address reversibility.  Even if much less likely, some differences
might result from slight errors in the tables used; such differences should
then be reported as bugs.
2973
2974Other irregularities might be seen in the area of error detection and
2975recovery.  The @code{recode} library usually tries to detect canonicity
2976errors in input, and production of ambiguous output, but the @code{iconv}
2977part of the library currently does not.  Input is always validated, however.
2978The @code{recode} library may not always react properly when its @code{iconv}
2979part has no translation for a given character.
2980
2981Within a collection of names for a single charset, the @code{recode}
2982library distinguishes one of them as being the genuine charset name,
2983while the others are said to be aliases.  When @code{recode} lists all
2984charsets, for example with the @samp{-l} or @samp{--list} option, the list
2985integrates all @code{iconv} library charsets.  The selection of one of the
aliases as the genuine charset name is an artifact added by @code{recode};
it does not come from @code{iconv}.  Moreover, the @code{recode} library
2988dynamically resolves some conflicts when it initialises itself at runtime.
2989This might explain some discrepancies in the table below, as for what is
2990the genuine charset name.
2991
2992@include libiconv.texi
2993
2994@node Tabular, ASCII misc, libiconv, Top
2995@chapter Tabular sources (@w{RFC 1345})
2996
2997@cindex RFC 1345
2998@cindex character mnemonics, documentation
2999@cindex @code{chset} tools
3000An important part of the tabular charset knowledge in @code{recode}
3001comes from @w{RFC 1345} or, alternatively, from the @code{chset} tools,
3002both maintained by Keld Simonsen.  The @w{RFC 1345} document:
3003
3004@quotation
3005``Character Mnemonics & Character Sets'', K. Simonsen, Request for
3006Comments no. 1345, Network Working Group, June 1992.
3007@end quotation
3008
3009@noindent
3010@cindex deviations from RFC 1345
3011defines many character mnemonics and character sets.  The @code{recode}
3012library implements most of @w{RFC 1345}, however:
3013
3014@itemize @bullet
3015@item
3016@tindex dk-us@r{, not recognised by }recode
3017@tindex us-dk@r{, not recognised by }recode
3018It does not recognise those charsets which overload character positions:
@code{dk-us} and @code{us-dk}.  However, see @ref{Mixed}.
3020
3021@item
3022@tindex ANSI_X3.110-1983@r{, not recognised by }recode
3023@tindex ISO_6937-2-add@r{, not recognised by }recode
3024@tindex T.101-G2@r{, not recognised by }recode
3025@tindex T.61-8bit@r{, not recognised by }recode
3026@tindex iso-ir-90@r{, not recognised by }recode
3027It does not recognise those charsets which combine two characters for
3028representing a third: @code{ANSI_X3.110-1983}, @code{ISO_6937-2-add},
3029@code{T.101-G2}, @code{T.61-8bit}, @code{iso-ir-90} and
3030@code{videotex-suppl}.
3031
3032@item
3033@tindex GB_2312-80@r{, not recognised by }recode
3034@tindex JIS_C6226-1978@r{, not recognised by }recode
3035@tindex JIS_X0212-1990@r{, not recognised by }recode
3036@tindex KS_C_5601-1987@r{, not recognised by }recode
It does not recognise 16-bit charsets: @code{GB_2312-80},
3038@code{JIS_C6226-1978}, @code{JIS_C6226-1983}, @code{JIS_X0212-1990} and
3039@code{KS_C_5601-1987}.
3040
3041@item
3042@tindex isoir91
3043@tindex isoir92
3044It interprets the charset @code{isoir91} as @code{NATS-DANO} (alias
3045@code{iso-ir-9-1}), @emph{not} as @code{JIS_C6229-1984-a} (alias
3046@code{iso-ir-91}).  It also interprets the charset @code{isoir92}
3047as @code{NATS-DANO-ADD} (alias @code{iso-ir-9-2}), @emph{not} as
@code{JIS_C6229-1984-b} (alias @code{iso-ir-92}).  It might be better
to just avoid these two alias names.
3050@end itemize
3051
3052Keld Simonsen @email{keld@@dkuug.dk} did most of @w{RFC 1345} himself, with
3053some funding from Danish Standards and Nordic standards (INSTA) project.
3054He also did the character set design work, with substantial input from
Olle Jaernefors.  Keld typed in almost all of the tables; some have been
3056contributed.  A number of people have checked the tables in various
3057ways.  The RFC lists a number of people who helped.
3058
3059@cindex @code{recode}, and RFC 1345
Keld and the @code{recode} maintainer have an arrangement by which any newly
discovered information submitted by @code{recode} users, about tabular
charsets, is forwarded to Keld, eventually merged into Keld's work,
and only then reimported into @code{recode}.  Neither the @code{recode}
program nor its library tries to compete, nor even to establish itself as
an alternate or diverging reference: @w{RFC 1345} and its new drafts stay the
3066genuine source for most tabular information conveyed by @code{recode}.
3067Keld has been more than collaborative so far, so there is no reason that
3068we act otherwise.  In a word, @code{recode} should be perceived as the
3069application of external references, but not as a reference in itself.
3070
3071@tindex RFC1345@r{, a charset, and its aliases}
3072@tindex 1345
3073@tindex mnemonic@r{, an alias for RFC1345 charset}
Internally, @w{RFC 1345} associates with each character an unambiguous
mnemonic of a few characters, taken from @w{ISO 646}, which is a minimal
ASCII subset of 83 characters.  The charset made up by these mnemonics
is available in @code{recode} under the name @code{RFC1345}.  It has
@code{mnemonic} and @code{1345} for aliases.  As implemented, this charset
exactly corresponds to @code{mnemonic+ascii+38}, using @w{RFC 1345}
nomenclature.  Roughly said, @w{ISO 646} characters represent themselves,
except for the ampersand (@kbd{&}) which appears doubled.  A prefix of a
single ampersand introduces a mnemonic.  For mnemonics using two characters,
the prefix is immediately followed by the mnemonic.  For longer mnemonics,
the prefix is followed by an underline (@kbd{_}), the mnemonic, and another
underline.  Conversions to this charset are usually reversible.
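
The quoting rules just described can be sketched as follows in Python.
The small @code{MNEMONICS} table is purely illustrative (in particular,
@samp{snowman} is a made-up long mnemonic), and letting every 7-bit
character represent itself is a simplification of the 83-character
@w{ISO 646} subset; the real tables are those of @w{RFC 1345}.

```python
# Illustrative table only; the real mnemonics are defined by RFC 1345.
MNEMONICS = dict([
    ("\u00e9", "e'"),        # two-character mnemonic
    ("\u00e0", "a!"),        # two-character mnemonic
    ("\u2603", "snowman"),   # hypothetical longer mnemonic
])


def to_rfc1345(text):
    """Encode text with ampersand-introduced mnemonics."""
    out = []
    for char in text:
        if char == "&":
            out.append("&&")                       # the ampersand appears doubled
        elif char in MNEMONICS:
            mnemonic = MNEMONICS[char]
            if len(mnemonic) == 2:
                out.append("&" + mnemonic)         # prefix, then the mnemonic
            else:
                out.append("&_" + mnemonic + "_")  # underlines around longer ones
        elif ord(char) < 128:
            out.append(char)                       # represents itself (simplified)
        else:
            raise ValueError("no mnemonic known for U+%04X" % ord(char))
    return "".join(out)
```

Because a literal ampersand is always doubled, a decoder can tell it apart
from the single ampersand which introduces a mnemonic.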
3086
3087Currently, @code{recode} does not offer any of the many other possible
3088variations of this family of representations.  They will likely be
3089implemented in some future version, however.
3090
3091@table @code
3092@include rfc1345.texi
3093@end table
3094
3095@node ASCII misc, IBM and MS, Tabular, Top
3096@chapter ASCII and some derivatives
3097
3098@menu
3099* ASCII::               Usual ASCII
3100* ISO 8859::            ASCII extended by Latin Alphabets
3101* ASCII-BS::            ASCII 7-bits, @kbd{BS} to overstrike
3102* flat::                ASCII without diacritics nor underline
3103@end menu
3104
3105@node ASCII, ISO 8859, ASCII misc, ASCII misc
3106@section Usual ASCII
3107
3108@tindex ASCII@r{, an alias for the }ANSI_X3.4-1968@r{ charset}
3109@tindex ANSI_X3.4-1968@r{, and its aliases}
3110@tindex IBM367
3111@tindex US-ASCII
3112@tindex cp367
3113@tindex iso-ir-6
3114@tindex us
3115This charset is available in @code{recode} under the name @code{ASCII}.
In fact, its true name is @code{ANSI_X3.4-1968} as per @w{RFC 1345},
3117accepted aliases being @code{ANSI_X3.4-1986}, @code{ASCII},
3118@code{IBM367}, @code{ISO646-US}, @code{ISO_646.irv:1991},
3119@code{US-ASCII}, @code{cp367}, @code{iso-ir-6} and @code{us}.  The
3120shortest way of specifying it in @code{recode} is @code{us}.
3121
3122@cindex ASCII table, recreating with @code{recode}
3123This documentation used to include ASCII tables.  They have been removed
3124since the @code{recode} program can now recreate these easily:
3125
3126@example
3127recode -lf us                   for commented ASCII
3128recode -ld us                   for concise decimal table
3129recode -lo us                   for concise octal table
3130recode -lh us                   for concise hexadecimal table
3131@end example
3132
3133@node ISO 8859, ASCII-BS, ASCII, ASCII misc
3134@section ASCII extended by Latin Alphabets
3135
3136@cindex Latin charsets
3137There are many Latin charsets.  The following has been written by Tim
3138Lasko @email{lasko@@video.dec.com}, a long while ago:
3139
3140@quotation
3141ISO @w{Latin-1}, or more completely ISO Latin Alphabet No 1, is now an
3142international standard as of February 1987 (IS 8859, Part 1).  For those
3143American USEnet'rs that care, the 8-bit ASCII standard, which is essentially
3144the same code, is going through the final administrative processes prior
3145to publication.  ISO @w{Latin-1} (IS 8859/1) is actually one of an entire
3146family of eight-bit one-byte character sets, all having ASCII on the left
3147hand side, and with varying repertoires on the right hand side:
3148
3149@itemize @bullet
3150@item
3151Latin Alphabet No 1 (caters to Western Europe - now approved).
3152@item
3153Latin Alphabet No 2 (caters to Eastern Europe - now approved).
3154@item
3155Latin Alphabet No 3 (caters to SE Europe + others - in draft ballot).
3156@item
3157Latin Alphabet No 4 (caters to Northern Europe - in draft ballot).
3158@item
3159Latin-Cyrillic alphabet (right half all Cyrillic - processing currently
3160suspended pending USSR input).
3161@item
3162Latin-Arabic alphabet (right half all Arabic - now approved).
3163@item
3164Latin-Greek alphabet (right half Greek + symbols - in draft ballot).
3165@item
3166Latin-Hebrew alphabet (right half Hebrew + symbols - proposed).
3167@end itemize
3168@end quotation
3169
3170@tindex Latin-1
3171The ISO Latin Alphabet 1 is available as a charset in @code{recode} under
the name @code{Latin-1}.  In fact, its true name is @code{ISO_8859-1:1987}
3173as per @w{RFC 1345}, accepted aliases being @code{CP819}, @code{IBM819},
3174@code{ISO-8859-1}, @code{ISO_8859-1}, @code{iso-ir-100}, @code{l1}
3175and @code{Latin-1}.  The shortest way of specifying it in @code{recode}
3176is @code{l1}.
3177
3178@cindex Latin-1 table, recreating with @code{recode}
3179It is an eight-bit code which coincides with ASCII for the lower half.
3180This documentation used to include @w{Latin-1} tables.  They have been removed
3181since the @code{recode} program can now recreate these easily:
3182
3183@example
3184recode -lf l1                   for commented ISO Latin-1
3185recode -ld l1                   for concise decimal table
3186recode -lo l1                   for concise octal table
3187recode -lh l1                   for concise hexadecimal table
3188@end example
3189
3190@node ASCII-BS, flat, ISO 8859, ASCII misc
3191@section ASCII 7-bits, @kbd{BS} to overstrike
3192
3193@tindex ASCII-BS@r{, and its aliases}
3194@tindex BS@r{, an alias for }ASCII-BS@r{ charset}
3195This charset is available in @code{recode} under the name
3196@code{ASCII-BS}, with @code{BS} as an acceptable alias.
3197
3198@cindex diacritics, with @code{ASCII-BS} charset
3199The file is straight ASCII, seven bits only.  According to the definition
3200of ASCII, diacritics are applied by a sequence of three characters: the
3201letter, one @kbd{BS}, the diacritic mark.  We deviate slightly from this
by exchanging the diacritic mark and the letter so that, on a screen device,
the diacritic will disappear and leave the letter alone.  At recognition time,
3204both methods are acceptable.
3205
3206The French quotes are coded by the sequences: @w{@kbd{< BS "}} or @w{@kbd{"
3207BS <}} for the opening quote and @w{@kbd{> BS "}} or @w{@kbd{" BS >}}
3208for the closing quote.  This artificial convention was inherited in
3209straight @code{ASCII-BS} from habits around @code{Bang-Bang} entry, and
3210is not well known.  But we decided to stick to it so that @code{ASCII-BS}
3211charset will not lose French quotes.
3212
3213The @code{ASCII-BS} charset is independent of @code{ASCII}, and
different.  The following examples demonstrate this, knowing in advance
3215that @samp{!2} is the @code{Bang-Bang} way of representing an @kbd{e}
3216with an acute accent.  Compare:
3217
3218@example
3219% echo \!2 | recode -v bang..l1/d
3220Request: Bang-Bang..ISO-8859-1/Decimal-1
3221233,  10
3222@end example
3223
3224@noindent
3225with:
3226
3227@example
3228% echo \!2 | recode -v bang..bs/d
3229Request: Bang-Bang..ISO-8859-1..ASCII-BS/Decimal-1
3230 39,   8, 101,  10
3231@end example
3232
3233In the first case, the @kbd{e} with an acute accent is merely
3234transmitted by the @code{Latin-1..ASCII} mapping, not having a special
3235recoding rule for it.  In the @code{Latin-1..ASCII-BS} case, the acute
3236accent is applied over the @kbd{e} with a backspace: diacriticised
3237characters have special rules.  For the @code{ASCII-BS} charset,
3238reversibility is still possible, but there might be difficult cases.
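
The overstriking convention can be illustrated by the following Python
sketch, which writes the diacritic mark first, then @kbd{BS}, then the
letter.  Only the apostrophe for the acute accent is confirmed by the
example above; the other entries of the @code{MARKS} table are assumptions
made for the illustration, and French quotes are not handled.

```python
import unicodedata

# Combining mark to ASCII stand-in.  Only the acute accent is taken from
# the example above; the other entries are assumptions.
MARKS = dict([
    ("\u0301", "'"),    # acute
    ("\u0300", "`"),    # grave
    ("\u0302", "^"),    # circumflex
    ("\u0308", '"'),    # diaeresis
    ("\u0303", "~"),    # tilde
    ("\u0327", ","),    # cedilla
])


def to_ascii_bs(text):
    """Encode accented letters as: diacritic mark, BS, letter."""
    out = []
    for char in text:
        parts = unicodedata.normalize("NFD", char)   # letter + combining mark
        if len(parts) == 2 and parts[1] in MARKS and ord(parts[0]) < 128:
            out.append(MARKS[parts[1]] + "\b" + parts[0])
        elif ord(char) < 128:
            out.append(char)                         # plain ASCII goes through
        else:
            raise ValueError("cannot represent U+%04X" % ord(char))
    return "".join(out)
```

For an @kbd{e} with an acute accent followed by a newline, this yields the
decimal values 39, 8, 101 and 10, as in the second example above.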
3239
3240@node flat,  , ASCII-BS, ASCII misc
3241@section ASCII without diacritics nor underline
3242@tindex flat@r{, a charset}
3243
3244This charset is available in @code{recode} under the name @code{flat}.
3245
3246@cindex diacritics and underlines, removing
3247@cindex removing diacritics and underlines
3248This code is ASCII expunged of all diacritics and underlines, as long as
3249they are applied using three character sequences, with @kbd{BS} in the
middle.  Also, although slightly unrelated, each control character is
3251represented by a sequence of two or three graphic characters.  The newline
3252character, however, keeps its functionality and is not represented.
3253
3254Note that charset @code{flat} is a terminal charset.  We can convert
3255@emph{to} @code{flat}, but not @emph{from} it.
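
The removal of overstrikes may be pictured with the Python sketch below.
It assumes that, in each three-character sequence, the letter or digit is
the character to keep, whichever side of @kbd{BS} it stands on, and it
leaves out the rendering of control characters as graphic sequences; the
actual @code{flat} rules in @code{recode} may differ.

```python
import re

# Any character, one BS (code 8), any character.
OVERSTRIKE = re.compile("(.)\x08(.)", re.DOTALL)


def flatten(text):
    """Drop diacritics and underlines applied through BS overstrikes."""
    def keep_letter(match):
        first, second = match.group(1), match.group(2)
        # Assumption: keep whichever side is a letter or a digit.
        return first if first.isalnum() else second
    return OVERSTRIKE.sub(keep_letter, text)
```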
3256
3257@node IBM and MS, CDC, ASCII misc, Top
3258@chapter Some IBM or Microsoft charsets
3259
3260@cindex IBM codepages
3261@cindex codepages
3262The @code{recode} program provides various IBM or Microsoft code pages
3263(@pxref{Tabular}).  An easy way to find them all at once out of the
3264@code{recode} program itself is through the command:
3265
3266@example
3267recode -l | egrep -i '(CP|IBM)[0-9]'
3268@end example
3269
3270@noindent
See also the few special charsets presented in the following sections.
3272
3273@menu
3274* EBCDIC::              EBCDIC codes
3275* IBM-PC::              IBM's PC code
3276* Icon-QNX::            Unisys' Icon code
3277@end menu
3278
3279@node EBCDIC, IBM-PC, IBM and MS, IBM and MS
3280@section EBCDIC code
3281
3282@cindex EBCDIC charsets
This charset is IBM's Extended Binary Coded Decimal Interchange Code.
This is an eight-bit code.  The following three variants were
implemented in @code{recode} independently of @w{RFC 1345}:
3286
3287@table @code
3288@item EBCDIC
3289@tindex EBCDIC@r{, a charset}
3290In @code{recode}, the @code{us..ebcdic} conversion is identical to @samp{dd
3291conv=ebcdic} conversion, and @code{recode} @code{ebcdic..us} conversion is
3292identical to @samp{dd conv=ascii} conversion.  This charset also represents
the way Control Data Corporation relates EBCDIC to 8-bit ASCII.
3294
3295@item EBCDIC-CCC
3296@tindex EBCDIC-CCC
3297In @code{recode}, the @code{us..ebcdic-ccc} or @code{ebcdic-ccc..us}
3298conversions represent the way Concurrent Computer Corporation (formerly
Perkin Elmer) relates EBCDIC to 8-bit ASCII.
3300
3301@item EBCDIC-IBM
3302@tindex EBCDIC-IBM
3303In @code{recode}, the @code{us..ebcdic-ibm} conversion is @emph{almost}
3304identical to the GNU @samp{dd conv=ibm} conversion.  Given the exact
3305@samp{dd conv=ibm} conversion table, @code{recode} once said:
3306
3307@example
3308Codes  91 and 213 both recode to 173
3309Codes  93 and 229 both recode to 189
3310No character recodes to  74
3311No character recodes to 106
3312@end example
3313
3314So I arbitrarily chose to recode 213 by 74 and 229 by 106.  This makes the
3315@code{EBCDIC-IBM} recoding reversible, but this is not necessarily the best
3316correction.  In any case, I think that GNU @code{dd} should be amended.
3317@code{dd} and @code{recode} should ideally agree on the same correction.
3318So, this table might change once again.
3319@end table
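
The kind of check which produced the diagnostics quoted above for
@code{EBCDIC-IBM} can be sketched in Python.  A recoding table is
reversible only if it is a permutation of the 256 codes, so the sketch
looks for codes sharing a target and for targets never reached.  The
tables built here are synthetic and merely reproduce the four reported
defects; they are not the real @samp{dd conv=ibm} table.

```python
def table_defects(table):
    """Return (collisions, unreached) for a 256-entry recoding table."""
    sources = dict()
    for code, target in enumerate(table):
        sources.setdefault(target, []).append(code)
    collisions = [(codes, target)
                  for target, codes in sorted(sources.items())
                  if len(codes) > 1]
    unreached = [target for target in range(256) if target not in sources]
    return collisions, unreached


# Synthetic reversible table: the identity with a few pairs exchanged.
fixed = list(range(256))
for a, b in [(91, 173), (93, 189), (213, 74), (229, 106)]:
    fixed[a], fixed[b] = fixed[b], fixed[a]

# Synthetic defective table, before the arbitrary correction.
broken = list(fixed)
broken[213] = 173
broken[229] = 189

for codes, target in table_defects(broken)[0]:
    print("Codes %3d and %3d both recode to %3d" % (codes[0], codes[1], target))
for target in table_defects(broken)[1]:
    print("No character recodes to %3d" % target)
```

A table for which both returned lists are empty can be inverted entry by
entry to obtain the reverse recoding.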
3320
3321@w{RFC 1345} brings into @code{recode} 15 other EBCDIC charsets, and 21 other
3322charsets having EBCDIC in at least one of their alias names.  You can
3323get a list of all these by executing:
3324
3325@example
3326recode -l | grep -i ebcdic
3327@end example
3328
3329Note that @code{recode} may convert a pure stream of EBCDIC characters,
3330but it does not know how to handle binary data between records which
3331is sometimes used to delimit them and build physical blocks.  If end of
3332lines are not marked, fixed record size may produce something readable,
3333but @code{VB} or @code{VBS} blocking is likely to yield some garbage in
3334the converted results.
3335
3336@node IBM-PC, Icon-QNX, EBCDIC, IBM and MS
3337@section IBM's PC code
3338
3339@tindex IBM-PC
3340@cindex MS-DOS charsets
3341@tindex MSDOS
3342@tindex dos
3343@tindex pc
3344This charset is available in @code{recode} under the name @code{IBM-PC},
3345with @code{dos}, @code{MSDOS} and @code{pc} as acceptable aliases.
3346The shortest way of specifying it in @code{recode} is @code{pc}.
3347
3348The charset is aimed towards a PC microcomputer from IBM or any compatible.
This is an eight-bit code.  This charset is fairly old in @code{recode};
its tables were produced a long while ago by mere inspection of a printed
chart of the IBM-PC codes and glyphs.
3352
3353It has @code{CR-LF} as its implied surface.  This means that, if the original
3354end of lines have to be preserved while going out of @code{IBM-PC}, they
3355should currently be added back through the usage of a surface on the other
3356charset, or better, just never removed.  Here are examples for both cases:
3357
3358@example
3359recode pc..l2/cl < @var{input} > @var{output}
3360recode pc/..l2 < @var{input} > @var{output}
3361@end example
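
The effect of the implied surface can be pictured with this Python sketch,
in which the handling of end of lines is kept apart from the mapping of
characters.  It relies on Python's own @code{cp437} and @code{iso8859_2}
codecs, whose tables are not guaranteed to match the @code{IBM-PC} and
@code{Latin-2} tables of @code{recode}; the function name and its
@code{keep_crlf} parameter are invented for the example.

```python
def pc_to_latin2(data, keep_crlf=False):
    """Convert PC-coded bytes to Latin-2, with or without CR-LF line ends."""
    text = data.decode("cp437")              # character mapping only
    if not keep_crlf:
        text = text.replace("\r\n", "\n")    # remove the CR-LF surface
    return text.encode("iso8859_2")
```

With @code{keep_crlf} left false, the result is comparable to a plain
@samp{recode pc..l2}; setting it to true keeps the original end of lines,
which is the purpose of both commands shown above.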
3362
3363@w{RFC 1345} brings into @code{recode} 44 @samp{IBM} charsets or code pages,
and also 8 other code pages.  You can get a list of all these by
executing:@footnote{On DOS/Windows, stock shells do not know that apostrophes
quote special characters like @kbd{|}, so one needs to use double quotes
instead of apostrophes.}
3368
3369@example
3370recode -l | egrep -i '(CP|IBM)[0-9]'
3371@end example
3372
3373@noindent
3374@cindex CR-LF surface, in IBM-PC charsets
3375@tindex IBM819@r{, and CR-LF surface}
All charsets or aliases beginning with letters @samp{CP} or @samp{IBM}
also have @code{CR-LF} as their implied surface.  The same is true for a
purely numeric alias in the same family.  For example, all of @code{819},
@code{CP819} and @code{IBM819} imply @code{CR-LF} as a surface.  Note that
@code{ISO-8859-1} does @emph{not} imply a surface, even though it shares the
same tabular data as @code{819}.
3382
3383@tindex ibm437
3384There are a few discrepancies between this @code{IBM-PC} charset and the
3385very similar @w{RFC 1345} charset @code{ibm437}, which have not been analysed
3386yet, so the charsets are being kept separate for now.  This might change in
3387the future, and the @code{IBM-PC} charset might disappear.  Wizards would
3388be interested in comparing the output of these two commands:
3389
3390@example
3391recode -vh IBM-PC..Latin-1
3392recode -vh IBM437..Latin-1
3393@end example
3394
3395@noindent
The first command uses the charset prior to @w{RFC 1345} introduction.
The two commands give different recodings.  These differences are annoying;
the fuzziness will have to be explained and settled one day.
3399
3400@node Icon-QNX,  , IBM-PC, IBM and MS
3401@section Unisys' Icon code
3402
3403@tindex Icon-QNX@r{, and aliases}
3404@tindex QNX@r{, an alias for a charset}
3405This charset is available in @code{recode} under the name
3406@code{Icon-QNX}, with @code{QNX} as an acceptable alias.
3407
3408The file is using Unisys' Icon way to represent diacritics with code 25
3409escape sequences, under the system QNX.  This is a seven-bit code, even
3410if eight-bit codes can flow through as part of IBM-PC charset.
3411
3412@node CDC, Micros, IBM and MS, Top
3413@chapter Charsets for CDC machines
3414
3415@cindex CDC charsets
3416@cindex charsets for CDC machines
3417What is now @code{recode} evolved out, through many transformations
3418really, from a set of programs which were originally written in
3419@dfn{COMPASS}, Control Data Corporation's assembler, with bits in FORTRAN,
and later rewritten in CDC 6000 Pascal.  The CDC heritage shows in the
fact that some old CDC charsets are still supported.
3422
3423The @code{recode} author used to be familiar with CDC Scope-NOS/BE and
3424Kronos-NOS, and many CDC formats.  Reading CDC tapes directly on other
3425machines is often a challenge, and @code{recode} does not always solve
3426it.  It helps having tapes created in coded mode instead of binary mode,
3427and using @code{S} (Stranger) tapes instead of @code{I} (Internal) tapes.
3428ANSI labels and multi-file tapes might be the source of trouble.  There are
3429ways to handle a few Cyber Record Manager formats, but some of them might
3430be quite difficult to decode properly after the transfer is done.
3431
3432The @code{recode} program is usable only for a small subset of NOS text
3433formats, and surely not with binary textual formats, like @code{UPDATE}
3434or @code{MODIFY} sources, for example.  @code{recode} is not especially
suited for reading 8/12 or 56/60 packing, yet this could easily be arranged
if there was a demand for it.  It does not have the ability to translate
3437Display Code directly, as the ASCII conversion implied by tape drivers
3438or FTP does the initial approximation.  @code{recode} can decode 6/12
3439caret notation over Display Code already mapped to ASCII.
3440
3441@menu
3442* Display Code::        Control Data's Display Code
3443* CDC-NOS::             ASCII 6/12 from NOS
3444* Bang-Bang::           ASCII ``bang bang''
3445@end menu
3446
3447@node Display Code, CDC-NOS, CDC, CDC
3448@section Control Data's Display Code
3449
3450@cindex CDC Display Code, a table
3451This code is not available in @code{recode}, but repeated here for
3452reference.  This is a 6-bit code used on CDC mainframes.
3453
3454@example
3455Octal display code to graphic       Octal display code to octal ASCII
3456
345700  :    20  P    40  5   60  #     00 072  20 120  40 065  60 043
345801  A    21  Q    41  6   61  [     01 101  21 121  41 066  61 133
345902  B    22  R    42  7   62  ]     02 102  22 122  42 067  62 135
346003  C    23  S    43  8   63  %     03 103  23 123  43 070  63 045
346104  D    24  T    44  9   64  "     04 104  24 124  44 071  64 042
346205  E    25  U    45  +   65  _     05 105  25 125  45 053  65 137
346306  F    26  V    46  -   66  !     06 106  26 126  46 055  66 041
346407  G    27  W    47  *   67  &     07 107  27 127  47 052  67 046
346510  H    30  X    50  /   70  '     10 110  30 130  50 057  70 047
346611  I    31  Y    51  (   71  ?     11 111  31 131  51 050  71 077
346712  J    32  Z    52  )   72  <     12 112  32 132  52 051  72 074
346813  K    33  0    53  $   73  >     13 113  33 060  53 044  73 076
346914  L    34  1    54  =   74  @@     14 114  34 061  54 075  74 100
347015  M    35  2    55      75  \     15 115  35 062  55 040  75 134
347116  N    36  3    56  ,   76  ^     16 116  36 063  56 054  76 136
347217  O    37  4    57  .   77  ;     17 117  37 064  57 056  77 073
3473@end example
3474
3475In older times, @kbd{:} used octal 63, and octal 0 was not a character.
3476The table above shows the ASCII glyph interpretation of codes 60 to 77,
3477yet these 16 codes were once defined differently.
3478
There is no explicit end of line in Display Code, and the Cyber Record
Manager introduced many new ways to represent ends of lines, the traditional
end of line being reachable by setting @code{RT} to @samp{Z}.  If 6-bit bytes
in a file are sequentially counted from 1, a traditional end of line
exists if bytes 10*@var{n}+9 and 10*@var{n}+10 are both zero for a
given @var{n}, in which case these two bytes are not to be interpreted as
@kbd{::}.  Also, up to 9 immediately preceding zero bytes, going backward,
are to be considered as part of the end of line and not interpreted as
@kbd{:}@footnote{This convention replaced an older one saying that up to 4
immediately preceding @emph{pairs} of zero bytes, going backward, are to
be considered as part of the end of line and not interpreted as @kbd{::}.}.
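
The traditional end-of-line rule can be sketched in Python as follows.
This is only an illustration and not code taken from @code{recode}, which
does not implement Display Code.  The helper name @code{decode_display} and
the choice of returning any unterminated text separately are assumptions
made for this sketch; the string of graphics is transcribed from the
table above.

```python
# Display Code graphics, indexed by 6-bit code (octal 00 to 77), transcribed
# from the table in this section.
DISPLAY = ":ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789+-*/()$= ,.#[]%\"_!&'?<>@\\^;"


def decode_display(codes):
    """Split a list of 6-bit codes into text lines (hypothetical helper).

    Returns (lines, tail), where tail is whatever text remains after the
    last end of line found.
    """
    lines = []
    start = 0
    # Bytes are counted from 1 in the text, so bytes 10n+9 and 10n+10 are
    # the last two entries of each group of ten codes.
    for group_end in range(10, len(codes) + 1, 10):
        if codes[group_end - 2] != 0 or codes[group_end - 1] != 0:
            continue
        end = group_end - 2
        # Up to 9 immediately preceding zero bytes belong to the end of line.
        swallowed = 0
        while end > start and codes[end - 1] == 0 and swallowed < 9:
            end -= 1
            swallowed += 1
        lines.append("".join(DISPLAY[c] for c in codes[start:end]))
        start = group_end
    tail = "".join(DISPLAY[c] for c in codes[start:])
    return lines, tail
```

With this sketch, the codes 8, 5, 12, 12, 15 followed by five zero bytes
decode to the single line @samp{HELLO}.  A zero byte that is not part of
an end of line is still rendered as @kbd{:}.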
3490
3491@node CDC-NOS, Bang-Bang, Display Code, CDC
3492@section ASCII 6/12 from NOS
3493
3494@tindex CDC-NOS@r{, and its aliases}
3495@tindex NOS
3496This charset is available in @code{recode} under the name
3497@code{CDC-NOS}, with @code{NOS} as an acceptable alias.
3498
3499@cindex NOS 6/12 code
3500@cindex caret ASCII code
This is one of the charsets in use on CDC Cyber NOS systems to represent
ASCII, sometimes named @dfn{NOS 6/12} code for coding ASCII.  This code is
also known as @dfn{caret ASCII}.  It is based on a six-bit character set
in which small letters and control characters are coded using a @kbd{^}
escape and, sometimes, a @kbd{@@} escape.

The routines given here presume that the six-bit code is already expressed
in ASCII by the communication channel, with embedded ASCII @kbd{^} and
@kbd{@@} escapes.
3510
3511Here is a table showing which characters are being used to encode each
3512ASCII character.
3513
3514@example
3515000  ^5  020  ^#  040     060  0  100 @@A  120  P  140  @@G  160  ^P
3516001  ^6  021  ^[  041  !  061  1  101  A  121  Q  141  ^A  161  ^Q
3517002  ^7  022  ^]  042  "  062  2  102  B  122  R  142  ^B  162  ^R
3518003  ^8  023  ^%  043  #  063  3  103  C  123  S  143  ^C  163  ^S
3519004  ^9  024  ^"  044  $  064  4  104  D  124  T  144  ^D  164  ^T
3520005  ^+  025  ^_  045  %  065  5  105  E  125  U  145  ^E  165  ^U
3521006  ^-  026  ^!  046  &  066  6  106  F  126  V  146  ^F  166  ^V
3522007  ^*  027  ^&  047  '  067  7  107  G  127  W  147  ^G  167  ^W
3523010  ^/  030  ^'  050  (  070  8  110  H  130  X  150  ^H  170  ^X
3524011  ^(  031  ^?  051  )  071  9  111  I  131  Y  151  ^I  171  ^Y
3525012  ^)  032  ^<  052  *  072 @@D  112  J  132  Z  152  ^J  172  ^Z
3526013  ^$  033  ^>  053  +  073  ;  113  K  133  [  153  ^K  173  ^0
3527014  ^=  034  ^@@  054  ,  074  <  114  L  134  \  154  ^L  174  ^1
3528015  ^   035  ^\  055  -  075  =  115  M  135  ]  155  ^M  175  ^2
3529016  ^,  036  ^^  056  .  076  >  116  N  136 @@B  156  ^N  176  ^3
3530017  ^.  037  ^;  057  /  077  ?  117  O  137  _  157  ^O  177  ^4
3531@end example
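
The regularities of the table above can be captured in a few lines of
Python.  This is a sketch derived from the table only, not the
@code{recode} implementation; the names @code{nos_encode} and
@code{nos_decode} are hypothetical, and the @code{DISPLAY} string is the
Display Code graphic set from the previous section.

```python
# Display Code graphics, indexed by 6-bit code (octal 00 to 77).
DISPLAY = ":ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789+-*/()$= ,.#[]%\"_!&'?<>@\\^;"

# The four characters written with an '@' escape in the table.
AT_ESCAPES = {"@": "@A", "^": "@B", ":": "@D", "`": "@G"}


def nos_encode_char(ch):
    """Encode one 7-bit ASCII character following the table above."""
    code = ord(ch)
    if code > 0o177:
        raise ValueError("not a 7-bit ASCII character")
    if ch in AT_ESCAPES:
        return AT_ESCAPES[ch]
    if code < 0o40:
        # Control characters: '^' plus the graphic for Display Code 40..77.
        return "^" + DISPLAY[code + 0o40]
    if code >= 0o141:
        # Small letters and 173..177: '^' plus the graphic for 01..37.
        return "^" + DISPLAY[code - 0o140]
    return ch


def nos_encode(text):
    return "".join(nos_encode_char(ch) for ch in text)


# Reverse table, built from the encoder so both directions agree.
_DECODE = {nos_encode_char(chr(c)): chr(c) for c in range(0o200)}


def nos_decode(text):
    """Decode caret ASCII; raises KeyError on an unknown escape pair."""
    out = []
    i = 0
    while i < len(text):
        width = 2 if text[i] in "^@" else 1
        out.append(_DECODE[text[i:i + width]])
        i += width
    return "".join(out)
```

For example, @code{nos_encode("Hello")} yields @samp{H^E^L^L^O}.  Because
a bare @kbd{^} or @kbd{@@} never appears in encoded text, the decoder can
always tell an escape pair from a single character.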
3532
3533@node Bang-Bang,  , CDC-NOS, CDC
3534@section ASCII ``bang bang''
3535
3536@tindex Bang-Bang
3537This charset is available in @code{recode} under the name @code{Bang-Bang}.
3538
3539This code, in use on Cybers at Universit@'e de Montr@'eal mainly, served
3540to code a lot of French texts.  The original name of this charset is
3541@dfn{ASCII cod@'e Display}.  This code is also known as @dfn{Bang-bang}.
It is based on a six-bit character set in which capitals, French
diacritics and a few others are coded using an @kbd{!} escape followed
by a single character, and control characters using a double @kbd{!}
escape followed by a single character.

The routines given here presume that the six-bit code is already expressed
in ASCII by the communication channel, with embedded ASCII @kbd{!}
escapes.
3550
3551Here is a table showing which characters are being used to encode each
3552ASCII character.
3553
3554@example
3555000 !!@@  020 !!P  040    060 0  100 @@   120 !P  140 !@@ 160 P
3556001 !!A  021 !!Q  041 !" 061 1  101 !A  121 !Q  141 A  161 Q
3557002 !!B  022 !!R  042 "  062 2  102 !B  122 !R  142 B  162 R
3558003 !!C  023 !!S  043 #  063 3  103 !C  123 !S  143 C  163 S
3559004 !!D  024 !!T  044 $  064 4  104 !D  124 !T  144 D  164 T
3560005 !!E  025 !!U  045 %  065 5  105 !E  125 !U  145 E  165 U
3561006 !!F  026 !!V  046 &  066 6  106 !F  126 !V  146 F  166 V
3562007 !!G  027 !!W  047 '  067 7  107 !G  127 !W  147 G  167 W
3563010 !!H  030 !!X  050 (  070 8  110 !H  130 !X  150 H  170 X
3564011 !!I  031 !!Y  051 )  071 9  111 !I  131 !Y  151 I  171 Y
3565012 !!J  032 !!Z  052 *  072 :  112 !J  132 !Z  152 J  172 Z
3566013 !!K  033 !![  053 +  073 ;  113 !K  133 [   153 K  173 ![
3567014 !!L  034 !!\  054 ,  074 <  114 !L  134 \   154 L  174 !\
3568015 !!M  035 !!]  055 -  075 =  115 !M  135 ]   155 M  175 !]
3569016 !!N  036 !!^  056 .  076 >  116 !N  136 ^   156 N  176 !^
3570017 !!O  037 !!_  057 /  077 ?  117 !O  137 _   157 O  177 !_
3571@end example
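
The table above follows a few simple rules, sketched here in Python.
This is an illustration reconstructed from the table, not the
@code{recode} source; the names @code{bang_encode} and @code{bang_decode}
are hypothetical, and the French diacritics mentioned earlier are left out
because the table only covers ASCII.

```python
def bang_encode_char(ch):
    """Encode one 7-bit ASCII character following the table above."""
    code = ord(ch)
    if code > 0o177:
        raise ValueError("not a 7-bit ASCII character")
    if code < 0o40:
        return "!!" + chr(code + 0o100)   # control characters: !!@ .. !!_
    if ch == "!":
        return '!"'                       # the escape character itself
    if "A" <= ch <= "Z":
        return "!" + ch                   # capitals take a single escape
    if ch == "`":
        return "!@"
    if "a" <= ch <= "z":
        return ch.upper()                 # small letters are written bare
    if code >= 0o173:
        return "!" + chr(code - 0o40)     # 173..177 become ![ .. !_
    return ch


def bang_encode(text):
    return "".join(bang_encode_char(ch) for ch in text)


# Reverse table, built from the encoder so both directions agree.
_DECODE = {bang_encode_char(chr(c)): chr(c) for c in range(0o200)}


def bang_decode(text):
    """Decode Bang-Bang text; raises KeyError on an unknown escape."""
    out = []
    i = 0
    while i < len(text):
        if text.startswith("!!", i):
            width = 3
        elif text[i] == "!":
            width = 2
        else:
            width = 1
        out.append(_DECODE[text[i:i + width]])
        i += width
    return "".join(out)
```

For example, @code{bang_encode("Bonjour!")} yields @samp{!BONJOUR!"}, and a
newline becomes @samp{!!J}.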
3572
3573@node Micros, Miscellaneous, CDC, Top
3574@chapter Other micro-computer charsets
3575
3576@cindex NeXT charsets
3577The @code{NeXT} charset, which used to be especially provided in releases of
3578@code{recode} before 3.5, has been integrated since as one @w{RFC 1345} table.
3579
3580@menu
3581* Apple-Mac::           Apple's Macintosh code
3582* AtariST::             Atari ST code
3583@end menu
3584
3585@node Apple-Mac, AtariST, Micros, Micros
3586@section Apple's Macintosh code
3587
3588@tindex Apple-Mac
3589@cindex Macintosh charset
3590This charset is available in @code{recode} under the name @code{Apple-Mac}.
3591The shortest way of specifying it in @code{recode} is @code{ap}.
3592
The charset is aimed towards a Macintosh micro-computer from Apple.
This is an eight-bit code.  The file is the data fork only.  This charset
is fairly old in @code{recode}; its tables were produced a long while ago
by mere inspection of a printed chart of the Macintosh codes and glyphs.
3597
3598@cindex CR surface, in Macintosh charsets
3599It has @code{CR} as its implied surface.  This means that, if the original
3600end of lines have to be preserved while going out of @code{Apple-Mac}, they
3601should currently be added back through the usage of a surface on the other
3602charset, or better, just never removed.  Here are examples for both cases:
3603
3604@example
3605recode ap..l2/cr < @var{input} > @var{output}
3606recode ap/..l2 < @var{input} > @var{output}
3607@end example
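
To make the effect of the implied @code{CR} surface concrete, here is a
rough Python analogue of the two cases.  It relies on Python's own
@code{mac_roman} codec, whose table is an assumption here and may differ
from the @code{Apple-Mac} table of @code{recode}; the function name
@code{mac_to_latin1} is hypothetical.

```python
def mac_to_latin1(data, keep_cr=False):
    """Convert Macintosh-encoded bytes to Latin-1 bytes (illustrative only).

    With keep_cr=False, carriage returns are turned into newlines, which is
    roughly what removing the CR surface amounts to.  With keep_cr=True,
    the end of lines are left exactly as they were.
    """
    text = data.decode("mac_roman")
    if not keep_cr:
        text = text.replace("\r", "\n")
    # Characters with no Latin-1 equivalent are replaced by '?'.
    return text.encode("latin-1", errors="replace")
```

Here, byte 0x8E is taken to be e-acute, as in Python's table, so
@code{mac_to_latin1(b"caf\x8e\r")} gives @code{b"caf\xe9\n"}, while passing
@code{keep_cr=True} preserves the carriage return.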
3608
3609@w{RFC 1345} brings into @code{recode} 2 other Macintosh charsets.  You can
3610discover them by using @code{grep} over the output of @samp{recode -l}:
3611
3612@example
3613recode -l | grep -i mac
3614@end example
3615
3616@noindent
3617@tindex macintosh@r{, a charset, and its aliases}
3618@tindex macintosh_ce@r{, and its aliases}
3619@tindex mac
3620@tindex macce
3621Charsets @code{macintosh} and @code{macintosh_ce}, as well as their aliases
3622@code{mac} and @code{macce} also have @code{CR} as their implied surface.
3623
3624There are a few discrepancies between the @code{Apple-Mac} charset and
3625the very similar @w{RFC 1345} charset @code{macintosh}, which have not been
3626analysed yet, so the charsets are being kept separate for now.  This might
3627change in the future, and the @code{Apple-Mac} charset might disappear.
3628Wizards would be interested in comparing the output of these two commands:
3629
3630@example
3631recode -vh Apple-Mac..Latin-1
3632recode -vh macintosh..Latin-1
3633@end example
3634
3635@noindent
The first command uses the charset prior to @w{RFC 1345} introduction.
The two methods give different recodings.  These differences are annoying;
the fuzziness will have to be explained and settled one day.
3639
3640@cindex @code{recode}, a Macintosh port
3641As a side note, some people ask if there is a Macintosh port of the
3642@code{recode} program.  I'm not aware of any.  I presume that if the tool
3643fills a need for Macintosh users, someone will port it one of these days?
3644
3645@node AtariST,  , Apple-Mac, Micros
3646@section Atari ST code
3647
3648@tindex AtariST
3649This charset is available in @code{recode} under the name @code{AtariST}.
3650
3651This is the character set used on the Atari ST/TT/Falcon.  This is similar
3652to @code{IBM-PC}, but differs in some details: it includes some more accented
3653characters, the graphic characters are mostly replaced by Hebrew characters,
3654and there is a true German @kbd{sharp s} different from Greek @kbd{beta}.
3655
3656About the end-of-line conversions: the canonical end-of-line on the
3657Atari is @samp{\r\n}, but unlike @code{IBM-PC}, the OS makes no
3658difference between text and binary input/output; it is up to the
3659application how to interpret the data.  In fact, most of the libraries
3660that come with compilers can grok both @samp{\r\n} and @samp{\n} as end
3661of lines.  Many of the users who also have access to Unix systems prefer
@samp{\n} to ease porting Unix utilities.  So, for easing reversibility,
@code{recode} tries to leave @samp{\r} undisturbed through recodings.
3664
3665@node Miscellaneous, Surfaces, Micros, Top
3666@chapter Various other charsets
3667
3668Even if these charsets were originally added to @code{recode} for
3669handling texts written in French, they find other uses.  We did use them
3670a lot for writing French diacriticised texts in the past, so @code{recode}
3671knows how to handle these particularly well for French texts.
3672
3673@menu
3674* HTML::                World Wide Web representations
3675* LaTeX::               LaTeX macro calls
3676* Texinfo::             GNU project documentation files
3677* Vietnamese::
3678* African::             African charsets
3679* Others::
3680* Texte::               Easy French conventions
3681* Mule::                Mule as a multiplexed charset
3682@end menu
3683
3684@node HTML, LaTeX, Miscellaneous, Miscellaneous
3685@section World Wide Web representations
3686
3687@cindex HTML
3688@cindex SGML
3689@cindex XML
3690@cindex Web
3691@cindex World Wide Web
3692@cindex WWW
3693@cindex markup language
3694@cindex entities
3695@cindex character entities
3696@cindex character entity references
3697@cindex numeric character references
3698Character entities have been introduced by SGML and made widely popular
3699through HTML, the markup language in use for the World Wide Web, or Web or
3700WWW for short.  For representing @emph{unusual} characters, HTML texts use
3701special sequences, beginning with an ampersand @kbd{&} and ending with a
semicolon @kbd{;}.  The sequence may itself start with a number sign @kbd{#}
3703and be followed by digits, so forming a @dfn{numeric character reference},
3704or else be an alphabetic identifier, so forming a @dfn{character entity
3705reference}.
3706
The HTML standards have been revised into different HTML levels over time,
and the list of allowable character entities differs among them.  The later XML,
meant to simplify many things, has an option (@samp{standalone=yes}) which
much restricts that list.  The @code{recode} library is able to convert
character references between their mnemonic form and their numeric form,
depending on the targeted HTML standard level.  It can also, of course, convert
between HTML and various other charsets.
3714
3715Here is a list of those HTML variants which @code{recode} supports.
Some notes have been provided by Fran@,{c}ois Yergeau @email{yergeau@@alis.com}.
3717
3718@table @code
3719@item XML-standalone
3720@tindex h0
3721@tindex XML-standalone
3722This charset is available in @code{recode} under the name
3723@code{XML-standalone}, with @code{h0} as an acceptable alias.  It is
3724documented in section 4.1 of @uref{http://www.w3.org/TR/REC-xml}.
3725It only knows @samp{&amp;}, @samp{&gt;}, @samp{&lt;}, @samp{&quot;}
3726and @samp{&apos;}.
3727
3728@item HTML_1.1
3729@tindex HTML_1.1
3730@tindex h1
3731This charset is available in @code{recode} under the name @code{HTML_1.1},
3732with @code{h1} as an acceptable alias.  HTML 1.0 was never really documented.
3733
3734@item HTML_2.0
3735@tindex HTML_2.0
3736@tindex RFC1866
3737@tindex 1866
3738@tindex h2
3739This charset is available in @code{recode} under the name @code{HTML_2.0},
3740and has @code{RFC1866}, @code{1866} and @code{h2} for aliases.  HTML 2.0
3741entities are listed in @w{RFC 1866}.  Basically, there is an entity for
3742each @emph{alphabetical} character in the right part of @w{ISO 8859-1}.
3743In addition, there are four entities for syntax-significant ASCII characters:
3744@samp{&amp;}, @samp{&gt;}, @samp{&lt;} and @samp{&quot;}.
3745
3746@item HTML-i18n
3747@tindex HTML-i18n
3748@tindex RFC2070
3749@tindex 2070
3750This charset is available in @code{recode} under the name
3751@code{HTML-i18n}, and has @code{RFC2070} and @code{2070} for
3752aliases.  @w{RFC 2070} added entities to cover the whole right
3753part of @w{ISO 8859-1}.  The list is conveniently accessible at
3754@uref{http://www.alis.com:8085/ietf/html/html-latin1.sgml}.  In addition,
3755four i18n-related entities were added: @samp{&zwnj;} (@samp{&#8204;}),
@samp{&zwj;} (@samp{&#8205;}), @samp{&lrm;} (@samp{&#8206;}) and @samp{&rlm;}
3757(@samp{&#8207;}).
3758
3759@item HTML_3.2
3760@tindex HTML_3.2
3761@tindex h3
3762This charset is available in @code{recode} under the name
3763@code{HTML_3.2}, with @code{h3} as an acceptable alias.
3764@uref{http://www.w3.org/TR/REC-html32.html, HTML 3.2} took up the full
3765@w{Latin-1} list but not the i18n-related entities from @w{RFC 2070}.
3766
3767@item HTML_4.0
3768@tindex h4
3769@tindex h
3770This charset is available in @code{recode} under the name @code{HTML_4.0},
3771and has @code{h4} and @code{h} for aliases.  Beware that the particular
3772alias @code{h} is not @emph{tied} to HTML 4.0, but to the highest HTML
3773level supported by @code{recode}; so it might later represent HTML level
37745 if this is ever created.  @uref{http://www.w3.org/TR/REC-html40/,
3775HTML 4.0} has the whole @w{Latin-1} list, a set of entities for
3776symbols, mathematical symbols, and Greek letters, and another set for
3777markup-significant and internationalization characters comprising the
37784 ASCII entities, the 4 i18n-related from @w{RFC 2070} plus some more.
3779See @uref{http://www.w3.org/TR/REC-html40/sgml/entities.html}.
3780
3781@end table
3782
3783Printable characters from @w{Latin-1} may be used directly in an HTML text.
3784However, partly because people have deficient keyboards, partly because
3785people want to transmit HTML texts over non 8-bit clean channels while not
3786using MIME, it is common (yet debatable) to use character entity references
3787even for @w{Latin-1} characters, when they fall outside ASCII (that is,
3788when they have the 8th bit set).
3789
When you recode from another charset to @code{HTML}, beware that all
occurrences of double quotes, ampersands, and left or right angle brackets
are translated into special sequences.  However, in practice, people often
use ampersands and angle brackets in the other charset for introducing
HTML commands, compromising it: it is not pure HTML, nor is it pure
other charset.  These particular translations can be rather inconvenient;
they may be specifically inhibited through the command option @samp{-d}
(@pxref{Mixed}).
3798
3799Codes not having a mnemonic entity are output by @code{recode} using the
3800@samp{&#@var{nnn};} notation, where @var{nnn} is a decimal representation
3801of the UCS code value.  When there is an entity name for a character, it
3802is always preferred over a numeric character reference.  ASCII printable
characters are always generated directly.  So is the newline.  While reading
HTML, @code{recode} supports numeric character references as alternate
writings, even when written as hexadecimal numbers, as in @samp{&#xfffd;}.
This is documented in:
3807
3808@example
3809http://www.w3.org/TR/REC-html40/intro/sgmltut.html#h-3.2.3
3810@end example
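
The output rules just described can be imitated with a short Python sketch.
It is not the @code{recode} implementation: it uses the HTML 4.01 entity
table shipped with Python (@code{html.entities.codepoint2name}) as a
stand-in for the @code{HTML_4.0} level, and the function name
@code{to_html} is hypothetical.

```python
from html import unescape
from html.entities import codepoint2name


def to_html(text):
    """Write text with entity names preferred over numeric references."""
    out = []
    for ch in text:
        code = ord(ch)
        if code in codepoint2name:
            # A mnemonic exists (this includes &quot; &amp; &lt; &gt;).
            out.append("&%s;" % codepoint2name[code])
        elif ch == "\n" or 0x20 <= code < 0x7F:
            # Printable ASCII and the newline are generated directly.
            out.append(ch)
        else:
            # No mnemonic: decimal numeric character reference.
            out.append("&#%d;" % code)
    return "".join(out)
```

Under these assumptions, an e-acute is written as @samp{&eacute;}, while a
character lacking an entity name at that level, such as U+0151, is written
as @samp{&#337;}.  In the other direction, Python's @code{html.unescape}
accepts both decimal and hexadecimal references, so
@code{to_html(unescape("&#198;"))} rewrites a numeric reference into
@samp{&AElig;}.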
3811
3812When @code{recode} translates to HTML, the translation occurs according to
3813the HTML level as selected by the goal charset.  When translating @emph{from}
3814HTML, @code{recode} not only accepts the character entity references known at
3815that level, but also those of all other levels, as well as a few alternative
3816special sequences, to be forgiving to files using other HTML standards.
3817
@cindex normalise an HTML file
3819@cindex HTML normalization
3820The @code{recode} program can be used to @emph{normalise} an HTML file using
3821oldish conventions.  For example, it accepts @samp{&AE;}, as this once was a
3822valid writing, somewhere.  However, it should always produce @samp{&AElig;}
3823instead of @samp{&AE;}.  Yet, this is not completely true.  If one does:
3824
3825@example
3826recode h3..h3 < @var{input}
3827@end example
3828
3829@noindent
3830the operation will be optimised into a mere copy, and you can get @samp{&AE;}
3831this way, if you had some in your input file.  But if you explicitly defeat
3832the optimisation, like this maybe:
3833
3834@example
3835recode h3..u2,u2..h3 < @var{input}
3836@end example
3837
3838@noindent
3839then @samp{&AE;} should be normalised into @samp{&AElig;} by the operation.
3840
3841@node LaTeX, Texinfo, HTML, Miscellaneous
3842@section La@TeX{} macro calls
3843
3844@tindex LaTeX@r{, a charset}
3845@tindex ltex
3846@cindex La@TeX{} files
3847@cindex @TeX{} files
3848This charset is available in @code{recode} under the name @code{LaTeX}
3849and has @code{ltex} as an alias.  It is used for ASCII files coded to be
3850read by La@TeX{} or, in certain cases, by @TeX{}.
3851
3852Whenever you recode from another charset to @code{LaTeX}, beware that all
3853occurrences of backslashes @kbd{\} are translated into the string
3854@samp{\backslash@{@}}.  However, in practice, people often use backslashes
in the other charset for introducing @TeX{} commands, compromising it:
it is not pure @TeX{}, nor is it pure other charset.  This translation
of backslashes into @samp{\backslash@{@}} can be rather inconvenient;
it may be inhibited through the command option @samp{-d} (@pxref{Mixed}).
3859
3860@node Texinfo, Vietnamese, LaTeX, Miscellaneous
3861@section GNU project documentation files
3862
3863@tindex Texinfo@r{, a charset}
3864@tindex texi
3865@tindex ti
3866@cindex Texinfo files
3867This charset is available in @code{recode} under the name @code{Texinfo}
3868and has @code{texi} and @code{ti} for aliases.  It is used by the GNU
3869project for its documentation.  Texinfo files may be converted into Info
3870files by the @code{makeinfo} program and into nice printed manuals by
3871the @TeX{} system.
3872
Even if @code{recode} may transform other charsets to Texinfo, it cannot
yet read Texinfo files.  Usages are also changing
between versions of Texinfo, and @code{recode} only partially succeeds
in correctly following these changes.  So, for now, Texinfo support in
@code{recode} should be considered as work still in progress (!).
3878
3879@node Vietnamese, African, Texinfo, Miscellaneous
3880@section Vietnamese charsets
3881
3882@cindex Vietnamese charsets
We are currently experimenting with the implementation, in @code{recode}, of a few
character sets and transliterated forms to handle the Vietnamese language.
They are briefly summarised here.
3886
3887@table @code
3888@item TCVN
@tindex TCVN@r{, for Vietnamese}
3890@tindex VN1@r{, maybe not available}
3891@tindex VN2@r{, maybe not available}
3892@tindex VN3@r{, maybe not available}
The TCVN charset has an incomplete name.  It might be one of the three
charsets @code{VN1}, @code{VN2} or @code{VN3}.  Yet @code{VN2} might be a
second version of @code{VISCII}.  To be clarified.
3896
3897@item VISCII
3898@tindex VISCII
3899This is an 8-bit character set which seems to be rather popular for
3900writing Vietnamese.
3901
3902@item VPS
3903@tindex VPS
This is an 8-bit character set for Vietnamese.  Not much reference is available.
3905
3906@item VIQR
3907@tindex VIQR
3908The VIQR convention is a 7-bit, @code{ASCII} transliteration for Vietnamese.
3909
3910@item VNI
3911@tindex VNI
The VNI convention is an 8-bit, @code{Latin-1} transliteration for Vietnamese.
3913@end table
3914
3915@tindex 1129@r{, not available}
3916@tindex CP1129@r{, not available}
3917@tindex 1258@r{, not available}
3918@tindex CP1258@r{, not available}
Still lacking for Vietnamese in @code{recode} are the charsets @code{CP1129}
and @code{CP1258}.
3921
3922@node African, Others, Vietnamese, Miscellaneous
3923@section African charsets
3924
3925@cindex African charsets
3926Some African character sets are available for a few languages, when these
3927are heavily used in countries where French is also currently spoken.
3928
3929@tindex AFRFUL-102-BPI_OCIL@r{, and aliases}
3930@tindex bambara
3931@tindex bra
3932@tindex ewondo
3933@tindex fulfude
3934@tindex AFRFUL-103-BPI_OCIL@r{, and aliases}
3935@tindex t-bambara
3936@tindex t-bra
3937@tindex t-ewondo
3938@tindex t-fulfude
3939One African charset is usable for Bambara, Ewondo and Fulfude, as well
3940as for French.  This charset is available in @code{recode} under the name
3941@code{AFRFUL-102-BPI_OCIL}.  Accepted aliases are @code{bambara}, @code{bra},
3942@code{ewondo} and @code{fulfude}.  Transliterated forms of the same are
3943available under the name @code{AFRFUL-103-BPI_OCIL}.  Accepted aliases
3944are @code{t-bambara}, @code{t-bra}, @code{t-ewondo} and @code{t-fulfude}.
3945
3946@tindex AFRLIN-104-BPI_OCIL
3947@tindex lingala
3948@tindex lin
3949@tindex sango
3950@tindex wolof
3951@tindex AFRLIN-105-BPI_OCIL
3952@tindex t-lingala
3953@tindex t-lin
3954@tindex t-sango
3955@tindex t-wolof
3956Another African charset is usable for Lingala, Sango and Wolof, as well
3957as for French.  This charset is available in @code{recode} under the
3958name @code{AFRLIN-104-BPI_OCIL}.  Accepted aliases are @code{lingala},
3959@code{lin}, @code{sango} and @code{wolof}.  Transliterated forms of the same
3960are available under the name @code{AFRLIN-105-BPI_OCIL}.  Accepted aliases
3961are @code{t-lingala}, @code{t-lin}, @code{t-sango} and @code{t-wolof}.
3962
3963@tindex AFRL1-101-BPI_OCIL
3964@tindex t-francais
3965@tindex t-fra
3966To ease exchange with @code{ISO-8859-1}, there is a charset conveying
3967transliterated forms for @w{Latin-1} in a way which is compatible with the other
3968African charsets in this series.  This charset is available in @code{recode}
3969under the name @code{AFRL1-101-BPI_OCIL}.  Accepted aliases are @code{t-fra}
3970and @code{t-francais}.
3971
3972@node Others, Texte, African, Miscellaneous
3973@section Cyrillic and other charsets
3974
3975@cindex Cyrillic charsets
The following Cyrillic charsets are already available in @code{recode}
through @w{RFC 1345} tables: @code{CP1251} with aliases @code{1251},
@code{ms-cyrl} and @code{windows-1251}; @code{CSN_369103} with aliases
@code{ISO-IR-139} and @code{KOI8_L2}; @code{ECMA-cyrillic} with aliases
@code{ECMA-113}, @code{ECMA-113:1986} and @code{iso-ir-111}; @code{IBM880}
with aliases @code{880}, @code{CP880} and @code{EBCDIC-Cyrillic};
@code{INIS-cyrillic} with alias @code{iso-ir-51}; @code{ISO-8859-5} with
aliases @code{cyrillic}, @code{ISO-8859-5:1988} and @code{iso-ir-144};
@code{KOI-7}; @code{KOI-8} with alias @code{GOST_19768-74}; @code{KOI8-R};
@code{KOI8-RU}; and finally @code{KOI8-U}.
3986
3987There seems to remain some confusion in Roman charsets for Cyrillic
3988languages, and because a few users requested it repeatedly, @code{recode}
3989now offers special services in that area.  Consider these charsets as
3990experimental and debatable, as the extraneous tables describing them are
3991still a bit fuzzy or non-standard.  Hopefully, in the long run, these
3992charsets will be covered in Keld Simonsen's works to the satisfaction of
3993everybody, and this section will merely disappear.
3994
3995@table @code
3996@item KEYBCS2
3997@tindex KEYBCS2
3998@tindex Kamenicky
3999This charset is available under the name @code{KEYBCS2}, with
4000@code{Kamenicky} as an accepted alias.
4001
4002@item CORK
4003@tindex CORK
4004@tindex T1
4005This charset is available under the name @code{CORK}, with @code{T1}
4006as an accepted alias.
4007
4008@item KOI-8_CS2
4009@tindex KOI-8_CS2
4010This charset is available under the name @code{KOI-8_CS2}.
4011@end table
4012
4013@node Texte, Mule, Others, Miscellaneous
4014@section Easy French conventions
4015
4016@tindex Texte
4017@tindex txte
4018This charset is available in @code{recode} under the name @code{Texte}
and has @code{txte} for an alias.  It is a seven-bit code, identical
4020to @code{ASCII-BS}, save for French diacritics which are noted using a
4021slightly different convention.
4022
At text entry time, these conventions provide a little speed up.  At read
time, they slightly improve the readability over a few alternate ways
of coding diacritics.  Of course, it would be better to have a specialised
keyboard to make direct eight-bit entries and fonts for immediately
displaying eight-bit ISO @w{Latin-1} characters.  But not everybody is so
fortunate.  In a few mailing environments, sadly enough, it still
happens that the eighth bit is often wilfully destroyed.
4030
4031@cindex Easy French
Easy French has been in use in France for a while.  I only slightly
adapted it (the diaeresis option) to make it more comfortable to several
usages in Qu@'ebec originating from Universit@'e de Montr@'eal.  In fact,
the main problem for me was not necessarily to invent Easy French, but
to recognise the ``best'' convention to use (best not being defined
here), and to try to solve the main pitfalls associated with the selected
convention.  In short, we have:
4039
4040@table @kbd
4041@item e'
4042for @kbd{e} (and some other vowels) with an acute accent,
4043@item e`
4044for @kbd{e} (and some other vowels) with a grave accent,
4045@item e^
4046for @kbd{e} (and some other vowels) with a circumflex accent,
4047@item e"
4048for @kbd{e} (and some other vowels) with a diaeresis,
4049@item c,
4050for @kbd{c} with a cedilla.
4051@end table
4052
4053@noindent
4054There is no attempt at expressing the @kbd{ae} and @kbd{oe} diphthongs.
French also uses tildes over @kbd{n} and @kbd{a}, but seldom, and this
4056is not represented either.  In some countries, @kbd{:} is used instead
4057of @kbd{"} to mark diaeresis.  @code{recode} supports only one convention
4058per call, depending on the @samp{-c} option of the @code{recode} command.
4059French quotes (sometimes called ``angle quotes'') are noted the same way
4060English quotes are noted in @TeX{}, @emph{id est} by @kbd{``} and @kbd{''}.
4061No effort has been put to preserve Latin ligatures (@kbd{@ae{}}, @kbd{@oe{}})
4062which are representable in several other charsets.  So, these ligatures
4063may be lost through Easy French conventions.
4064
The convention is prone to losing information, because the diacritic
meaning overloads some characters that already have other uses.  To
alleviate this, some knowledge of the French language is built into
the recognition routines.  So, the following subtleties are systematically
obeyed by the various recognisers.
4070
4071@enumerate
4072@item
4073A comma which follows a @kbd{c} is interpreted as a cedilla only if it is
4074followed by one of the vowels @kbd{a}, @kbd{o} or @kbd{u}.
4075
4076@item
A single quote which follows an @kbd{e} does not necessarily mean an acute
accent if it is followed by a single other one.  For example:
4079
4080@table @kbd
4081@item e'
4082will give an @kbd{e} with an acute accent.
4083@item e''
4084will give a simple @kbd{e}, with a closing quotation mark.
4085@item e'''
4086will give an @kbd{e} with an acute accent, followed by a closing quotation
4087mark.
4088@end table
4089
There is a problem induced by this convention if there are English
quotations within a French text.  In sentences like:
4092
4093@example
4094There's a meeting at Archie's restaurant.
4095@end example
4096
4097the single quotes will be mistaken twice for acute accents.  So English
4098contractions and suffix possessives could be mangled.
4099
4100@item
4101A double quote or colon, depending on @samp{-c} option, which follows a
4102vowel is interpreted as diaeresis only if it is followed by another letter.
4103But there are in French several words that @emph{end} with a diaeresis,
4104and the @code{recode} library is aware of them.  There are words ending in
4105``igue'', either feminine words without a relative masculine (besaigu@"e
4106and cigu@"e), or feminine words with a relative masculine@footnote{There
4107are supposed to be seven words in this case.  So, one is missing.}
4108(aigu@"e, ambigu@"e, contigu@"e, exigu@"e, subaigu@"e and suraigu@"e).
There are also words not ending in ``igue'', but instead, either ending in
4110``i''@footnote{Look at one of the following sentences (the second has to
4111be interpreted with the @samp{-c} option):
4112
4113@example
4114"Ai"e!  Voici le proble`me que j'ai"
4115Ai:e!  Voici le proble`me que j'ai:
4116@end example
4117
4118There is an ambiguity between an
4119@tex
4120a\"\i,
4121@end tex
4122@ifinfo
4123ai",
4124@end ifinfo
4125@c FIXME: why not use @dotless{} here?  It works, AFAIK.
4126@ignore
4127a@"{@dotless{i}},
4128@end ignore
4129the small animal, and the indicative future of @emph{avoir} (first person
4130singular), when followed by what could be a diaeresis mark.  Hopefully,
4131the case is solved by the fact that an apostrophe always precedes the
4132verb and almost never the animal.}
4133@tex
4134(a\"\i, conga\"\i, go\"\i, ha\"\i ka\"\i, inou\"\i, sa\"\i, samura\"\i,
4135tha\"\i{} and toka\"\i),
4136@end tex
4137@ifinfo
4138(ai", congai", goi", hai"kai", inoui", sai", samurai", thai" and tokai"),
4139@end ifinfo
4140@ignore
4141(a@"{@dotless{i}}, conga@"{@dotless{i}}, go@"{@dotless{i}},
4142ha@"{@dotless{i}}ka@"{@dotless{i}}, inou@"{@dotless{i}}, sa@"{@dotless{i}},
4143samura@"{@dotless{i}}, tha@"{@dotless{i}} and toka@"{@dotless{i}}),
4144@end ignore
ending in ``e'' (cano@"e) or ending in ``u''@footnote{I did not pay
4146attention to proper nouns, but this one showed up as being fairly evident.}
4147(Esa@"u).
4148
Just to complete this topic, note that it would be wrong to make a rule
for all words ending in ``igue'' as needing a diaeresis, as there are
4151counter-examples (becfigue, b@`esigue, bigue, bordigue, bourdigue, brigue,
4152contre-digue, digue, d'intrigue, fatigue, figue, garrigue, gigue, igue,
4153intrigue, ligue, prodigue, sarigue and zigue).
4154@end enumerate
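
The first two rules of the list above can be sketched with regular
expressions in Python.  This is a simplified illustration, not the
recogniser used by @code{recode}: it handles only the small letters
@kbd{c} and @kbd{e}, ignores the diaeresis rule and its word list, and the
function name @code{easy_french_sketch} is hypothetical.

```python
import re


def easy_french_sketch(text):
    """Apply the cedilla rule and the e-acute rule to Easy French text."""
    # Rule 1: a comma after 'c' is a cedilla only before a, o or u.
    text = re.sub(r"c,(?=[aou])", "\u00e7", text)

    # Rule 2: one quote after 'e' is an acute accent, two quotes are a
    # closing quotation mark, three are an accent plus a closing mark.
    def acute(match):
        quotes = len(match.group(1))
        if quotes == 1:
            return "\u00e9"
        if quotes == 2:
            return "e''"
        return "\u00e9''"

    return re.sub(r"e('{1,3})", acute, text)
```

With this sketch, @samp{garc,on} becomes gar@,{c}on and @samp{e'te'} becomes
@'et@'e, while the comma in @samp{avec, ou} is left alone.  It also
reproduces the pitfall described above: the English word @samp{There's}
comes out with an acute accent and no apostrophe.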
4155
4156@node Mule,  , Texte, Miscellaneous
4157@section Mule as a multiplexed charset
4158
4159@tindex Mule@r{, a charset}
4160@cindex multiplexed charsets
4161@cindex super-charsets
4162This version of @code{recode} barely starts supporting multiplexed or
4163super-charsets, that is, those encoding methods by which a single text
4164stream may contain a combination of more than one constituent charset.
4165The only multiplexed charset in @code{recode} is @code{Mule}, and even
4166then, it is only very partially implemented: the only correspondence
available is with @code{Latin-1}.  The author quickly implemented this
only because he needed it for himself.  However, it is intended that
Mule support become more complete in subsequent releases of @code{recode}.
4170
Multiplexed charsets are not to be confused with mixed charset texts
(@pxref{Mixed}).  For mixed charset input, the rules for distinguishing
which charset is current, at any given place, are rather informal, and
derived from the semantics of what the file contains.  On the other hand,
multiplexed charsets are @emph{designed} to be interpreted fairly precisely,
and quite independently of any informational context.
4177
4178@cindex MULE, in Emacs
The spelling @code{Mule} originally stands for @cite{@emph{mul}tilingual
@emph{e}nhancement to GNU Emacs}; it is the result of a collective
effort orchestrated by Handa Ken'ichi since 1993.  When @code{Mule} got
4182rewritten in the main development stream of GNU Emacs 20, the FSF renamed
4183it @code{MULE}, meaning @cite{@emph{mul}tilingual @emph{e}nvironment
4184in GNU Emacs}.  Even if the charset @code{Mule} is meant to stay
4185internal to GNU Emacs, it sometimes breaks loose in external files,
4186and as a consequence, a recoding tool is sometimes needed.  Within Emacs,
4187@code{Mule} comes with @code{Leim}, which stands for @cite{@emph{l}ibraries
4188of @emph{e}macs @emph{i}nput @emph{m}ethods}.  One of these libraries is
4189named @code{quail}@footnote{Usually, quail means quail egg in Japanese,
4190while egg alone is usually chicken egg.  Both quail egg and chicken
egg are popular food in Japan.  The @code{quail} input system was so
named because it is smaller than the previous @code{EGG} system.
As for @code{EGG}, it is the translation of @code{TAMAGO}.  This word
comes from the Japanese sentence @cite{@emph{ta}kusan @emph{ma}tasete
@emph{go}mennasai}, meaning @cite{sorry to have kept you waiting so long}.
4196Of course, the publication of @code{EGG} has been delayed many times@dots{}
4197(Story by Takahashi Naoto)}.
4198
4199@node Surfaces, Internals, Miscellaneous, Top
4200@chapter All about surfaces
4201@cindex surface, what it is
4202
4203@cindex trivial surface
The @dfn{trivial surface} consists of using a fixed number of bits
(often eight) for each character; together, the bits hold the integer
value of the index for the character in its charset table.  There are
many kinds of surfaces, beyond the trivial one, all having the purpose
of improving selected qualities of storage or transmission.
For example, surfaces might increase the resistance to channel limits
(@code{Base64}), the transmission speed (@code{gzip}), the information
privacy (@code{DES}), the conformance to operating system conventions
(@code{CR-LF}), the blocking into records (@code{VB}), and surely other
things as well@footnote{These are mere examples to explain the concept;
@code{recode} actually has only @code{Base64} and @code{CR-LF}.}.
Many surfaces may be applied to a stream of characters from a charset.
The order of application of surfaces is important, and surfaces
should be removed in the reverse order of their application.
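
The ordering rule can be illustrated outside of @code{recode}.  The
following Python sketch is not part of @code{recode} and does not use
its library; the helper names @code{add_crlf}, @code{remove_crlf},
@code{apply_all} and @code{remove_all} are made up for the illustration.
It stacks a Base64 encoding and a @code{CR-LF} end of line convention,
then peels them off in the reverse order.

@example
import base64

def add_crlf(data):
    # Terminate each line by CR LF instead of a lone LF.
    return data.replace(b"\n", b"\r\n")

def remove_crlf(data):
    return data.replace(b"\r\n", b"\n")

# Each surface is an (apply, remove) pair, in order of application.
surfaces = [
    (base64.encodebytes, base64.decodebytes),
    (add_crlf, remove_crlf),
]

def apply_all(data):
    for add, remove in surfaces:
        data = add(data)
    return data

def remove_all(data):
    # Removal walks the same list backwards.
    for add, remove in reversed(surfaces):
        data = remove(data)
    return data

text = b"All about surfaces\n"
wrapped = apply_all(text)
assert remove_all(wrapped) == text

# Applying the two surfaces the other way round gives other bytes.
assert base64.encodebytes(add_crlf(text)) != wrapped
@end example

The last assertion shows why the order has to be remembered: the same
two surfaces applied the other way round yield a different byte stream.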
4218
4219Even if surfaces may generally be applied to various charsets, some
4220surfaces were specifically designed for a particular charset, and would
4221not make much sense if applied to other charsets.  In such cases, these
4222conceptual surfaces have been implemented as @code{recode} charsets,
instead of as surfaces.  This choice yields a cleaner syntax
4224and usage.  @xref{Universal}.
4225
4226@cindex surfaces, implementation in @code{recode}
4227@tindex data@r{, a special charset}
4228@tindex tree@r{, a special charset}
4229Surfaces are implemented within @code{recode} as special charsets
4230which may only transform to or from the @code{data} or @code{tree}
4231special charsets.  Clever users may use this knowledge for writing
4232surface names in requests exactly as if they were pure charsets, when
4233the only need is to change surfaces without any kind of recoding between
4234real charsets.  In such contexts, either @code{data} or @code{tree} may
4235also be used as if it were some kind of generic, anonymous charset: the
4236request @samp{data..@var{surface}} merely adds the given @var{surface},
4237while the request @samp{@var{surface}..data} removes it.
4238
4239@cindex structural surfaces
4240@cindex surfaces, structural
4241@cindex surfaces, trees
4242The @code{recode} library distinguishes between mere data surfaces, and
4243structural surfaces, also called tree surfaces for short.  Structural
4244surfaces might allow, in the long run, transformations between a few
4245specialised representations of structural information like MIME parts,
4246Perl or Python initialisers, LISP S-expressions, XML, Emacs outlines, etc.
4247
4248We are still experimenting with surfaces in @code{recode}.  The concept opens
4249the doors to many avenues; it is not clear yet which ones are worth pursuing,
and which should be abandoned.  In particular, implementation of structural
surfaces is barely starting; there is not even a commitment that tree
surfaces will stay in @code{recode}, if they prove to be more cumbersome
than useful.  This chapter presents all surfaces currently available.
4254
4255@menu
4256* Permutations::        Permuting groups of bytes
4257* End lines::           Representation for end of lines
4258* MIME::                MIME contents encodings
4259* Dump::                Interpreted character dumps
4260* Test::                Artificial data for testing
4261@end menu
4262
4263@node Permutations, End lines, Surfaces, Surfaces
4264@section Permuting groups of bytes
4265@cindex permutations of groups of bytes
4266
4267@cindex byte order swapping
@cindex endianness, changing
4269A permutation is a surface transformation which reorders groups of
4270eight-bit bytes.  A @emph{21} permutation exchanges pairs of successive
4271bytes.  If the text contains an odd number of bytes, the last byte is
merely copied.  A @emph{4321} permutation inverts the order of quadruples
of bytes.  If the text does not contain a multiple of four bytes, the
4274remaining bytes are nevertheless permuted as @emph{321} if there are
4275three bytes, @emph{21} if there are two bytes, or merely copied otherwise.
4276
4277@table @code
4278@item 21
4279@tindex 21-Permutation
4280@tindex swabytes
4281This surface is available in @code{recode} under the name
4282@code{21-Permutation} and has @code{swabytes} for an alias.
4283
4284@item 4321
4285@tindex 4321-Permutation
4286This surface is available in @code{recode} under the name
4287@code{4321-Permutation}.
4288@end table
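
The behaviour described at the start of this section, including the
handling of a short final group, can be sketched in a few lines of
Python.  This is only an illustration of the rules as stated, not the
code used by @code{recode}; the function name @code{permute} is
hypothetical.

@example
def permute(data, size):
    # Reverse every group of size bytes.  A short final group is
    # reversed as well, which leaves a single trailing byte unchanged.
    out = bytearray()
    for start in range(0, len(data), size):
        out += data[start:start + size][::-1]
    return bytes(out)

# 21 permutation: the odd last byte is merely copied.
assert permute(b"abcde", 2) == b"badce"

# 4321 permutation: the three remaining bytes are permuted as 321.
assert permute(b"abcdefg", 4) == b"dcbagfe"

# Applying the same permutation twice restores the original text.
assert permute(permute(b"abcdefg", 4), 4) == b"abcdefg"
@end example

Since each group is merely reversed, a permutation is its own inverse,
so adding and removing the surface are the same operation.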
4289
4290@node End lines, MIME, Permutations, Surfaces
4291@section Representation for end of lines
4292@cindex end of line format
4293
The same charset might slightly differ, from one system to another,
solely because end of lines are not represented identically on all
4296systems.  The representation for an end of line within @code{recode}
4297is the @code{ASCII} or @code{UCS} code with value 10, or @kbd{LF}.  Other
4298conventions for representing end of lines are available through surfaces.
4299
4300@table @code
4301@item CR
4302@tindex CR@r{, a surface}
4303This convention is popular on Apple's Macintosh machines.  When this
4304surface is applied, each line is terminated by @kbd{CR}, which has
4305@code{ASCII} value 13.  Unless the library is operating in strict mode,
4306adding or removing the surface will in fact @emph{exchange} @kbd{CR} and
4307@kbd{LF}, for better reversibility.  However, in strict mode, the exchange
does not happen: any @kbd{CR} will be copied verbatim while applying
the surface, and any @kbd{LF} will be copied verbatim while removing it.

This surface is available in @code{recode} under the name @code{CR};
it does not have any aliases.  This is the implied surface for the Apple
4313Macintosh related charsets.
4314
4315@item CR-LF
4316@tindex CR-LF@r{, a surface}
4317This convention is popular on Microsoft systems running on IBM PCs and
compatibles.  When this surface is applied, each line is terminated by
4319a sequence of two characters: one @kbd{CR} followed by one @kbd{LF},
4320in that order.
4321
4322@cindex Ctrl-Z, discarding
4323For compatibility with oldish MS-DOS systems, removing a @code{CR-LF}
4324surface will discard the first encountered @kbd{C-z}, which has
4325@code{ASCII} value 26, and everything following it in the text.
4326Adding this surface will not, however, append a @kbd{C-z} to the result.
4327
4328@tindex cl
4329This surface is available in @code{recode} under the name @code{CR-LF}
4330and has @code{cl} for an alias.  This is the implied surface for the IBM
4331or Microsoft related charsets or code pages.
4332@end table
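
As a rough illustration of the two surfaces above, here is a Python
sketch of the rules as they are described, for the non-strict
@code{CR} exchange and for the @code{CR-LF} surface with its
@kbd{C-z} handling.  The function names are invented for this example,
and the actual @code{recode} code may differ in details.

@example
CR, LF, CTRL_Z = b"\r", b"\n", b"\x1a"

def exchange_cr_lf(data):
    # Non-strict CR surface: CR and LF are exchanged, so adding and
    # removing the surface are the same operation.
    return data.translate(bytes.maketrans(CR + LF, LF + CR))

def add_cr_lf(data):
    # No C-z is appended when the surface is added.
    return data.replace(LF, CR + LF)

def remove_cr_lf(data):
    # Discard the first C-z and everything following it, then turn
    # each CR LF pair back into a lone LF.
    cut = data.find(CTRL_Z)
    if cut >= 0:
        data = data[:cut]
    return data.replace(CR + LF, LF)

assert exchange_cr_lf(b"one\ntwo\r") == b"one\rtwo\n"
assert exchange_cr_lf(exchange_cr_lf(b"one\ntwo\r")) == b"one\ntwo\r"
assert remove_cr_lf(add_cr_lf(b"one\ntwo\n")) == b"one\ntwo\n"
assert remove_cr_lf(b"one\r\n\x1aleftover") == b"one\n"
@end example

The second assertion is the point of the exchange: a stray @kbd{CR} in
the input survives a round trip, which would not be the case if every
@kbd{LF} were simply overwritten.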
4333
4334Some other charsets might have their own representation for an end of
4335line, which is different from @kbd{LF}.  For example, this is the case
4336of various @code{EBCDIC} charsets, or @code{Icon-QNX}.  The recoding of
end of lines is intimately tied into such charsets; it is not available
separately as a surface.
4339
4340@node MIME, Dump, End lines, Surfaces
4341@section MIME contents encodings
4342@cindex MIME encodings
4343
4344@cindex RFC 2045
4345@w{RFC 2045} defines two 7-bit surfaces, meant to prepare 8-bit messages for
4346transmission.  Base64 is especially usable for binary entities, while
Quoted-Printable is especially usable for text entities, in those cases
where the lower 128 characters of the underlying charset coincide with ASCII.
4349
4350@table @code
4351@tindex Base64
4352@tindex b64
4353@tindex 64
4354@item Base64
4355This surface is available in @code{recode} under the name @code{Base64},
4356with @code{b64} and @code{64} as acceptable aliases.
4357
4358@item Quoted-Printable
4359@tindex Quoted-Printable
4360@tindex quote-printable
4361@tindex QP
4362This surface is available in @code{recode} under the name
4363@code{Quoted-Printable}, with @code{quote-printable} and @code{QP} as
4364acceptable aliases.
4365@end table
4366
4367Note that @code{UTF-7}, which may be also considered as a MIME surface,
is provided as a genuine charset instead, as it necessarily relates to
4369@code{UCS-2} and nothing else.  @xref{UTF-7}.
4370
4371A little historical note, also showing the three levels of acceptance of
4372Internet standards.  MIME changed from a ``Proposed Standard'' (@w{RFC
43731341--1344}, 1992) to a ``Draft Standard'' (@w{RFC 1521--1523}) in 1993,
4374and was @emph{recycled} as a ``Draft Standard'' in 1996-11.  It is not yet a
4375``Full Standard''.
4376
4377@node Dump, Test, MIME, Surfaces
4378@section Interpreted character dumps
4379
4380@cindex dumping characters
4381Dumps are surfaces meant to express, in ways which are a bit more readable,
4382the bit patterns used to represent characters.  They allow the inspection
or debugging of character streams, but they may also assist somewhat in the
production of C source code which, once compiled, would hold in memory a
copy of the original coding.  However, @code{recode} does not attempt, in
any way, to produce complete C source files in dumps.  User hand editing
or @file{Makefile} trickery is still needed for adding missing lines.
Dumps may be given in decimal, hexadecimal and octal, and be based on
4389chunks of either one, two or four eight-bit bytes.  Formatting has been
4390chosen to respect the C language syntax for number constants, with commas
4391and newlines inserted appropriately.
4392
4393However, when dumping two or four byte chunks, the last chunk may be
incomplete.  This is observable through the use of a narrower expression
4395for that last chunk only.  Such a shorter chunk would not be compiled
4396properly within a C initialiser, as all members of an array share a single
4397type, and so, have identical sizes.
4398
4399@table @code
4400@item Octal-1
4401@tindex Octal-1
4402@tindex o1
4403This surface corresponds to an octal expression of each input byte.
4404
4405It is available in @code{recode} under the name @code{Octal-1},
4406with @code{o1} and @code{o} as acceptable aliases.
4407
4408@item Octal-2
4409@tindex Octal-2
4410@tindex o2
4411This surface corresponds to an octal expression of each pair of
4412input bytes, except for the last pair, which may be short.
4413
4414It is available in @code{recode} under the name @code{Octal-2}
4415and has @code{o2} for an alias.
4416
4417@item Octal-4
4418@tindex Octal-4
4419@tindex o4
4420This surface corresponds to an octal expression of each quadruple of
4421input bytes, except for the last quadruple, which may be short.
4422
4423It is available in @code{recode} under the name @code{Octal-4}
4424and has @code{o4} for an alias.
4425
4426@item Decimal-1
4427@tindex Decimal-1
4428@tindex d1
This surface corresponds to a decimal expression of each input byte.
4430
4431It is available in @code{recode} under the name @code{Decimal-1},
4432with @code{d1} and @code{d} as acceptable aliases.
4433
4434@item Decimal-2
4435@tindex Decimal-2
4436@tindex d2
This surface corresponds to a decimal expression of each pair of
4438input bytes, except for the last pair, which may be short.
4439
4440It is available in @code{recode} under the name @code{Decimal-2}
4441and has @code{d2} for an alias.
4442
4443@item Decimal-4
4444@tindex Decimal-4
4445@tindex d4
This surface corresponds to a decimal expression of each quadruple of
4447input bytes, except for the last quadruple, which may be short.
4448
4449It is available in @code{recode} under the name @code{Decimal-4}
4450and has @code{d4} for an alias.
4451
4452@item Hexadecimal-1
4453@tindex Hexadecimal-1
4454@tindex x1
This surface corresponds to a hexadecimal expression of each input byte.
4456
4457It is available in @code{recode} under the name @code{Hexadecimal-1},
4458with @code{x1} and @code{x} as acceptable aliases.
4459
4460@item Hexadecimal-2
4461@tindex Hexadecimal-2
4462@tindex x2
This surface corresponds to a hexadecimal expression of each pair of
4464input bytes, except for the last pair, which may be short.
4465
4466It is available in @code{recode} under the name @code{Hexadecimal-2},
4467with @code{x2} for an alias.
4468
4469@item Hexadecimal-4
4470@tindex Hexadecimal-4
4471@tindex x4
This surface corresponds to a hexadecimal expression of each quadruple of
4473input bytes, except for the last quadruple, which may be short.
4474
4475It is available in @code{recode} under the name @code{Hexadecimal-4},
4476with @code{x4} for an alias.
4477@end table
4478
When removing a dump surface, that is, when reading dump results back
4480into a sequence of bytes, the narrower expression for a short last chunk
4481is recognised, so dumping is a fully reversible operation.  However, in
4482case you want to produce dumps by other means than through @code{recode},
4483beware that for decimal dumps, the library has to rely on the number of
4484spaces to establish the original byte size of the chunk.
4485
4486Although the library might report reversibility errors, removing a dump
4487surface is a rather forgiving process: one may mix bases, group a variable
4488number of data per source line, or use shorter chunks in places other
than at the far end.  Also, source lines not beginning with a number are
skipped.  So,
4491@code{recode} should often be able to read a whole C header file, wrapping
4492the results of a previous dump, and regenerate the original byte string.
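
The role of the narrower expression for a short last chunk can be
sketched as follows in Python, for a two-byte hexadecimal dump.  This is
only a model of the idea: the function names are hypothetical, and the
layout assumed here (upper case digits, a comma and a space between
constants, most significant byte first) is an assumption which may not
match what @code{recode} really prints.

@example
def dump_hex2(data):
    # One C style constant per pair of bytes.  A lone final byte
    # gets a narrower constant, with two digits instead of four.
    items = []
    for start in range(0, len(data), 2):
        chunk = data[start:start + 2]
        items.append("0x" + chunk.hex().upper())
    return ", ".join(items)

def undump_hex(text):
    # The number of digits of each constant tells how many bytes
    # it stands for, so chunk sizes may vary along the way.
    out = bytearray()
    for item in text.replace(",", " ").split():
        out += bytes.fromhex(item[2:])
    return bytes(out)

assert dump_hex2(b"ABC") == "0x4142, 0x43"
assert undump_hex(dump_hex2(b"ABC")) == b"ABC"
assert undump_hex("0x41, 0x4243") == b"ABC"
@end example

The last assertion mirrors the forgiving nature of dump removal: a
shorter chunk is accepted in a place other than the far end.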
4493
4494@node Test,  , Dump, Surfaces
4495@section Artificial data for testing
4496
4497A few pseudo-surfaces exist to generate debugging data out of thin air.
4498These surfaces are only meant for the expert @code{recode} user, and are
only useful in a few contexts, such as generating binary permutations
from the recoding, or acting on them.
4501
4502@cindex debugging surfaces
4503Debugging surfaces, @emph{when removed}, insert their generated data
4504at the beginning of the output stream, and copy all the input stream
4505after the generated data, unchanged.  This strange removal constraint
4506comes from the fact that debugging surfaces are usually specified in the
4507@emph{before} position instead of the @emph{after} position within a request.
4508With debugging surfaces, one often recodes file @file{/dev/null} in filter
4509mode.  Specifying many debugging surfaces at once has an accumulation
4510effect on the output, and since surfaces are removed from right to left,
4511each generating its data at the beginning of previous output, the net
4512effect is an @emph{impression} that debugging surfaces are generated from
4513left to right, each appending to the result of the previous.  In any case,
4514any real input data gets appended after what was generated.
4515
4516@table @code
4517@item test7
4518@tindex test7
4519When removed, this surface produces 128 single bytes, the first having
4520value 0, the second having value 1, and so forth until all 128 values have
4521been generated.
4522
4523@item test8
4524@tindex test8
4525When removed, this surface produces 256 single bytes, the first having
4526value 0, the second having value 1, and so forth until all 256 values have
4527been generated.
4528
4529@item test15
4530@tindex test15
4531When removed, this surface produces 64509 double bytes, the first having
4532value 0, the second having value 1, and so forth until all values have been
4533generated, but excluding risky @code{UCS-2} values, like all codes from
4534the surrogate @code{UCS-2} area (for @code{UTF-16}), the byte order mark,
4535and values known as invalid @code{UCS-2}.
4536
4537@item test16
4538@tindex test16
4539When removed, this surface produces 65536 double bytes, the first having
4540value 0, the second having value 1, and so forth until all 65536 values
4541have been generated.
4542@end table
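
The accumulation effect described above may be easier to follow with a
small model.  The Python sketch below is an illustration only, with
invented function names; it leaves @code{test15} out because its exact
list of excluded values is not spelled out here, and it assumes, for
@code{test16}, that the most significant byte comes first.

@example
def test7():
    return bytes(range(128))

def test8():
    return bytes(range(256))

def test16():
    # 65536 two-byte values; the byte order is an assumption.
    out = bytearray()
    for value in range(65536):
        out += value.to_bytes(2, "big")
    return bytes(out)

def remove_surfaces(generators, data):
    # Surfaces are removed from right to left, each one inserting
    # its data in front of what has been produced so far.
    for generate in reversed(generators):
        data = generate() + data
    return data

result = remove_surfaces([test7, test8], b"input")
assert result[:128] == test7()
assert result[128:384] == test8()
assert result[384:] == b"input"
@end example

Although @code{test8} is handled first, its data ends up after the data
of @code{test7}, which gives the left to right impression, with the
real input at the very end.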
4543
4544As an example, the command @samp{recode l5/test8..dump < /dev/null} is a
4545convoluted way to produce an output similar to @samp{recode -lf l5}.  It says
4546to generate all possible 256 bytes and interpret them as @code{ISO-8859-9}
4547codes, while converting them to @code{UCS-2}.  Resulting @code{UCS-2}
characters are dumped one per line, accompanied by their explanatory names.
4549
4550@node Internals, Concept Index, Surfaces, Top
4551@chapter Internal aspects
4552
4553@cindex @code{recode} internals
4554@cindex internals
The following explanations of the internals of @code{recode} should
help people who want to dive into @code{recode} sources for adding new
charsets.  Adding new charsets does not require much knowledge about
the overall organisation of @code{recode}.  Rather, you can concentrate
on your new charset, letting the remainder of the @code{recode}
mechanics take care of interconnecting it with all other charsets.
4561
4562If you intend to play seriously at modifying @code{recode}, beware that
4563you may need some other GNU tools which were not required when you first
installed @code{recode}.  If you modify or create any @file{.l} file,
4565then you need Flex, and some better @code{awk} like @code{mawk},
4566GNU @code{awk}, or @code{nawk}.  If you modify the documentation (and
4567you should!), you need @code{makeinfo}.  If you are really audacious,
4568you may also want Perl for modifying tabular processing, then @code{m4},
4569Autoconf, Automake and @code{libtool} for adjusting configuration matters.
4570
4571@menu
4572* Main flow::           Overall organisation
4573* New charsets::        Adding new charsets
4574* New surfaces::        Adding new surfaces
4575* Design::              Comments on the library design
4576@end menu
4577
4578@node Main flow, New charsets, Internals, Internals
4579@section Overall organisation
4580@cindex @code{recode}, main flow of operation
4581
4582The @code{recode} mechanics slowly evolved for many years, and it
would be tedious to explain all problems I met and mistakes I made all
4584along, yielding the current behaviour.  Surely, one of the key choices
4585was to stop trying to do all conversions in memory, one line or one
4586buffer at a time.  It has been fruitful to use the character stream
4587paradigm, and the elementary recoding steps now convert a whole stream
4588to another.  Most of the control complexity in @code{recode} exists
so that each elementary recoding step stays simple, making it easier
4590to add new ones.  The whole point of @code{recode}, as I see it, is
4591providing a comfortable nest for growing new charset conversions.
4592
4593@cindex single step
4594The main @code{recode} driver constructs, while initialising all
4595conversion modules, a table giving all the conversion routines
4596available (@dfn{single step}s) and for each, the starting charset and
4597the ending charset.  If we consider these charsets as being the nodes
of a directed graph, each single step may be considered as an oriented
4599arc from one node to the other.  A cost is attributed to each arc:
4600for example, a high penalty is given to single steps which are prone
4601to losing characters, a lower penalty is given to those which need
4602studying more than one input character for producing an output
4603character, etc.
4604
4605Given a starting code and a goal code, @code{recode} computes the most
4606economical route through the elementary recodings, that is, the best
4607sequence of conversions that will transform the input charset into the
4608final charset.  To speed up execution, @code{recode} looks for
4609subsequences of conversions which are simple enough to be merged, and
4610then dynamically creates new single steps to represent these mergings.
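
The search for the most economical route can be pictured with a
classical shortest path computation over the graph of single steps.
The Python sketch below uses Dijkstra's algorithm as a stand-in; it is
not claimed to be the algorithm coded in @code{recode}, the function
name @code{cheapest_route} is hypothetical, and the arcs and costs in
the table are invented for the example.

@example
import heapq

def cheapest_route(steps, start, goal):
    # steps is a list of (before, after, cost) triples, one per arc.
    queue = [(0, start, [start])]
    done = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in done:
            continue
        done.add(node)
        for before, after, step_cost in steps:
            if before == node and after not in done:
                heapq.heappush(
                    queue,
                    (cost + step_cost, after, path + [after]))
    return None

# Invented single steps: the lossy ones carry a high penalty.
steps = [
    ("Latin-1", "UCS-2", 1),
    ("UCS-2", "Latin-1", 3),
    ("UCS-2", "IBM850", 3),
    ("Latin-1", "ASCII", 10),
    ("ASCII", "IBM850", 1),
]
route = cheapest_route(steps, "Latin-1", "IBM850")
assert route == (4, ["Latin-1", "UCS-2", "IBM850"])
@end example

With these made-up costs, the route through @code{UCS-2} wins over the
one through @code{ASCII}, because the penalty on the lossy arc
outweighs everything else.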
4611
4612@cindex double step
4613A @dfn{double step} in @code{recode} is a special concept representing a
4614sequence of two single steps, the output of the first single step being the
4615special charset @code{UCS-2}, the input of the second single step being
4616also @code{UCS-2}.  Special @code{recode} machinery dynamically produces
4617efficient, reversible, merge-able single steps out of these double steps.
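
To give an idea of what merging a double step means, the following
Python sketch composes a table going to @code{UCS-2} with a table
coming back from @code{UCS-2}, producing a single byte to byte table.
It is a simplified model under stated assumptions: the function name
@code{merge_double_step} and the two small tables are made up, and the
machinery which makes the result reversible in @code{recode} is not
shown at all.

@example
def merge_double_step(to_ucs2, from_ucs2):
    # to_ucs2: 256 UCS-2 values (or None), indexed by input byte.
    # from_ucs2: (UCS-2 value, output byte) pairs.
    reverse = dict(from_ucs2)
    table = []
    for code in range(256):
        # None marks an input byte having no counterpart.
        table.append(reverse.get(to_ucs2[code]))
    return table

# Invented tables: the first charset maps each byte to the same
# UCS-2 value; the second one knows ASCII, plus e with acute
# accent at position 0x82.
first_to_ucs2 = list(range(256))
ucs2_to_second = [(value, value) for value in range(128)]
ucs2_to_second.append((0xE9, 0x82))

table = merge_double_step(first_to_ucs2, ucs2_to_second)
assert table[0x41] == 0x41
assert table[0xE9] == 0x82
assert table[0xFF] is None
@end example

Once such a table exists, the intermediate @code{UCS-2} stream never
needs to be produced: each input byte is looked up once.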
4618
4619@cindex recoding steps, statistics
4620@cindex average number of recoding steps
4621I made some statistics about how many internal recoding steps are required
4622between any two charsets chosen at random.  The initial recoding layout,
4623before optimisation, always uses between 1 and 5 steps.  Optimisation could
4624sometimes produce mere copies, which are counted as no steps at all.
4625In other cases, optimisation is unable to save any step.  The number of
4626steps after optimisation is currently between 0 and 5 steps.  Of course,
4627the @emph{expected} number of steps is affected by optimisation: it drops
4628from 2.8 to 1.8.  This means that @code{recode} uses a theoretical average
of a bit less than two steps per recoding job.  This looks good.  This was
4630computed using reversible recodings.  In strict mode, optimisation might
be defeated somewhat.  The number of steps runs between 1 and 6, both before
4632and after optimisation, and the expected number of steps decreases by a
4633lesser amount, going from 2.2 to 1.3.  This is still manageable.
4634
4635@node New charsets, New surfaces, Main flow, Internals
4636@section Adding new charsets
4637@cindex adding new charsets
4638@cindex new charsets, how to add
4639
4640The main part of @code{recode} is written in C, as are most single
4641steps.  A few single steps need to recognise sequences of multiple
characters; they are often better written in Flex.  It is easy for a
4643programmer to add a new charset to @code{recode}.  All it requires
4644is making a few functions kept in a single @file{.c} file,
4645adjusting @file{Makefile.am} and remaking @code{recode}.
4646
One of the functions should convert from any previous charset to the
new one.  Any previous charset will do, but try to select it so you will
not lose too much information while converting.  The other function should
convert from the new charset to any older one.  You do not have to select
the same old charset as the one you selected for the previous routine.
4652Once again, select any charset for which you will not lose too much
4653information while converting.
4654
4655If, for any of these two functions, you have to read multiple bytes of the
4656old charset before recognising the character to produce, you might prefer
4657programming it in Flex in a separate @file{.l} file.  Prototype your
C or Flex files after one of those which exist already, so as to keep the
4659sources uniform.  Besides, at @code{make} time, all @file{.l} files are
4660automatically merged into a single big one by the script @file{mergelex.awk}.
4661
4662There are a few hidden rules about how to write new @code{recode}
4663modules, for allowing the automatic creation of @file{decsteps.h}
4664and @file{initsteps.h} at @code{make} time, or the proper merging of
all Flex files.  Mimicry is a simple approach which relieves me of
explaining all these rules!  Start with a module closely resembling
what you intend to do.  Here is some advice for picking a model.
First decide if your new charset module is to be driven by algorithms
4669rather than by tables.  For algorithmic recodings, see @file{iconqnx.c} for
4670C code, or @file{txtelat1.l} for Flex code.  For table driven recodings,
4671see @file{ebcdic.c} for one-to-one style recodings, @file{lat1html.c}
4672for one-to-many style recodings, or @file{atarist.c} for double-step
style recodings.  Just select an example from the style that best fits
4674your application.
4675
4676Each of your source files should have its own initialisation function,
4677named @code{module_@var{charset}}, which is meant to be executed
4678@emph{quickly} once, prior to any recoding.  It should declare the
4679name of your charsets and the single steps (or elementary recodings)
4680you provide, by calling @code{declare_step} one or more times.
4681Besides the charset names, @code{declare_step} expects a description
4682of the recoding quality (see @file{recodext.h}) and two functions you
4683also provide.
4684
4685The first such function has the purpose of allocating structures,
4686pre-conditioning conversion tables, etc.  It is also the way of further
4687modifying the @code{STEP} structure.  This function is executed if and
4688only if the single step is retained in an actual recoding sequence.
4689If you do not need such delayed initialisation, merely use @code{NULL}
4690for the function argument.
4691
4692The second function executes the elementary recoding on a whole file.
4693There are a few cases when you can spare writing this function:
4694
4695@c FIXME: functions file_one_to_one and file_one_to_many don't exist!
4696@itemize @bullet
4697@item
4698@findex file_one_to_one
4699Some single steps do nothing else than a pure copy of the input onto the
output; in this case, you can use the predefined function
4701@code{file_one_to_one}, while having a delayed initialisation for
4702presetting the @code{STEP} field @code{one_to_one} to the predefined
4703value @code{one_to_same}.
4704
4705@item
4706Some single steps are driven by a table which recodes one character into
4707another; if the recoding does nothing else, you can use the predefined
4708function @code{file_one_to_one}, while having a delayed initialisation
4709for presetting the @code{STEP} field @code{one_to_one} with your table.
4710
4711@item
4712@findex file_one_to_many
4713Some single steps are driven by a table which recodes one character into
4714a string; if the recoding does nothing else, you can use the predefined
4715function @code{file_one_to_many}, while having a delayed initialisation
4716for presetting the @code{STEP} field @code{one_to_many} with your table.
4717@end itemize
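
The three table-driven cases above can be modelled in a few lines of
Python.  This sketch only mirrors the concepts: the functions
@code{recode_one_to_one} and @code{recode_one_to_many} are hypothetical
names chosen for the example, they are not the C functions of the
library, and the tables are toy data.

@example
def recode_one_to_one(table, data):
    # table holds 256 bytes; entry i replaces input byte i.
    return data.translate(table)

def recode_one_to_many(table, data):
    # table holds 256 entries; entry i is a byte string, or None
    # when input byte i is to be copied unchanged.
    out = bytearray()
    for byte in data:
        replacement = table[byte]
        if replacement is None:
            out.append(byte)
        else:
            out += replacement
    return bytes(out)

# Pure copy: every byte maps to itself.
one_to_same = bytes(range(256))
assert recode_one_to_one(one_to_same, b"xyz") == b"xyz"

# One character into another.
upper = bytes.maketrans(b"abc", b"ABC")
assert recode_one_to_one(upper, b"cab") == b"CAB"

# One character into a string.
entities = [None] * 256
entities[0xE9] = b"&eacute;"
assert recode_one_to_many(entities, b"caf\xe9") == b"caf&eacute;"
@end example

In all three cases the step itself contains no logic beyond the table
lookup, which is why a delayed initialisation presetting the table is
all that a new module has to provide.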
4718
4719If you have a recoding table handy in a suitable format but do not use
4720one of the predefined recoding functions, it is still a good idea to use
4721a delayed initialisation to save it anyway, because @code{recode} option
4722@samp{-h} will take advantage of this information when available.
4723
4724Finally, edit @file{Makefile.am} to add the source file name of your routines
4725to the @code{C_STEPS} or @code{L_STEPS} macro definition, depending on
whether your routines are written in C or in Flex.
4727
4728@node New surfaces, Design, New charsets, Internals
4729@section Adding new surfaces
4730@cindex adding new surfaces
4731@cindex new surfaces, how to add
4732
4733Adding a new surface is technically quite similar to adding a new charset.
4734@xref{New charsets}.  A surface is provided as a set of two transformations:
4735one from the predefined special charset @code{data} or @code{tree} to the
4736new surface, meant to apply the surface, the other from the new surface
4737to the predefined special charset @code{data} or @code{tree}, meant to
4738remove the surface.
4739
4740@findex declare_step
Internally in @code{recode}, function @code{declare_step} specially
4742recognises when a charset is so related to @code{data} or @code{tree},
4743and then takes appropriate actions so that charset gets indeed installed
4744as a surface.
4745
4746@node Design,  , New surfaces, Internals
4747@section Comments on the library design
4748
4749@itemize @bullet
4750@item Why a shared library?
4751@cindex shared library implementation
4752
4753There are many different approaches to reduce system requirements to
4754handle all tables needed in the @code{recode} library.  One of them is to
4755have the tables in an external format and only read them in on demand.
4756After having pondered this for a while, I finally decided against it,
4757mainly because it involves its own kind of installation complexity, and
4758it is not clear to me that it would be as interesting as I first imagined.
4759
4760It looks more efficient to see all tables and algorithms already mapped
4761into virtual memory from the start of the execution, yet not loaded in
4762actual memory, than to go through many disk accesses for opening various
4763data files once the program is already started, as this would be needed
4764with other solutions.  Using a shared library also has the indirect effect
4765of making various algorithms handily available, right in the same modules
providing the tables.  This greatly alleviates the burden of maintenance.
4767
4768Of course, I would like to later make an exception for only a few tables,
4769built locally by users for their own particular needs once @code{recode}
4770is installed.  @code{recode} should just go and fetch them.  But I do not
4771perceive this as very urgent, yet useful enough to be worth implementing.
4772
4773Currently, all tables needed for recoding are precompiled into binaries,
4774and all these binaries are then made into a shared library.  As an initial
step, I turned @code{recode} into a main program and a non-shared library;
this allowed me to tidy up the API, get rid of all global variables, etc.
It required a surprising amount of program source massaging.  But once
this was clean enough, it was easy to use Gordon Matzigkeit's @code{libtool}
4779package, and take advantage of the Automake interface to neatly turn the
4780non-shared library into a shared one.
4781
4782Sites linking with the @code{recode} library, whose system does not
4783support any form of shared libraries, might end up with bulky executables.
4784Surely, the @code{recode} library will have to be used statically, and
might not be very nicely usable on such systems.  It seems that progress
4786has a price for those being slow at it.
4787
4788There is a locality problem I did not address yet.  Currently, the
4789@code{recode} library takes many cycles to initialise itself, calling
4790each module in turn for it to set up associated knowledge about charsets,
4791aliases, elementary steps, recoding weights, etc.  @emph{Then}, the
4792recoding sequence is decided out of the command given.  I would not be
4793surprised if initialisation was taking a perceivable fraction of a second
4794on slower machines.  One thing to do, most probably not right in version
3.5, but the version after, would be to have @code{recode} pre-load all tables
4796and dump them at installation time.  The result would then be compiled and
4797added to the library.  This would spare many initialisation cycles, but more
4798importantly, would avoid calling all library modules, scattered through the
4799virtual memory, and so, possibly causing many spurious page exceptions each
4800time the initialisation is requested, at least once per program execution.
4801
4802@item Why not a central charset?
4803
It would be simpler, and I would like it, if something like @w{ISO 10646} was
4805used as a turning template for all charsets in @code{recode}.  Even if
4806I think it could help to a certain extent, I'm still not fully sure it
4807would be sufficient in all cases.  Moreover, some people disagree about
4808using @w{ISO 10646} as the central charset, to the point I cannot totally
ignore them, and surely, @code{recode} is not a means for me to force my
own opinions on people.  I would like @code{recode} to be practical
4811more than dogmatic, and reflect usage more than religions.
4812
4813Currently, if you ask @code{recode} to go from @var{charset1} to
4814@var{charset2} chosen at random, it is highly probable that the best path
4815will be quickly found as:
4816
4817@example
4818@var{charset1}..@code{UCS-2}..@var{charset2}
4819@end example
4820
4821That is, it will almost always use the @code{UCS} as a trampoline between
charsets.  However, @code{UCS-2} will immediately be optimised out,
4823and @var{charset1}..@var{charset2} will often be performed in a single
step through a permutation table generated on the fly for the
circumstance@footnote{If strict mapping is requested, another efficient device will
4826be used instead of a permutation.}.
4827
4828In those few cases where @code{UCS-2} is not selected as a conceptual
4829intermediate, I plan to study if it could be made so.  But I guess some cases
4830will remain where @code{UCS-2} is not a proper choice.  Even if @code{UCS} is
often a good choice, I do not intend to forcefully restrain @code{recode}
4832around @code{UCS-2} (nor @code{UCS-4}) for now.  We might come to that
4833one day, but it will come out of the natural evolution of @code{recode}.
4834It will then reflect a fact, rather than a preset dogma.
4835
4836@item Why not @code{iconv}?
4837
4838@cindex @code{iconv}
4839The @code{iconv} routine and library allows for converting characters
from an input buffer to an output buffer, synchronously advancing both
4841buffer cursors.  If the output buffer is not big enough to receive
4842all of the conversion, the routine returns with the input cursor set at
4843the position where the conversion could later be resumed, and the output
4844cursor set to indicate until where the output buffer has been filled.
Although this scheme is simple and nice, the @code{recode} library does
4846not offer it currently.  Why not?
4847
4848When long sequences of decodings, stepwise recodings, and re-encodings
are involved, as happens in real life, synchronising the input buffer
4850back to where it should have stopped, when the output buffer becomes full,
4851is a difficult problem.  Oh, we could make it simpler at the expense of
losing space or speed: by inserting markers between each input character
and counting them at the output end; by processing only one character at a
time through the whole sequence; by repeatedly attempting to recode various
4855subsets of the input buffer, binary searching on their length until the
4856output just fits.  The overhead of such solutions looks fully prohibitive
4857to me, and the gain very minimal.  I do not see a real advantage, nowadays,
4858imposing a fixed length to an output buffer.  It makes things so much
simpler and more efficient to just let the output buffer size float a bit.
4860
Of course, if the above problem was solved, the @code{iconv} library
should be easily emulated, given that @code{recode} has similar knowledge
about charsets.  Whether this is solved or not, the @code{iconv}
4864program remains trivial (given similar knowledge about charsets).
4865I also presume that the @code{genxlt} program would be easy too, but
4866I do not have enough detailed specifications of it to be sure.
4867
Many years ago, @code{recode} was using a similar scheme, and I found
4869it rather hard to manage for some cases.  I rethought the overall structure
4870of @code{recode} for getting away from that scheme, and never regretted it.
4871I perceive @code{iconv} as an artificial solution which surely has some
4872elegances and virtues, but I do not find it really useful as it stands: one
4873always has to wrap @code{iconv} into something more refined, extending it
4874for real cases.  From past experience, I think it is unduly hard to fully
implement this scheme.  It would be awkward for us to do contortions for
the sole purpose of implementing exactly its specification, without real,
fairly sound reasons (other than the fact some people once thought it
4878was worth standardising).  It is much better to immediately aim for the
4879refinement we need, without uselessly forcing us into the dubious detour
4880@code{iconv} represents.
4881
4882Some may argue that if @code{recode} was using a comprehensive charset
4883as a turning template, as discussed in a previous point, this would make
4884@code{iconv} easier to implement.  Some may be tempted to say that the
4885cases which are hard to handle are not really needed, nor interesting,
anyway.  I feel, and fear a bit, some pressure for @code{recode} to
be split into the part that fits the @code{iconv} model well, and the part
4888that does not fit, considering this second part less important, with the
4889idea of dropping it one of these days, maybe.  My guess is that users of
4890the @code{recode} library, whatever its form, would not like to have such
4891arbitrary limitations.  In the long run, we should not have to explain
4892to our users that some recodings may not be made available just because
4893they do not fit the simple model we had in mind when we did it.  Instead,
we should try to stay open to the difficulties of real life.  There are
still a lot of complex needs for Asian people, say, that @code{recode}
does not currently address, while it should.  Not only should the doors
stay open, but we should force them wider!
4898@end itemize
4899
4900@node Concept Index, Option Index, Internals, Top
4901@unnumbered Concept Index
4902
4903@printindex cp
4904
4905@node Option Index, Library Index, Concept Index, Top
4906@unnumbered Option Index
4907
4908This is an alphabetical list of all command-line options accepted by
4909@code{recode}.
4910
4911@printindex op
4912
4913@node Library Index, Charset and Surface Index, Option Index, Top
4914@unnumbered Library Index
4915
4916This is an alphabetical index of important functions, data structures,
4917and variables in the @code{recode} library.
4918
4919@printindex fn
4920
4921@node Charset and Surface Index,  , Library Index, Top
4922@unnumbered Charset and Surface Index
4923
4924This is an alphabetical list of all the charsets and surfaces supported
4925by @code{recode}, and their aliases.
4926
4927@printindex tp
4928
4929@contents
4930@bye
4931
4932@c Local Variables:
4933@c texinfo-column-for-description: 24
4934@c End:
4935