README
=== Programs ===
doit.sh -- Regenerates all */*.base and */*.c files from the source one
           (given as the first parameter in */doit.sh); also used by
           */doit.sh to regenerate files in individual directories. Uses
           many of the following scripts.

*/doit.sh -- Customized scripts for individual directories. If a directory
             contains a doit.sh, the main one runs it.

clean.sh -- Removes most auxiliary files from language subdirs.

basetoc.c -- [filter] Converts one .base file to a .c file; used by doit.sh:
             $ ./basetoc <CHARSET.base >CHARSET.c

totals.pl -- Reads generated .c files and computes significance data, weight
             sums and other summary data; writes the file `totals.c':
             $ ./totals.pl CHARSET1.c ... CHARSETn.c

normalize.pl -- [filter] Does an ad hoc weight normalization, useful for
                producing CHARSET.base files, since the weights must fit
                into an unsigned short int:
                $ ./normalize.pl <COUNTS >NORMALIZED_COUNTS
                Given a file on the command line, it normalizes the input to
                have exactly(!) the same weight sum as that file:
                $ ./normalize.pl REFERENCE_COUNTS <COUNTS >RENORMALIZED_COUNTS
                This is not run by doit.sh.
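
The two normalization modes above can be sketched in C. This is not a port
of normalize.pl, whose exact algorithm is not described here; it only shows
one plausible way to do each step. The names scale_to_ushort,
renormalize_sum, NCHARS and WEIGHT_MAX are hypothetical, and the
largest-remainder rounding is an assumption chosen because it hits the
target sum exactly.

```c
#include <stdint.h>

#define NCHARS 256        /* one weight per 8bit character (assumption) */
#define WEIGHT_MAX 65535  /* largest value guaranteed to fit an unsigned short */

/* Scale counts down so that the largest one becomes WEIGHT_MAX.
 * Counts that already fit are copied unchanged.
 * Assumes counts stay below 2^48 so the multiplication cannot overflow. */
void scale_to_ushort(const uint64_t counts[NCHARS], unsigned short out[NCHARS])
{
    uint64_t max = 0;

    for (int i = 0; i < NCHARS; i++)
        if (counts[i] > max)
            max = counts[i];
    for (int i = 0; i < NCHARS; i++)
        out[i] = (unsigned short)(max > WEIGHT_MAX
                                  ? counts[i] * WEIGHT_MAX / max
                                  : counts[i]);
}

/* Rescale counts so that they sum to exactly `target`.
 * Every entry first gets the floor of its proportional share; the few
 * units lost to rounding then go to the entries with the largest
 * remainders.  Returns -1 if all counts are zero, 0 otherwise.
 * Assumes counts[i] * target fits into 64 bits. */
int renormalize_sum(const uint64_t counts[NCHARS], uint64_t target,
                    uint64_t out[NCHARS])
{
    uint64_t total = 0, assigned = 0, rem[NCHARS];

    for (int i = 0; i < NCHARS; i++)
        total += counts[i];
    if (total == 0)
        return -1;
    for (int i = 0; i < NCHARS; i++) {
        out[i] = counts[i] * target / total;
        rem[i] = counts[i] * target % total;
        assigned += out[i];
    }
    while (assigned < target) {
        int best = 0;

        for (int i = 1; i < NCHARS; i++)
            if (rem[i] > rem[best])
                best = i;
        out[best]++;
        rem[best] = 0;   /* each entry receives at most one extra unit */
        assigned++;
    }
    return 0;
}
```

In this sketch the reference weight sum would be computed from
REFERENCE_COUNTS and passed as `target`; reading and writing the count
files is left out because their format is not specified here.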

extreme.pl -- Given two count files, it finds the characters most suitable
              for a hook deciding between the two, i.e. the characters with
              the biggest difference in occurrences:
              $ ./extreme.pl COUNT1 COUNT2
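
As an illustration of "biggest difference in occurrences", here is a
minimal C sketch that ranks characters by the absolute difference of two
count tables. It assumes both tables are already on a comparable scale
(for example renormalized to the same weight sum); whether extreme.pl uses
the plain absolute difference is also an assumption. The names
most_different, absdiff and NCHARS are hypothetical.

```c
#include <stddef.h>

#define NCHARS 256  /* one count per 8bit character (assumption) */

/* Absolute difference of two unsigned values without wrap-around. */
static unsigned long absdiff(unsigned long a, unsigned long b)
{
    return a > b ? a - b : b - a;
}

/* Store in best[0..k-1] the character codes with the largest
 * |a[c] - b[c]|, largest first.  Returns the number of codes stored,
 * which is k capped at NCHARS.  Simple repeated selection; fine for
 * 256 entries. */
size_t most_different(const unsigned long a[NCHARS],
                      const unsigned long b[NCHARS],
                      int *best, size_t k)
{
    char taken[NCHARS] = {0};
    size_t found;

    if (k > NCHARS)
        k = NCHARS;
    for (found = 0; found < k; found++) {
        int top = -1;

        for (int c = 0; c < NCHARS; c++)
            if (!taken[c]
                && (top < 0 || absdiff(a[c], b[c]) > absdiff(a[top], b[top])))
                top = c;
        taken[top] = 1;
        best[found] = top;
    }
    return found;
}
```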

xlt.c -- [filter] Extremely simple charset converter, written to be
         independent of the other, broken converters:
         $ ./xlt SOURCE.map TARGET.map <TEXT >CONVERTED_TEXT
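
The idea of converting through two 8bit -> UCS2 maps can be sketched as
follows. This is not the code of xlt.c. It assumes each map has already
been loaded into a 256-entry array of UCS2 values; the .map file format is
not described here, so no parser is shown. UNDEFINED, build_table, convert
and the fallback character are hypothetical.

```c
#include <stddef.h>

#define NCHARS 256
#define UNDEFINED 0xFFFFu  /* hypothetical marker for "no mapping" */

/* Build a byte -> byte table: source byte s maps to the target byte t
 * whose UCS2 value equals that of s.  Bytes that are undefined in the
 * source map or have no counterpart in the target map become `fallback`. */
void build_table(const unsigned short src[NCHARS],
                 const unsigned short tgt[NCHARS],
                 unsigned char fallback, unsigned char table[NCHARS])
{
    for (int s = 0; s < NCHARS; s++) {
        table[s] = fallback;
        if (src[s] == UNDEFINED)
            continue;
        for (int t = 0; t < NCHARS; t++) {
            if (tgt[t] == src[s]) {
                table[s] = (unsigned char)t;
                break;
            }
        }
    }
}

/* Convert a buffer in place using a table made by build_table(). */
void convert(unsigned char *buf, size_t len, const unsigned char table[NCHARS])
{
    for (size_t i = 0; i < len; i++)
        buf[i] = table[buf[i]];
}
```

A map that sends several bytes to one UCS2 value, or one that deliberately
maps look-alike characters onto each other, changes the result depending
on whether it is used as SOURCE or as TARGET.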

mystrings.c -- [filter] Extracts text chunks from input (strings(1) doesn't
               seem to do a good job on 8bit files):
               $ ./mystrings <FILE | ...

countall.c -- [filter] Counts character frequencies:
              $ ./countall <TEXT >rawcounts.CHARSET
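
Counting character frequencies in 8bit text amounts to a 256-entry
histogram. The following is a minimal sketch of that technique, not the
source of countall.c; the function names are hypothetical, and writing the
rawcounts file is left out because its format is not specified here.

```c
#include <stddef.h>
#include <stdio.h>

#define NCHARS 256

/* Add the byte frequencies of buf to counts (counts is not cleared, so
 * several buffers can be accumulated into one table). */
void count_buffer(const unsigned char *buf, size_t len,
                  unsigned long counts[NCHARS])
{
    for (size_t i = 0; i < len; i++)
        counts[buf[i]]++;
}

/* Count all bytes readable from a stream, chunk by chunk. */
void count_stream(FILE *in, unsigned long counts[NCHARS])
{
    unsigned char chunk[4096];
    size_t got;

    while ((got = fread(chunk, 1, sizeof chunk, in)) > 0)
        count_buffer(chunk, got, counts);
}
```

A filter built on this would zero the table, call count_stream(stdin,
counts) and then print the entries.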

countpair.c -- [filter] Counts 8bit letter pair frequencies and prints a
               table containing as many pairs as are needed to cover 95% of
               all occurrences:
               $ ./countpair CHARSET.letters <TEXT >paircounts.CHARSET
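
The "as many pairs as are needed to cover 95%" selection can be sketched in
C as below. This is an illustration under assumptions, not the code of
countpair.c: it assumes a pair means two adjacent bytes that are both
letters, that the letter set has been loaded into a 256-entry flag table,
and that the most frequent pairs are taken first. struct pair, top_pairs
and by_count_desc are hypothetical names.

```c
#include <stddef.h>
#include <stdlib.h>

#define NCHARS 256
#define NPAIRS (NCHARS * NCHARS)

struct pair {
    unsigned char first, second;
    unsigned long count;
};

/* qsort comparator: larger counts sort first. */
static int by_count_desc(const void *a, const void *b)
{
    const struct pair *p = a, *q = b;

    return (p->count < q->count) - (p->count > q->count);
}

/* Count adjacent letter pairs in text, sort them by descending count in
 * pairs[] (room for NPAIRS entries required) and return how many leading
 * entries are needed to reach `percent` percent of all pair occurrences. */
size_t top_pairs(const unsigned char *text, size_t len,
                 const char is_letter[NCHARS], unsigned percent,
                 struct pair *pairs)
{
    unsigned long total = 0, covered = 0;
    size_t n = 0;

    for (size_t i = 0; i < NPAIRS; i++) {
        pairs[i].first = (unsigned char)(i / NCHARS);
        pairs[i].second = (unsigned char)(i % NCHARS);
        pairs[i].count = 0;
    }
    for (size_t i = 0; i + 1 < len; i++) {
        if (is_letter[text[i]] && is_letter[text[i + 1]]) {
            pairs[text[i] * NCHARS + text[i + 1]].count++;
            total++;
        }
    }
    qsort(pairs, NPAIRS, sizeof *pairs, by_count_desc);
    while (n < NPAIRS && pairs[n].count > 0
           && covered * 100 < (unsigned long)percent * total) {
        covered += pairs[n].count;
        n++;
    }
    return n;
}
```

The pairs[] array holds 65536 entries, so callers should make it static or
allocate it on the heap rather than on the stack.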

findletters.c -- [filter] Finds which 8bit characters from a charset map are
                 letters:
                 $ ./findletters CHARSET.map <Letters >CHARSET.letters

map2letters.sh -- Runs findletters for all charsets in maps/.

=== Data ===
Letters -- Unicode characters assumed to be letters, excluding 7bit ones.
           Also excludes non-European scripts, to keep it small.

maps/ -- 8bit charset -> UCS2 maps; notable ones:
  ibm866-bad.map -- Translates Latin `i' and `I' to Cyrillic 0x0456 and
                    0x0406, and thus approximates them the opposite way when
                    used as TARGET.
  maccyr.map -- Macintosh Cyrillic after Apple unified the Russian and
                Ukrainian variants and added the Euro symbol, in Mac OS 9.0
                or so (recode uses the old Russian maccyr -- FIXME: with
                iconv it doesn't?).
  macce.map -- Macintosh Central European encoding, the real one, not the
               crappy one used by recode.
  koi8u.map -- KOI8-U (Ukrainian) (recode uses some strange mapping?).
  koi8uni.map -- KOI8-Unified.
  koi8ub.map -- KOI8-UB (Ukrainian/Belarusian).
  cork.map -- T1 Cork encoding (recode uses some strange mapping?).
  iso885913.map -- ISO-8859-13 map (recode uses some strange mapping?).

letters/ -- Lists of the 8bit characters that are letters, for various
            charsets (generated); run map2letters.sh to create them.
77