Name               Date         Size      #Lines  LOC
..                 03-May-2022  -
belarusian/        07-May-2022  -         1,166   1,092
bulgarian/         07-May-2022  -         829     771
chinese/           03-May-2022  -         1,493   1,422
croatian/          07-May-2022  -         388     355
czech/             07-May-2022  -         479     440
estonian/          07-May-2022  -         419     383
hungarian/         07-May-2022  -         394     361
latvian/           07-May-2022  -         430     394
lithuanian/        07-May-2022  -         454     418
maps/              05-Sep-2016  -         6,425   6,400
polish/            07-May-2022  -         527     485
russian/           07-May-2022  -         944     886
slovak/            07-May-2022  -         476     437
slovene/           07-May-2022  -         350     317
ukrainian/         07-May-2022  -         1,036   970
Letters            14-Jan-2010  5.5 KiB   1,118   1,117
Makefile.am        04-Jan-2016  3.1 KiB   146     129
Makefile.in        03-May-2022  23.5 KiB  804     696
README             04-Jan-2016  3.5 KiB   77      60
basetoc.c          14-Jan-2010  744       38      30
clean.sh           04-Jan-2016  306       13      11
countall.c         14-Jan-2010  440       29      21
countpair.c        14-Jan-2010  2.3 KiB   116     92
doit.sh            14-Jan-2010  1.6 KiB   80      67
extreme.pl         14-Jan-2010  924       62      53
findletters.c      14-Jan-2010  466       29      22
makepaircounts.sh  04-Jan-2016  547       25      19
map2letters.sh     14-Jan-2010  160       8       6
mystrings.c        14-Jan-2010  805       46      38
normalize.pl       14-Jan-2010  1.2 KiB   70      58
pairtoc.c          14-Jan-2010  2 KiB     88      73
totals.pl          14-Jan-2010  2.4 KiB   92      77
xlt.c              14-Jan-2010  950       50      37

README

=== Programs ===
doit.sh -- Regenerates all */*.base and */*.c files from the source file
           (given as the first parameter in */doit.sh).  Also used by
           */doit.sh to regenerate files in individual directories.  Uses
           many of the following scripts.

*/doit.sh -- Customized scripts for individual directories.  If a directory
             contains doit.sh, the main one runs it.

clean.sh -- Removes most auxiliary files from language subdirs.

basetoc.c -- [filter] Converts one .base file to a .c file; used by doit.sh:
              $ ./basetoc <CHARSET.base >CHARSET.c

totals.pl -- Reads generated .c files and computes significance data, weight
             sums and other summary data; writes the file `totals.c':
             $ ./totals.pl CHARSET1.c ... CHARSETn.c

normalize.pl -- [filter] Does some kind of funny weight normalization, useful
                for producing CHARSET.base files, since the weights must fit
                into an unsigned short int:
                $ ./normalize.pl <COUNTS >NORMALIZED_COUNTS
                Given a file on the command line, it normalizes the input to
                have exactly(!) the same weight sum as that file:
                $ ./normalize.pl REFERENCE_COUNTS <COUNTS >RENORMALIZED_COUNTS
                This is not run by doit.sh.

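The README does not say how normalize.pl actually scales the weights, so the
following is only a minimal sketch of the constraint it mentions: making raw
counts fit into an unsigned short int.  It assumes plain proportional scaling;
the function name scale_to_ushort is hypothetical and not taken from the
script.

```c
#include <assert.h>
#include <stddef.h>
#include <limits.h>

/* Hypothetical helper, not taken from normalize.pl: scale raw counts
 * proportionally so that the largest one becomes USHRT_MAX.
 * Assumes counts[i] * USHRT_MAX fits into unsigned long long. */
void scale_to_ushort(const unsigned long *counts, unsigned short *weights,
                     size_t n)
{
    unsigned long max = 0;
    size_t i;

    assert(counts != NULL && weights != NULL);
    for (i = 0; i < n; i++)
        if (counts[i] > max)
            max = counts[i];

    for (i = 0; i < n; i++) {
        if (max == 0) {
            weights[i] = 0;     /* all counts are zero */
        } else {
            /* Integer division rounded to the nearest value. */
            unsigned long long scaled =
                (unsigned long long)counts[i] * USHRT_MAX + max / 2;
            weights[i] = (unsigned short)(scaled / max);
        }
    }
}
```

Scaling relative to the maximum keeps the ratios between weights while using
the full range of the target type; it does not preserve the weight sum, which
the REFERENCE_COUNTS mode above is described as doing.
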
extreme.pl -- Given two count files, it finds the characters most suitable for
              a hook deciding between the two, i.e. characters with the
              biggest difference in occurrences:
              $ ./extreme.pl COUNT1 COUNT2

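The idea of "biggest difference in occurrences" can be sketched as follows.
This is an illustration only: it assumes both count files have already been
loaded into 256-entry tables with comparable weight sums, and the name
most_distinctive_byte is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return the byte value whose counts differ most
 * between two 256-entry tables (ties go to the lowest byte value).
 * Assumes both tables are normalized to comparable weight sums. */
int most_distinctive_byte(const unsigned long a[256],
                          const unsigned long b[256])
{
    unsigned long best = 0;
    int best_byte = 0;
    int i;

    assert(a != NULL && b != NULL);
    for (i = 0; i < 256; i++) {
        /* Absolute difference without going through a signed type. */
        unsigned long diff = a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
        if (diff > best) {
            best = diff;
            best_byte = i;
        }
    }
    return best_byte;
}
```

A byte that is frequent in one charset and rare in the other is a good
candidate for telling the two apart.
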
xlt.c -- [filter] Extremely simple charset converter, to be independent of
         other, broken converters:
         $ ./xlt SOURCE.map TARGET.map <TEXT >CONVERTED_TEXT

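A map-based conversion of this kind can be sketched as a lookup through the
source map followed by a reverse lookup in the target map.  The sketch assumes
both maps are already in memory as 256-entry byte -> UCS-2 tables; the map
file format, the fallback behaviour and the name convert_byte are assumptions,
not taken from xlt.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: convert one byte from a source charset to a target
 * charset, given both charsets as 256-entry byte -> UCS-2 tables.
 * Returns the target byte, or `fallback' when the target charset has no
 * byte for that code point. */
unsigned char convert_byte(unsigned char c,
                           const unsigned short src_map[256],
                           const unsigned short tgt_map[256],
                           unsigned char fallback)
{
    unsigned short ucs2;
    int i;

    assert(src_map != NULL && tgt_map != NULL);
    ucs2 = src_map[c];
    /* Linear search keeps the sketch short; a real filter would more
     * likely build the inverse table once. */
    for (i = 0; i < 256; i++)
        if (tgt_map[i] == ucs2)
            return (unsigned char)i;
    return fallback;
}
```

When several target bytes map to the same code point, the lowest byte wins in
this sketch.
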
mystrings.c -- [filter] Extracts text chunks from input (strings(1) doesn't
               seem to do a good job on 8bit files):
               $ ./mystrings <FILE | ...

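A strings(1)-like extraction that also accepts 8bit bytes might look like the
sketch below.  What counts as "text" and the minimum run length are
assumptions here, and the names is_text_byte and extract_text are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed definition of "text": printable ASCII or any byte >= 0x80. */
static int is_text_byte(unsigned char c)
{
    return (c >= 0x20 && c < 0x7f) || c >= 0x80;
}

/* Hypothetical helper: copy every run of at least min_len text bytes
 * from in[0..n) to out[], one run per line.  out[] must hold at least
 * n + 1 bytes.  Returns the number of bytes written. */
size_t extract_text(const unsigned char *in, size_t n, size_t min_len,
                    unsigned char *out)
{
    size_t i = 0, o = 0;

    assert(in != NULL && out != NULL);
    while (i < n) {
        size_t start = i;

        while (i < n && is_text_byte(in[i]))
            i++;
        if (i > start && i - start >= min_len) {
            size_t k;

            for (k = start; k < i; k++)
                out[o++] = in[k];
            out[o++] = '\n';
        }
        if (i < n)
            i++;    /* skip the non-text byte that ended the run */
    }
    return o;
}
```

Treating all bytes from 0x80 up as text is what lets such a filter keep
non-ASCII words intact in 8bit files.
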
countall.c -- [filter] Counts character frequencies:
              $ ./countall <TEXT >rawcounts.CHARSET

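Counting character frequencies in an 8bit text comes down to one table
indexed by byte value.  A minimal sketch, with the hypothetical name
count_bytes and no claim about the output format countall actually writes:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: add the byte frequencies of buf[0..n) to counts[].
 * The caller zeroes counts[] first, so several blocks can be accumulated. */
void count_bytes(const unsigned char *buf, size_t n,
                 unsigned long counts[256])
{
    size_t i;

    assert(buf != NULL && counts != NULL);
    for (i = 0; i < n; i++)
        counts[buf[i]]++;
}
```

A filter would call this on each block read from standard input and print the
table once the input is exhausted.
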
countpair.c -- [filter] Counts 8bit letter pair frequencies and prints a table
               containing as many pairs as needed to cover 95% of all
               occurrences:
               $ ./countpair CHARSET.letters <TEXT >paircounts.CHARSET

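The 95% cutoff can be sketched as: count adjacent letter pairs, sort them by
frequency, and keep the most frequent ones until their cumulative count
reaches the requested share.  This is a sketch under assumptions: the letter
set is passed as a 256-entry flag table, and the names pair_count and
top_pairs are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NPAIRS (256 * 256)

struct pair_count {
    unsigned char first, second;
    unsigned long count;
};

/* qsort comparator: most frequent pair first, ties ordered by byte values. */
static int by_count_desc(const void *pa, const void *pb)
{
    const struct pair_count *a = pa, *b = pb;

    if (a->count != b->count)
        return a->count < b->count ? 1 : -1;
    if (a->first != b->first)
        return (int)a->first - (int)b->first;
    return (int)a->second - (int)b->second;
}

/* Hypothetical helper: count adjacent letter pairs in text[0..n) and store
 * the most frequent ones in out[] (capacity NPAIRS) until they cover at
 * least `percent' percent of all pair occurrences.  is_letter[] flags the
 * bytes that count as letters.  Returns the number of pairs stored, or 0
 * when memory is short or no pair was found. */
size_t top_pairs(const unsigned char *text, size_t n,
                 const unsigned char is_letter[256],
                 unsigned percent, struct pair_count *out)
{
    unsigned long *counts, total = 0, covered = 0;
    size_t i, npairs = 0, kept = 0;

    assert(text != NULL && is_letter != NULL && out != NULL);
    counts = calloc(NPAIRS, sizeof *counts);
    if (counts == NULL)
        return 0;

    for (i = 1; i < n; i++) {
        if (is_letter[text[i - 1]] && is_letter[text[i]]) {
            counts[text[i - 1] * 256 + text[i]]++;
            total++;
        }
    }
    /* Collect the pairs that actually occurred. */
    for (i = 0; i < NPAIRS; i++) {
        if (counts[i]) {
            out[npairs].first = (unsigned char)(i / 256);
            out[npairs].second = (unsigned char)(i % 256);
            out[npairs].count = counts[i];
            npairs++;
        }
    }
    free(counts);

    qsort(out, npairs, sizeof *out, by_count_desc);
    /* Keep pairs until the covered share reaches the requested percentage. */
    while (kept < npairs && covered * 100 < (unsigned long)percent * total)
        covered += out[kept++].count;
    return kept;
}
```

Cutting the table at a fixed share drops the long tail of rare pairs, which
keeps the resulting table small.
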
findletters.c -- [filter] Finds which 8bit characters from a charset map are
                 letters:
                 $ ./findletters CHARSET.map <Letters >CHARSET.letters

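Finding the letters of a charset amounts to intersecting its map with the
list of letter code points.  A minimal sketch, assuming the map is a 256-entry
byte -> UCS-2 table and the Letters file has been read into an array; the name
find_letters is hypothetical and the on-disk formats are not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: flag every 8bit byte (0x80..0xff) whose code point
 * in map[] appears in letters[0..nletters).  Returns how many bytes were
 * flagged. */
size_t find_letters(const unsigned short map[256],
                    const unsigned short *letters, size_t nletters,
                    unsigned char is_letter[256])
{
    size_t found = 0, j;
    int c;

    assert(map != NULL && letters != NULL && is_letter != NULL);
    for (c = 0; c < 256; c++) {
        is_letter[c] = 0;
        if (c < 0x80)
            continue;   /* only 8bit characters are of interest here */
        for (j = 0; j < nletters; j++) {
            if (map[c] == letters[j]) {
                is_letter[c] = 1;
                found++;
                break;
            }
        }
    }
    return found;
}
```

The resulting flag table is the in-memory counterpart of a CHARSET.letters
file.
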
map2letters.sh -- Runs findletters for all charsets in maps/.

=== Data ===
Letters -- Unicode characters assumed to be letters, excluding 7bit ones.
           Also excluding non-European scripts, to keep it small.

maps/ -- 8bit charset -> UCS2 maps, notable ones:
  ibm866-bad.map -- Translates Latin `i' and `I' to Cyrillic 0x0456 and 0x0406,
                    thus approximates them the opposite way when used as
                    TARGET.
  maccyr.map -- Macintosh Cyrillic after Apple unified the Russian and
                Ukrainian variants and added the Euro symbol, in
                Mac OS 9.0 or so (recode uses the old Russian maccyr -- FIXME
                with iconv it doesn't?).
  macce.map -- Macintosh Central European encoding, the real one, not the
               crappy one used by recode.
  koi8u.map -- KOI8-U (Ukrainian) (recode uses some strange mapping?).
  koi8uni.map -- KOI8-Unified.
  koi8ub.map -- KOI8-UB (Ukrainian/Belarusian).
  cork.map -- T1 Cork encoding (recode uses some strange mapping?).
  iso885913.map -- ISO-8859-13 map (recode uses some strange mapping?).

letters/ -- lists of 8bit characters that are letters (generated) for various
            charsets; run map2letters.sh to create them