| Name | Date | Size | #Lines | LOC |
|------|------|------|--------|-----|
| BuildLangModelLogs/ | 03-May-2022 | - | | |
| charsets/ | 27-Oct-2020 | - | 1,798 | 648 |
| langs/ | 27-Oct-2020 | - | 1,757 | 253 |
| BuildLangModel.py | 27-Oct-2020 | 20.6 KiB | 532 | 374 |
| README | 27-Oct-2020 | 2.5 KiB | 64 | 45 |
| debug.sh | 27-Oct-2020 | 142 | 10 | 8 |
| gen.sh | 27-Oct-2020 | 1.5 KiB | 28 | 26 |
| header-template.cpp | 27-Oct-2020 | 1.8 KiB | 39 | 1 |
| release.sh | 27-Oct-2020 | 124 | 9 | 7 |
| win32.sh | 27-Oct-2020 | 141 | 7 | 7 |

README

# Supporting new or updating languages #

We generate statistical language data using Wikipedia as a natural
language text resource.

Right now, we have automated scripts only to generate statistical data
for single-byte encodings. Multi-byte encodings usually require more
in-depth knowledge of their specifications.

## New single-byte encoding ##

Uchardet uses language data, so rather than supporting a charset on its
own, we in fact support a (language, charset) pair. For instance, if
uchardet supports (French, ISO-8859-15), it should be able to recognize
French text encoded in ISO-8859-15, but may fail at detecting
ISO-8859-15 for non-supported languages.

Though less flexible, this approach makes uchardet much more accurate
than other detection systems, and also makes it an efficient language
recognition system.
Since many single-byte charsets share the same layout (or very similar
ones), it is impossible to have an accurate single-byte encoding
detector for arbitrary text.

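To make the ambiguity concrete, here is a small Python sketch (not part
of uchardet) that decodes the same bytes with several single-byte codecs
from Python's standard library. The sample word and the codec list are
arbitrary choices for illustration. Every decode succeeds, so byte
validity alone cannot tell the charsets apart, which is why language
statistics are needed.

```python
# Illustration only: the sample text and codec list are arbitrary choices.
sample = "café".encode("iso-8859-15")  # 4 bytes, the last one is 0xE9

codecs_to_try = ["iso-8859-15", "iso-8859-1", "windows-1252",
                 "iso-8859-7", "koi8-r"]


def decode_all(data, codecs):
    """Decode the same bytes with each codec and return {codec: text}."""
    return {codec: data.decode(codec) for codec in codecs}


results = decode_all(sample, codecs_to_try)
for codec, text in results.items():
    # No codec raises an error; they only disagree on what 0xE9 means.
    print(f"{codec:>12}: {text!r}")
```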
Therefore you need to describe the language and the codepoint layout of
every charset you want to add support for.

I recommend having a look at langs/fr.py, which is heavily commented, as
a base for a new language description, and at charsets/windows-1252.py
as a base for a new charset layout (note that charset layouts can be
shared between languages; if yours is already there, you have nothing to
do). The important names in the charset file are:

- `name`: an iconv-compatible name.
- `charmap`: fill it with CTR (control character), SYM (symbol), NUM
             (number), LET (letter), ILL (illegal codepoint).

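As a rough starting point for a new `charmap`, the sketch below drafts
the 256 entries from a codec that Python already knows. It is not how
the existing charset files were produced. The five constants are local
stand-ins defined only for this example (the real ones come from the
uchardet scripts, see charsets/windows-1252.py), and `draft_charmap` is
a hypothetical helper name.

```python
import unicodedata

# Stand-in constants for this sketch only; the real definitions live in
# the uchardet scripts and may use different values.
CTR, SYM, NUM, LET, ILL = "CTR", "SYM", "NUM", "LET", "ILL"


def draft_charmap(codec):
    """Return a 256-entry list classifying every byte value of `codec`."""
    charmap = []
    for byte in range(256):
        try:
            char = bytes([byte]).decode(codec)
        except UnicodeDecodeError:
            charmap.append(ILL)      # byte is undefined in this charset
            continue
        category = unicodedata.category(char)
        if category == "Cc":
            charmap.append(CTR)      # control character
        elif category.startswith("L"):
            charmap.append(LET)      # letter, any case or script
        elif category == "Nd":
            charmap.append(NUM)      # decimal digit
        else:
            charmap.append(SYM)      # punctuation, spaces, other symbols
    return charmap


name = "WINDOWS-1252"                # iconv-compatible name
charmap = draft_charmap("cp1252")
```

Treat the result as a draft to review by hand against the charset
specification: mapping Unicode categories onto the five classes is a
simplification made for this sketch (for example, the superscript digit
`²` ends up as SYM here).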
## Tools ##

You must install Python 3 and the [`Wikipedia` Python
tool](https://github.com/goldsmith/Wikipedia).

## Run script ##

Let's say you added (or modified) support for French (`fr`). Run:

> ./BuildLangModel.py fr --max-page=100 --max-depth=4

The options can be changed to any value. Bigger values mean the script
will process more data, so it takes longer to run, but uchardet may
be more accurate in the end.

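If you regenerate several languages at once, a small wrapper can build
the same command line for each of them. This is only a sketch:
`build_command` is a hypothetical helper, the language list is made up,
and it assumes each code has a matching description under langs/. It
prints the commands rather than running them.

```python
import shlex


def build_command(lang, max_page=100, max_depth=4):
    """Return the BuildLangModel.py command line for one language code."""
    return ["./BuildLangModel.py", lang,
            f"--max-page={max_page}", f"--max-depth={max_depth}"]


# Hypothetical list of language codes, chosen for illustration.
languages = ["fr", "de", "es"]

commands = [build_command(lang) for lang in languages]
for command in commands:
    print(shlex.join(command))
    # To actually run it from this directory, something like:
    # subprocess.run(command, check=True)
```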
## Updating core code ##

If you were only updating data for a language model, you have nothing
else to do. Just build `uchardet` again and test it.

If you were creating new models though, you will have to add them in
src/nsSBCSGroupProber.cpp and src/nsSBCharSetProber.h, and increase the
value of `NUM_OF_SBCS_PROBERS` in src/nsSBCSGroupProber.h.
Finally, add the new file to src/CMakeLists.txt.

I will be looking to make this step more straightforward in the future.