Name                   Date         Size        #Lines  LOC
CleanUpTest.php        15-Dec-2021  15.4 KiB       469  380
Makefile               15-Dec-2021  2.7 KiB         73   47
README                 15-Dec-2021  2.2 KiB         56   38
RandomTest.php         15-Dec-2021  3.1 KiB        112   64
Utf8Test.php           15-Dec-2021  4 KiB          152   95
UtfNormal.php          15-Dec-2021  31 KiB         841  527
UtfNormalBench.php     15-Dec-2021  3 KiB          114   68
UtfNormalData.inc      15-Dec-2021  106.8 KiB       14   13
UtfNormalDataK.inc     15-Dec-2021  107.7 KiB       11   10
UtfNormalGenerate.php  15-Dec-2021  7 KiB          245  193
UtfNormalTest.php      15-Dec-2021  8.5 KiB        272  212
UtfNormalUtil.php      15-Dec-2021  4 KiB          160   79

README

This directory contains some Unicode normalization routines. These routines
are meant to be reusable in other projects, so I'm not tying them to the
MediaWiki utility functions.

The main function to care about is UtfNormal::toNFC(); this will convert
a given UTF-8 string to Normalization Form C if it is not already in that
form. The function assumes that the input string is already valid UTF-8; if
there are corrupt characters, it may produce erroneous results.
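
A minimal usage sketch of the call described above, assuming UtfNormal.php
sits in the same directory as the calling script (the path and the sample
string are illustrative only). It passes "e" followed by U+0301 (combining
acute accent) through UtfNormal::toNFC(); in NFC that pair is the single
code point U+00E9.

```php
<?php
// Assumption: UtfNormal.php lives next to this script.
require_once dirname(__FILE__) . '/UtfNormal.php';

// "e" followed by U+0301 COMBINING ACUTE ACCENT, as raw UTF-8 bytes.
$decomposed = "e\xCC\x81";

// The input must already be valid UTF-8; toNFC() does not validate it.
$composed = UtfNormal::toNFC($decomposed);

// NFC composes the pair into U+00E9, whose UTF-8 encoding is C3 A9.
echo bin2hex($composed), "\n";
```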

To also check for illegal characters, use UtfNormal::cleanUp(). This will
strip illegal UTF-8 sequences and characters that are illegal in XML and,
if necessary, convert the string to Normalization Form C.
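
A short sketch of cleaning untrusted input, again assuming UtfNormal.php is
in the same directory as the script. The sample string is made up for
illustration; it carries a stray 0xFF byte, which can never occur in valid
UTF-8. The sketch only dumps the result rather than assuming exactly how the
bad byte is handled.

```php
<?php
// Assumption: UtfNormal.php lives next to this script.
require_once dirname(__FILE__) . '/UtfNormal.php';

// Illustrative untrusted input: a stray 0xFF byte in otherwise plain text.
$untrusted = "abc\xFFdef";

// cleanUp() validates the bytes and normalizes to NFC if needed.
$clean = UtfNormal::cleanUp($untrusted);

// Dump the bytes to see how the bad byte was handled.
echo bin2hex($clean), "\n";
```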

Performance is kind of stinky in absolute terms, though it should be speedy
on pure ASCII text. ;) On text that can be determined quickly to already be
in NFC it's not too awful, but it can quickly get uncomfortably slow,
particularly for Korean text (the Hangul decomposition/composition code is
extra slow).
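
Why pure ASCII is the cheap case: a string with no bytes above 0x7F is
already in NFC, so a single scan can rule out any further work. The sketch
below only illustrates that idea. looksPureAscii() is a hypothetical helper,
not a function from UtfNormal.php, and the library's own quick checks may
differ.

```php
<?php
// Hypothetical helper for illustration; not part of UtfNormal.php.
function looksPureAscii($string) {
    // Any byte in 0x80-0xFF means the string has non-ASCII content.
    return preg_match('/[\x80-\xff]/', $string) === 0;
}

var_dump(looksPureAscii("plain text"));   // ASCII only
var_dump(looksPureAscii("caf\xC3\xA9"));  // contains U+00E9
```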


== Regenerating data tables ==

UtfNormalData.inc and UtfNormalDataK.inc are generated from the Unicode
Character Database by the script UtfNormalGenerate.php. On a *nix system
'make' should fetch the necessary files and regenerate them if the scripts
have been changed or you have removed them.


== Testing ==

'make test' will run the conformance test (UtfNormalTest.php), fetching the
data from the net if necessary. If it reports failure, something is
going wrong!


== Benchmarks ==

Run 'make bench' to download some sample texts from Wikipedia and run some
cheap benchmarks of some of the functions. Take all numbers with large
grains of salt.


== PHP module extension ==

There's an experimental PHP extension module which wraps the ICU library's
normalization functions. This is *MUCH* faster than doing this work in pure
PHP code. This is in the 'normal' directory in MediaWiki's CVS extensions
module. It is known to work with PHP 4.3.8 and 5.0.2 on Linux/x86 but hasn't
been thoroughly tested on other configurations.

If the php_normal.so module is loaded in php.ini, the normalization functions
will automatically use it. If you can't (or don't want to) load it in php.ini,
you may be able to load it using the dl() function before include()ing or
require()ing UtfNormal.php, and it will be picked up.
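
A hedged sketch of the runtime-loading route. tryLoadNormalExtension() is a
hypothetical helper, not part of UtfNormal.php, and the check for a
utf8_normalize() function is an assumption about what the extension exports;
adjust it to whatever your build provides. dl() itself is disabled or
unavailable on many configurations, so the helper reports failure instead of
raising an error.

```php
<?php
// Hypothetical helper for illustration; not part of UtfNormal.php.
function tryLoadNormalExtension($module = 'php_normal.so') {
    // Assumption: the extension exports utf8_normalize().
    if (function_exists('utf8_normalize')) {
        return true; // already loaded, e.g. via php.ini
    }
    // dl() is unavailable or disabled on many configurations.
    if (!function_exists('dl')) {
        return false;
    }
    // Suppress the warning emitted when the module cannot be found.
    return (bool) @dl($module);
}

$loaded = tryLoadNormalExtension();
echo $loaded ? "ICU wrapper loaded\n" : "falling back to pure PHP\n";
```

Call the helper before include()ing or require()ing UtfNormal.php so that
the library can pick the module up.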