Name                      Date         Size     #Lines  LOC
..                        03-May-2022  -
Data/                     11-May-2020  -        10,781  10,299
benchmark/                26-Sep-2020  -        229     152
test/                     26-Sep-2020  -        247     181
unicode-data/             26-Sep-2020  -        526     415
.gitignore                11-May-2020  294      26      24
.travis.yml               26-Sep-2020  10 KiB   246     202
Changelog.md              11-Oct-2020  698      53      29
LICENSE                   11-May-2020  1.5 KiB  28      22
MAINTAINING.md            26-Sep-2020  210      8       5
NOTES.md                  11-May-2020  2.4 KiB  55      41
README.md                 26-Sep-2020  3.9 KiB  78      66
Setup.hs                  11-May-2020  46       3       2
appveyor.yml              26-Sep-2020  4 KiB    90      78
stack-7.10.yaml           26-Sep-2020  191      9       8
stack-8.0.yaml            11-May-2020  61       6       5
stack.yaml                26-Sep-2020  35       4       3
unicode-transforms.cabal  03-May-2022  6.1 KiB  223     210

README.md

# Unicode Transforms

[![Hackage](https://img.shields.io/hackage/v/unicode-transforms.svg?style=flat)](https://hackage.haskell.org/package/unicode-transforms)
[![Build Status](https://travis-ci.com/composewell/unicode-transforms.svg?branch=master)](https://travis-ci.com/composewell/unicode-transforms)
[![Windows Build status](https://ci.appveyor.com/api/projects/status/5wov8m1m0asvbv32?svg=true)](https://ci.appveyor.com/project/harendra-kumar/unicode-transforms)
[![Coverage Status](https://coveralls.io/repos/composewell/unicode-transforms/badge.svg?branch=master&service=github)](https://coveralls.io/github/composewell/unicode-transforms?branch=master)

Fast Unicode 13.0.0 normalization in Haskell (NFC, NFKC, NFD, NFKD).

## What is normalization?

Unicode characters with adornments (e.g. Á) can be represented in two different
forms: as a single composed character (U+00C1 = Á) or as a sequence of
decomposed characters (U+0041 (A) U+0301 ( ́ ) = Á). The two forms are encoded
as different byte sequences, but to a human reader they look exactly the same.

A plain byte comparison may report that two strings are different even though
they are equivalent. Before comparing them for equivalence, we need to convert
both strings to a [`normalized`](http://unicode.org/reports/tr15/) form using
the [Unicode Character Database](http://www.unicode.org/Public/UCD/latest/).
For example:
```
>> :set -XOverloadedStrings
>> import Data.Text.Normalize
>> normalize NFC "\193" == normalize NFC "\65\769"
True
```
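The same comparison can be written as a small standalone program. This is a minimal sketch, assuming the `text` and `unicode-transforms` packages are available; `canonicallyEqual` is a hypothetical helper introduced here for illustration and is not part of the library.

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Sketch only: assumes the `text` and `unicode-transforms` packages.
import qualified Data.Text as T
import Data.Text.Normalize (NormalizationMode (NFC, NFD), normalize)

-- Hypothetical helper (not part of unicode-transforms): two strings are
-- treated as equal if they match after both are normalized to NFC.
canonicallyEqual :: T.Text -> T.Text -> Bool
canonicallyEqual a b = normalize NFC a == normalize NFC b

main :: IO ()
main = do
  let composed   = "\193"    :: T.Text  -- U+00C1
      decomposed = "\65\769" :: T.Text  -- U+0041 U+0301
  print (composed == decomposed)                -- False: code points differ
  print (canonicallyEqual composed decomposed)  -- True: equal after NFC
  print (T.length (normalize NFD composed))     -- 2: NFD splits U+00C1 in two
```

Normalizing both sides to the same form is what makes the comparison meaningful; any of the four forms would do here, as long as both strings use the same one.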

## Performance

The table below compares the normalization performance of this package
(v0.3.7) with the [text-icu](http://hackage.haskell.org/package/text-icu)
package, which uses the [ICU C++ library](http://site.icu-project.org/download)
(ICU4C 65.1), on macOS. Each benchmark measures the time taken, in
milliseconds, to normalize a file in a given language and normalization form
with each package. A positive `% Diff` means `unicode-transforms` is faster
than ICU, which is the case in 19 of the 28 benchmarks.

```
Benchmark       unicode-transforms(ms) ICU(ms)    % Diff
--------------- ---------------------- -------   --------
NFKD/Korean                       7.78   37.10    +376.87
NFD/Korean                        7.86   37.06    +371.50
NFKD/Vietnamese                   6.85   12.48     +82.20
NFKD/Deutsch                      2.17    3.55     +63.30
NFKD/English                      1.71    2.78     +62.30
NFKC/Korean                       4.77    7.65     +60.28
NFD/Deutsch                       2.24    3.53     +57.41
NFD/English                       1.76    2.77     +57.32
NFC/Vietnamese                   10.66   16.63     +56.00
NFKC/Vietnamese                  10.95   16.58     +51.43
NFD/Devanagari                    6.48    8.68     +34.10
NFC/Devanagari                    6.77    8.49     +25.48
NFD/AllChars                      6.18    7.41     +19.91
NFD/Japanese                      7.80    9.20     +17.99
NFKC/Devanagari                   7.33    8.48     +15.74
NFKD/Japanese                     8.71   10.05     +15.39
NFD/Vietnamese                    5.94    6.83     +14.99
NFKD/Devanagari                   7.59    8.68     +14.27
NFKD/AllChars                     9.80   10.66      +8.82
NFKC/Deutsch                      3.21    3.18      -0.72
NFC/Korean                        4.62    4.38      -5.35
NFKC/English                      2.21    2.06      -6.88
NFC/English                       2.19    2.04      -7.21
NFKC/AllChars                    14.67    9.75     -50.51
NFC/Deutsch                       3.02    1.95     -54.39
NFKC/Japanese                    12.46    5.42    -129.93
NFC/AllChars                      9.72    3.58    -171.63
NFC/Japanese                     11.90    3.04    -292.04
```
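The README does not state how `% Diff` is derived. The sketch below assumes it is the gap between the two timings relative to the faster one, signed so that a positive value means `unicode-transforms` is faster. That assumption roughly reproduces the table, with small deviations presumably because the published values were computed from unrounded timings. `percentDiff` is a hypothetical helper, not part of either package.

```haskell
-- Hypothetical helper; the formula is an assumption inferred from the table.
-- Signed percentage difference relative to the faster (smaller) timing:
-- positive when the first timing is faster, negative when it is slower.
percentDiff :: Double -> Double -> Double
percentDiff ours theirs
  | ours <= theirs = (theirs - ours) / ours * 100
  | otherwise      = negate ((ours - theirs) / theirs * 100)

main :: IO ()
main = do
  -- NFKD/Korean: 7.78 ms vs 37.10 ms, close to the +376.87 in the table
  print (percentDiff 7.78 37.10)
  -- NFC/Japanese: 11.90 ms vs 3.04 ms, close to the -292.04 in the table
  print (percentDiff 11.90 3.04)
```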

## Talks

* [Functional Conf 2018 Video](https://www.youtube.com/watch?v=aJvwORrBJ0o) | [Functional Conf 2018 Slides](https://www.slideshare.net/HarendraKumar10/high-performance-haskell)

## Contributing
Please use https://github.com/harendra-kumar/unicode-transforms to raise
issues or send pull requests.