• Home
  • History
  • Annotate
Name Date Size #Lines LOC

..03-May-2022-

build-aux/H29-Jan-2021-16,61512,356

datrie/H03-May-2022-5,7233,193

doc/H03-May-2022-2,8732,251

m4/H29-Jan-2021-9,0828,210

man/H03-May-2022-676584

tests/H03-May-2022-3,3132,522

tools/H03-May-2022-1,3131,087

AUTHORSH A D23-Jan-202149 21

COPYINGH A D23-Jan-202125.9 KiB511422

ChangeLogH A D29-Jan-202175.3 KiB2,2601,563

INSTALLH A D04-Jan-202115.4 KiB369287

Makefile.amH A D23-Jan-2021254 1510

Makefile.inH A D03-May-202228.3 KiB906803

NEWSH A D29-Jan-20215.5 KiB175152

READMEH A D23-Jan-20211.2 KiB3424

README.migrationH A D23-Jan-20217.2 KiB183135

VERSIONH A D29-Jan-20217 21

aclocal.m4H A D24-Jan-202147.5 KiB1,3161,199

config.h.inH A D24-Jan-20212.1 KiB8355

configureH A D29-Jan-2021446.5 KiB15,27112,795

configure.acH A D24-Jan-20214.3 KiB146125

datrie-0.2.pc.inH A D23-Jan-2021205 119

README

1datrie - Double-Array Trie Library
2==================================
3
4This is an implementation of double-array structure for representing trie,
5as proposed by Junichi Aoe [1].
6
7Trie is a kind of digital search tree, an efficient indexing method in which
8search time is independent of database size. It only takes O(m) search time,
9where m is the length of the search string. Comparably as efficient as hashing,
10trie also provides flexibility on incremental matching and key spelling
11manipulation. This makes it ideal for lexical analyzers, as well as spelling
12dictionaries.
13
14See the details of the implementation at [2]:
15  https://linux.thai.net/~thep/datrie/datrie.html
16
17Historically, this was first implemented as C++ classes in a library called
18midatrie [2], but later simplified and rewritten from scratch in C.
19
20--
21Theppitak Karoonboonyanan.
22September 2006.
23
24References
25----------
26
27[1] Aoe, J. "An Efficient Digital Search Algorithm by Using a Double-Array
28    Structure". IEEE Transactions on Software Engineering. Vol. 15, 9
29    (Sep 1989). pp. 1066-1077.
30
31[2] Karoonboonyanan, T. "An Implementation of Double-Array Trie".
32    https://linux.thai.net/~thep/datrie/datrie.html
33
34

README.migration

1MIGRATION FROM 0.1.X TO 0.2.X
2
30.2.x breaks 0.1.x interoperability in many ways, to allow more use cases, and
4to provide more storage capacity.
5
61. Binary Data Changes
7
81.1 All Trie Data in Single File
9
10No more splitting of a trie into '{trie-name}.sbm', '{trie-name}.br' and
11'{trie-name}.tl'. All parts are now stored in a single file, '{trie-name}.tri'.
12
13Note, however, that a '{trie-name}.abm' (a renamed version of '{trie-name}.sbm'
14after Unicode support) is still needed on first creation. But once created,
15the '{trie-name}.tri' will incorporate the alphabet map data, and no
16'{trie-name}.abm' is required in later uses. It will even be ignored if exists.
17
181.2 32-Bit Node Index
19
20To accommodate larger word lists, trie node indices are now 32 bits, instead of
2116 bits. This means 32,767 times capacity compared to the old format.
22Therefore, the data size are doubled in general when migrating from old format,
23but it can now hold exponentially more entries.
24
25In addition, the tail block lengths are now 16 bits, instead of 8 bits, making
26it possible to store longer suffixes, for dictionaries of extremely long words.
27
281.3 No Backward Compatibility
29
30For simplicity of the code, it was decided not to read/write old format files.
31If you still prefer using the old format, just stay with the old version. If
32you like to gain more support from the new version, you can migrate your old
33data by first dumping your dictionary with 0.1.x trietool into text file and
34then creating the new dictionary with the dumped word list. Or if you already
35have the word list, that makes things a lot easier. Just create the dictionary
36with the new trietool.
37
38Data Migration Steps:
39
40  a. If you have the word list source, just skip to next step. Otherwise, you
41     can dump the old data with 0.1.x trietool:
42
43       $ trietool {trie-name} list > words.lst
44
45  b. Prepare '{trie-name}.abm', listing ranges of characters used in the word
46     list, in terms of Unicode values. For example, for an English and Thai
47     dictionary:
48
49       [0x0041,0x005a]
50       [0x0061,0x007a]
51       [0x0e01,0x0e3a]
52       [0x0e40,0x0e4e]
53
54  c. Generate new trie with 0.2.x trietool-0.2. For example:
55
56       $ trietool-0.2 {trie-name} add-list -e TIS-620 words.lst
57
58     In this example, the '-e TIS-620' indicates that the 'words.lst' file
59     contains TIS-620 encoded text, which is most likely for word lists dumped
60     from the old trie with 8-bit Thai character code as the key encoding.
61     Replace it with your old encoding as necessary, such as ISO-8859-1 or the
62     like. If '-e' option is omitted, current locale encoding is assumed.
63     See trietool-0.2 man page for details.
64
652. API Changes
66
672.1 Non-File Trie Usage
68
69In datrie 0.1.x, every trie was associated with a set of files. Now, this is
70not only reduced to a single file, but zero file is also possible. That is, a
71new trie can be created in memory, added words, removed words, queried words,
72and then disposed without writing data to any file. Meanwhile, saving to file
73is still possible.
74
75  Scenario 1: Loading trie from file, using it read-only.
76    1a. Open trie with trie_new_from_file(path).
77    1b. Use it.
78    1c. On exit:
79        - Close it with trie_free().
80
81  Scenario 2: Loading trie from file, updating file when finished.
82    2a. Open trie with trie_new_from_file(path).
83    2b. Use/update it.
84    2c. On exit:
85        - If trie_is_dirty(), then trie_save().
86        - Close it with trie_free().
87
88  Scenario 3: Create a new trie, saving it when finished.
89    3a. Prepare an alphabet map:
90        - Create new alphabet map with alpha_map_new().
91        - Add ranges with alpha_map_add_range().
92    3b. Create new trie with trie_new(alpha_map).
93    3c. Free the alphabet map with alpha_map_free().
94    3d. Use/update the trie.
95    3e. On exit:
96        - If trie_is_dirty(), then trie_save().
97        - Close the trie with trie_free().
98
99  Scenario 4: Create temporary trie, disposing it when finished.
100    4a. Prepare an alphabet map:
101        - Create new alphabet map with alpha_map_new().
102        - Add ranges with alpha_map_add_range().
103    4b. Create new trie with trie_new(alpha_map).
104    4c. Free the alphabet map with alpha_map_free().
105    4d. Use/update the trie.
106    4e. On exit:
107        - Close the trie with trie_free().
108
1092.2 No More SBTrie
110
111In datrie 0.1.x, SBTrie provided a wrapper to Trie implementation, converting
112between real character codes and trie internal codes. This was for compactness,
113as continuous character code range can cause more compact sparse table
114allocation, while the real alphabet set needs not be continuous. However, in
115datrie 0.2.x, this mapping feature has been merged into Trie class, to reduce
116call layers. So, there is no SBTrie any more. You can call Trie directly in the
117same way you called SBTrie in 0.1.x.
118
1192.3 Characters are Now Unicode
120
121datrie was previously planned to support multiple kinds of character encodings,
122with only single-byte encoding as the available implementation for the time
123being.
124
125However, as there have been many requests for Unicode support, it seems to be
126the most useful choice, into which all other encodings can be converted.
127
128Furthermore, as datrie is mostly used in program's critical path, having too
129many layers can contribute to being a bottleneck. So, only Unicode is accepted
130in this version. It's now the application's duty to convert its keys into
131Unicode before passing them to datrie. This should also allow any kind of
132possible caching.
133
1342.4 New Public APIs for Alphabet Map
135
136As AlphaMap (alphabet map) is now necessary for creating a new empty trie, the
137APIs for manipulating this data is now exposed to the public scope. See
138<datrie/alpha-map.h> for the details.
139
1402.5 Extensions to TrieState
141
142trie_state_copy()
143
144  As part of performance profiling, allocating and freeing TrieState is found
145  to eat up CPU time at some degree. So, reusing existing TrieState where
146  possible does help. This function is added for copying TrieState data, as a
147  better alternative than trie_state_clone().
148
149trie_state_is_single()
150
151  Sometimes, checking if a TrieState is a leaf state is too expensive for
152  program's critical path. It needs to check both whether the state is in a
153  non-branching path, that is, whether it is in a suffix node, and whether it
154  can be walked by a terminator. When a program only needs to check for the
155  former fact and not the latter, this method is at disposal.
156
1573. Changes to TrieTool
158
1593.1 Renaming
160
161To allow co-existence with 0.1.x trietool, 0.2.x trietool is named
162trietool-0.2.
163
1643.2 '*.abm' Instead of '*.sbm'
165
166As SBTrie has been eliminated in datrie 0.2.x, the corresponding '*.sbm'
167(single-byte map) input file is also obsoleted. It is now renamed to '*.abm'
168(alphabet map). Its format is also redefined to be Unicode-based. All alphabet
169character ranges are defined in Unicode.
170
171Besides, the '*.abm' file is required only once at trie creation time. It is
172not needed at deployment, as the alphabet map is already included in the single
173trie file.
174
1753.3 Encoding Conversion Support
176
177As datrie is now Unicode-based, conversion from other encodings can be useful.
178This is possible for word list operations, namely add-list and delete-list, by
179the additional '-e {enc}' or '--encoding {enc}' option. This option specifies
180the character encoding of the word list file. And trietool-0.2 will convert the
181contents to Unicode on-the-fly.
182
183