COMBINE_TESSDATA(1)
===================

NAME
----
combine_tessdata - combine/extract/overwrite/list/compact Tesseract data

SYNOPSIS
--------
*combine_tessdata* ['OPTION'] 'FILE'...

DESCRIPTION
-----------
combine_tessdata(1) is the main program for combining, extracting,
overwriting, listing, and compacting the tessdata components in
[lang].traineddata files.

To combine all the individual tessdata components (unicharset, DAWGs,
classifier templates, ambiguities, language configs) located at, say,
/home/$USER/temp/eng.* run:

  combine_tessdata /home/$USER/temp/eng.

The result will be a combined tessdata file, /home/$USER/temp/eng.traineddata.

Specify option -e to extract individual components from a combined
traineddata file. For example, to extract the language config file and
the unicharset from tessdata/eng.traineddata run:

  combine_tessdata -e tessdata/eng.traineddata \
    /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset

The config file and unicharset will be written to
/home/$USER/temp/eng.config and /home/$USER/temp/eng.unicharset.

Specify option -o to overwrite individual components of the given
[lang].traineddata file. For example, to overwrite the language config
and unichar ambiguities files in tessdata/eng.traineddata use:

  combine_tessdata -o tessdata/eng.traineddata \
    /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharambigs

As a result, tessdata/eng.traineddata will contain the new language config
and unichar ambigs, plus all the original DAWGs, classifier templates, etc.

Note: the names of the files to extract to and to overwrite from should
have the file suffix (extension) that identifies their tessdata
component type (.unicharset for the unicharset, .unicharambigs for unichar
ambigs, etc). See the k*FileSuffix variables in ccutil/tessdatamanager.h.
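
As a rough illustration of the suffix convention, the sketch below prints
the component suffix of a file name. The helper name component_suffix is
hypothetical and not part of Tesseract, and it assumes the suffix is whatever
follows the last period in the base name.

```shell
# Hypothetical helper, not part of Tesseract: print the component suffix
# of a tessdata file name (assumed to be the text after the last period).
component_suffix() {
  base=${1##*/}                # strip any leading directories
  printf '%s\n' "${base##*.}"  # keep only what follows the last period
}

component_suffix "/home/$USER/temp/eng.unicharset"
component_suffix tessdata/eng.lstm-unicharset
```

A check like this can be run over a list of files before passing them to
-e or -o, to catch a missing or misspelled suffix early.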

Specify option -u to unpack all the components to the specified path:

  combine_tessdata -u tessdata/eng.traineddata /home/$USER/temp/eng.

This will create /home/$USER/temp/eng.* files with the individual tessdata
components from tessdata/eng.traineddata.

OPTIONS
-------

*-c* '.traineddata' 'FILE'...:
    Compacts the LSTM component in the .traineddata file to int.

*-d* '.traineddata' 'FILE'...:
    Lists the directory of components in the .traineddata file.

*-e* '.traineddata' 'FILE'...:
    Extracts the specified components from the .traineddata file.

*-l* '.traineddata' 'FILE'...:
    Lists the network information.

*-o* '.traineddata' 'FILE'...:
    Overwrites the specified components of the .traineddata file
    with those provided on the command line.

*-u* '.traineddata' 'PATHPREFIX':
    Unpacks the .traineddata using the provided prefix.
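
The two listing options can be combined into a quick inspection step. The
sketch below is only an illustration: the function name inspect_traineddata
is hypothetical, and it assumes that -d and -l can be invoked with just the
.traineddata path.

```shell
# Hypothetical wrapper, not part of Tesseract: show what a traineddata
# file contains before extracting or overwriting anything.
inspect_traineddata() {
  file=$1
  if [ ! -f "$file" ]; then
    echo "no such file: $file" >&2
    return 1
  fi
  combine_tessdata -d "$file"   # list the directory of components
  combine_tessdata -l "$file"   # list the network information
}

# Example call (path taken from the examples in DESCRIPTION):
# inspect_traineddata tessdata/eng.traineddata
```

Checking that the file exists first keeps a mistyped path from being
mistaken for a problem with the traineddata file itself.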

CAVEATS
-------
'Prefix' refers to the full file prefix, including the period (.).
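
Because a forgotten trailing period is an easy mistake, a script can
normalize the prefix before calling the tool. This is a minimal sketch; the
helper name ensure_prefix is hypothetical and not part of Tesseract.

```shell
# Hypothetical helper, not part of Tesseract: make sure a path prefix
# ends with the period that combine_tessdata expects.
ensure_prefix() {
  case $1 in
    *.) printf '%s\n' "$1" ;;   # already ends with a period
    *)  printf '%s.\n' "$1" ;;  # append the missing period
  esac
}

PREFIX=$(ensure_prefix "/home/$USER/temp/eng")
echo "$PREFIX"
```

The resulting value can then be used as the prefix argument when combining
components or when unpacking with -u.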

COMPONENTS
----------
The components in a Tesseract lang.traineddata file as of
Tesseract 4.0 are briefly described below. For more information on
many of these files, see
<https://tesseract-ocr.github.io/tessdoc/Training-Tesseract.html>
and
<https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html>.

lang.config::
  (Optional) Language-specific overrides to default config variables.
  For 4.0 traineddata files, lang.config provides control parameters which
  can affect layout analysis and sub-languages.

lang.unicharset::
  (Required - 3.0x legacy tesseract) The list of symbols that Tesseract recognizes, with properties.
  See unicharset(5).

lang.unicharambigs::
  (Optional - 3.0x legacy tesseract) This file contains information on pairs of recognized symbols
  which are often confused. For example, 'rn' and 'm'.

lang.inttemp::
  (Required - 3.0x legacy tesseract) Character shape templates for each unichar. Produced by
  mftraining(1).

lang.pffmtable::
  (Required - 3.0x legacy tesseract) The number of features expected for each unichar.
  Produced by mftraining(1) from *.tr* files.

lang.normproto::
  (Required - 3.0x legacy tesseract) Character normalization prototypes generated by cntraining(1)
  from *.tr* files.

lang.punc-dawg::
  (Optional - 3.0x legacy tesseract) A dawg made from punctuation patterns found around words.
  The "word" part is replaced by a single space.

lang.word-dawg::
  (Optional - 3.0x legacy tesseract) A dawg made from dictionary words from the language.

lang.number-dawg::
  (Optional - 3.0x legacy tesseract) A dawg made from tokens which originally contained digits.
  Each digit is replaced by a space character.

lang.freq-dawg::
  (Optional - 3.0x legacy tesseract) A dawg made from the most frequent words which would have
  gone into word-dawg.

lang.fixed-length-dawgs::
  (Optional - 3.0x legacy tesseract) Several dawgs of different fixed lengths -- useful for
  languages like Chinese.

lang.shapetable::
  (Optional - 3.0x legacy tesseract) When present, a shapetable is an extra layer between the character
  classifier and the word recognizer that allows the character classifier to
  return a collection of unichar ids and fonts instead of a single unichar-id
  and font.

lang.bigram-dawg::
  (Optional - 3.0x legacy tesseract) A dawg of word bigrams where the words are separated by a space
  and each digit is replaced by a '?'.

lang.unambig-dawg::
  (Optional - 3.0x legacy tesseract) .

lang.params-model::
  (Optional - 3.0x legacy tesseract) .

lang.lstm::
  (Required - 4.0 LSTM) Neural net trained recognition model generated by lstmtraining.

lang.lstm-punc-dawg::
  (Optional - 4.0 LSTM) A dawg made from punctuation patterns found around words.
  The "word" part is replaced by a single space. Uses lang.lstm-unicharset.

lang.lstm-word-dawg::
  (Optional - 4.0 LSTM) A dawg made from dictionary words from the language.
  Uses lang.lstm-unicharset.

lang.lstm-number-dawg::
  (Optional - 4.0 LSTM) A dawg made from tokens which originally contained digits.
  Each digit is replaced by a space character. Uses lang.lstm-unicharset.

lang.lstm-unicharset::
  (Required - 4.0 LSTM) The unicode character set that Tesseract recognizes, with properties.
  The same unicharset must be used to train the LSTM and to build the lstm-*-dawg files.

lang.lstm-recoder::
  (Required - 4.0 LSTM) Unicharcompress, aka the recoder, which maps the unicharset
  further to the codes actually used by the neural network recognizer. This is created as
  part of the starter traineddata by combine_lang_model.

lang.version::
  (Optional) Version string for the traineddata file.
  First appeared in version 4.0 of Tesseract.
  Older traineddata files will report Version:Pre-4.0.0.
  4.0 traineddata files may include the network spec
  used for LSTM training as part of the version string.
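
To read the version string without unpacking everything, the -e option can
be pointed at a single output file. The sketch below is an illustration
under stated assumptions: the function name print_traineddata_version is
hypothetical, and the .version suffix is assumed to be the one defined
among the k*FileSuffix variables in ccutil/tessdatamanager.h.

```shell
# Hypothetical wrapper, not part of Tesseract: extract only the version
# component into a temporary directory and print it.
print_traineddata_version() {
  traineddata=$1
  if [ ! -f "$traineddata" ]; then
    echo "no such file: $traineddata" >&2
    return 1
  fi
  tmpdir=$(mktemp -d) || return 1
  # The ".version" suffix is an assumption about the component's file suffix.
  combine_tessdata -e "$traineddata" "$tmpdir/eng.version" \
    && cat "$tmpdir/eng.version"
  status=$?
  rm -rf "$tmpdir"              # clean up the temporary directory
  return $status
}

# Example call (path taken from the examples in DESCRIPTION):
# print_traineddata_version tessdata/eng.traineddata
```

Extracting into a temporary directory avoids leaving stray component files
next to the traineddata file.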

HISTORY
-------
combine_tessdata(1) first appeared in version 3.00 of Tesseract.

SEE ALSO
--------
tesseract(1), wordlist2dawg(1), cntraining(1), mftraining(1), unicharset(5),
unicharambigs(5)

COPYING
-------
Copyright \(C) 2009, Google Inc.
Licensed under the Apache License, Version 2.0

AUTHOR
------
The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).