COMBINE_TESSDATA(1)
===================

NAME
----
combine_tessdata - combine/extract/overwrite/list/compact Tesseract data

SYNOPSIS
--------
*combine_tessdata* ['OPTION'] 'FILE'...

DESCRIPTION
-----------
combine_tessdata(1) is the main program to combine/extract/overwrite/list/compact
tessdata components in [lang].traineddata files.

To combine all the individual tessdata components (unicharset, DAWGs,
classifier templates, ambiguities, language configs) located at, say,
/home/$USER/temp/eng.* run:

  combine_tessdata /home/$USER/temp/eng.

The result will be a combined tessdata file /home/$USER/temp/eng.traineddata

Specify option -e if you would like to extract individual components
from a combined traineddata file. For example, to extract the language config
file and the unicharset from tessdata/eng.traineddata run:

  combine_tessdata -e tessdata/eng.traineddata \
    /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset

The desired config file and unicharset will be written to
/home/$USER/temp/eng.config and /home/$USER/temp/eng.unicharset

Specify option -o to overwrite individual components of the given
[lang].traineddata file. For example, to overwrite the language config
and unichar ambiguities files in tessdata/eng.traineddata use:

  combine_tessdata -o tessdata/eng.traineddata \
    /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharambigs

As a result, tessdata/eng.traineddata will contain the new language config
and unichar ambigs, plus all the original DAWGs, classifier templates, etc.

Note: the names of the files to extract to and to overwrite from should
have the appropriate file suffixes (extensions) indicating their tessdata
component type (.unicharset for the unicharset, .unicharambigs for unichar
ambigs, etc). See the k*FileSuffix variables in ccutil/tessdatamanager.h.
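
The suffix requirement in the note above can be checked before calling
combine_tessdata. The following is a minimal sketch, not part of Tesseract:
the helper names component_suffix and has_known_suffix are hypothetical, and
the suffix list is assumed from the COMPONENTS section of this page rather
than read from ccutil/tessdatamanager.h, which remains the authoritative
source.

```shell
# Hypothetical helper (not part of Tesseract): print the component suffix of
# a file name, i.e. the text after the last period, or an empty line if the
# name contains no period.
component_suffix() {
    name=${1##*/}                      # drop any leading directory
    case $name in
        *.*) printf '%s\n' "${name##*.}" ;;
        *)   printf '\n' ;;
    esac
}

# Assumption: suffixes copied from the COMPONENTS section of this page.
KNOWN_SUFFIXES="config unicharset unicharambigs inttemp pffmtable normproto
punc-dawg word-dawg number-dawg freq-dawg fixed-length-dawgs shapetable
bigram-dawg unambig-dawg params-model lstm lstm-punc-dawg lstm-word-dawg
lstm-number-dawg lstm-unicharset lstm-recoder version"

# Return 0 if the file name ends in one of the suffixes above, 1 otherwise.
has_known_suffix() {
    suffix=$(component_suffix "$1")
    for known in $KNOWN_SUFFIXES; do
        [ "$suffix" = "$known" ] && return 0
    done
    return 1
}
```

A wrapper script could call has_known_suffix on each component path and
skip or report any file that fails the check before passing the remaining
paths to combine_tessdata -e or combine_tessdata -o.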

Specify option -u to unpack all the components to the specified path:

  combine_tessdata -u tessdata/eng.traineddata /home/$USER/temp/eng.

This will create /home/$USER/temp/eng.* files with individual tessdata
components from tessdata/eng.traineddata.

OPTIONS
-------

*-c* '.traineddata' 'FILE'...:
    Compacts the LSTM component in the .traineddata file to int.

*-d* '.traineddata' 'FILE'...:
    Lists the directory of components in the .traineddata file.

*-e* '.traineddata' 'FILE'...:
    Extracts the specified components from the .traineddata file.

*-l* '.traineddata' 'FILE'...:
    Lists the network information.

*-o* '.traineddata' 'FILE'...:
    Overwrites the specified components of the .traineddata file
    with those provided on the command line.

*-u* '.traineddata' 'PATHPREFIX':
    Unpacks the .traineddata using the provided prefix.

CAVEATS
-------
'Prefix' refers to the full file prefix, including the period (.).


COMPONENTS
----------
The components in a Tesseract lang.traineddata file as of
Tesseract 4.0 are briefly described below; for more information on
many of these files, see
<https://tesseract-ocr.github.io/tessdoc/Training-Tesseract.html>
and
<https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html>

lang.config::
    (Optional) Language-specific overrides to default config variables.
    For 4.0 traineddata files, lang.config provides control parameters which
    can affect layout analysis and sub-languages.

lang.unicharset::
    (Required - 3.0x legacy tesseract) The list of symbols that Tesseract recognizes, with properties.
    See unicharset(5).

lang.unicharambigs::
    (Optional - 3.0x legacy tesseract) This file contains information on pairs of recognized symbols
    which are often confused. For example, 'rn' and 'm'.

lang.inttemp::
    (Required - 3.0x legacy tesseract) Character shape templates for each unichar.
    Produced by mftraining(1).

lang.pffmtable::
    (Required - 3.0x legacy tesseract) The number of features expected for each unichar.
    Produced by mftraining(1) from *.tr* files.

lang.normproto::
    (Required - 3.0x legacy tesseract) Character normalization prototypes generated by cntraining(1)
    from *.tr* files.

lang.punc-dawg::
    (Optional - 3.0x legacy tesseract) A dawg made from punctuation patterns found around words.
    The "word" part is replaced by a single space.

lang.word-dawg::
    (Optional - 3.0x legacy tesseract) A dawg made from dictionary words from the language.

lang.number-dawg::
    (Optional - 3.0x legacy tesseract) A dawg made from tokens which originally contained digits.
    Each digit is replaced by a space character.

lang.freq-dawg::
    (Optional - 3.0x legacy tesseract) A dawg made from the most frequent words which would have
    gone into word-dawg.

lang.fixed-length-dawgs::
    (Optional - 3.0x legacy tesseract) Several dawgs of different fixed lengths -- useful for
    languages like Chinese.

lang.shapetable::
    (Optional - 3.0x legacy tesseract) When present, a shapetable is an extra layer between the character
    classifier and the word recognizer that allows the character classifier to
    return a collection of unichar ids and fonts instead of a single unichar-id
    and font.

lang.bigram-dawg::
    (Optional - 3.0x legacy tesseract) A dawg of word bigrams where the words are separated by a space
    and each digit is replaced by a '?'.

lang.unambig-dawg::
    (Optional - 3.0x legacy tesseract) .

lang.params-model::
    (Optional - 3.0x legacy tesseract) .

lang.lstm::
    (Required - 4.0 LSTM) Neural net trained recognition model generated by lstmtraining.

lang.lstm-punc-dawg::
    (Optional - 4.0 LSTM) A dawg made from punctuation patterns found around words.
    The "word" part is replaced by a single space.
    Uses lang.lstm-unicharset.

lang.lstm-word-dawg::
    (Optional - 4.0 LSTM) A dawg made from dictionary words from the language.
    Uses lang.lstm-unicharset.

lang.lstm-number-dawg::
    (Optional - 4.0 LSTM) A dawg made from tokens which originally contained digits.
    Each digit is replaced by a space character. Uses lang.lstm-unicharset.

lang.lstm-unicharset::
    (Required - 4.0 LSTM) The Unicode character set that Tesseract recognizes, with properties.
    The same unicharset must be used to train the LSTM and build the lstm-*-dawgs files.

lang.lstm-recoder::
    (Required - 4.0 LSTM) Unicharcompress, aka the recoder, which maps the unicharset
    further to the codes actually used by the neural network recognizer. This is created as
    part of the starter traineddata by combine_lang_model.

lang.version::
    (Optional) Version string for the traineddata file.
    First appeared in version 4.0 of Tesseract.
    Older traineddata files report Version:Pre-4.0.0.
    4.0 traineddata files may include the network spec
    used for LSTM training as part of the version string.

HISTORY
-------
combine_tessdata(1) first appeared in version 3.00 of Tesseract.

SEE ALSO
--------
tesseract(1), wordlist2dawg(1), cntraining(1), mftraining(1), unicharset(5),
unicharambigs(5)

COPYING
-------
Copyright \(C) 2009, Google Inc.
Licensed under the Apache License, Version 2.0

AUTHOR
------
The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).