UNICHARSET_EXTRACTOR(1)
=======================

NAME
----
unicharset_extractor - Reads box or plain text files to extract the unicharset.

SYNOPSIS
--------
*unicharset_extractor* [--output_unicharset filename] [--norm_mode mode] box_or_text_file [...]

Where mode means:
 1=combine graphemes (use for Latin and other simple scripts)
 2=split graphemes (use for Indic/Khmer/Myanmar)
 3=pure unicode (use for Arabic/Hebrew/Thai/Tibetan)

DESCRIPTION
-----------
Tesseract needs to know the set of possible characters it can output.
To generate the unicharset data file, use the unicharset_extractor
program on training pages bounding box files or a plain text file:

    unicharset_extractor fontfile_1.box fontfile_2.box ...

The unicharset will be put into the file './unicharset' if no output filename is provided.

*NOTE* Use the appropriate norm_mode based on the language.

SEE ALSO
--------
tesseract(1), unicharset(5)

<https://tesseract-ocr.github.io/tessdoc/Training-Tesseract.html>

HISTORY
-------
unicharset_extractor first appeared in Tesseract 2.00.

COPYING
-------
Copyright \(C) 2006, Google Inc.
Licensed under the Apache License, Version 2.0

AUTHOR
------
The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).