WORDLIST2DAWG(1)
================
:doctype: manpage

NAME
----
wordlist2dawg - convert a wordlist to a DAWG for Tesseract

SYNOPSIS
--------
*wordlist2dawg* 'WORDLIST' 'DAWG' 'lang.unicharset'

*wordlist2dawg* -t 'WORDLIST' 'DAWG' 'lang.unicharset'

*wordlist2dawg* -r 1 'WORDLIST' 'DAWG' 'lang.unicharset'

*wordlist2dawg* -r 2 'WORDLIST' 'DAWG' 'lang.unicharset'

*wordlist2dawg* -l <short> <long> 'WORDLIST' 'DAWG' 'lang.unicharset'

DESCRIPTION
-----------
wordlist2dawg(1) converts a wordlist to a Directed Acyclic Word Graph
(DAWG) for use with Tesseract.  A DAWG is a compressed, space- and
time-efficient representation of a word list.
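
To illustrate why a DAWG is more compact than a plain list or trie, the
following is a minimal Python sketch, not the format or algorithm used by
Tesseract. It builds a trie of nested dictionaries and then merges
structurally identical subtrees so that shared suffixes are stored once.
The names `build_trie`, `minimize`, `count_nodes`, `contains`, and the
`END` marker are hypothetical and exist only for this illustration.

```python
END = ""  # end-of-word marker (assumption of this sketch: no real character is empty)


def build_trie(words):
    """Build a plain trie of nested dicts, one edge per character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = {}  # mark that a word ends at this node
    return root


def minimize(node, registry):
    """Merge identical subtrees bottom-up, turning the trie into a DAWG."""
    for ch in list(node):
        node[ch] = minimize(node[ch], registry)
    # Two nodes are equivalent if they have the same labelled edges
    # leading to the same (already canonical) children.
    key = tuple(sorted((ch, id(child)) for ch, child in node.items()))
    return registry.setdefault(key, node)


def count_nodes(node, seen=None):
    """Count distinct node objects reachable from `node`."""
    seen = set() if seen is None else seen
    if id(node) in seen:
        return 0
    seen.add(id(node))
    return 1 + sum(count_nodes(child, seen) for child in node.values())


def contains(node, word):
    """Return True if `word` is stored in the graph rooted at `node`."""
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node


words = ["cats", "dogs"]
trie = build_trie(words)
trie_size = count_nodes(trie)   # size before merging
dawg = minimize(trie, {})
dawg_size = count_nodes(dawg)   # size after merging shared suffixes
print(trie_size, dawg_size, contains(dawg, "cats"), contains(dawg, "cat"))
```

Lookups still follow one edge per character, so merging suffixes saves
space without slowing down membership tests.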

OPTIONS
-------
-t
	Verify that a given DAWG file is equivalent to a given wordlist.

-r 1
	Reverse a word if it contains an RTL character.

-r 2
	Reverse all words.

-l <short> <long>
	Produce a file containing several DAWGs, one for each word
	length from <short> to <long> inclusive.
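
The two reversal policies selected by -r can be sketched in Python as
follows. This is only an approximation under stated assumptions: the test
for an RTL character uses the Unicode bidirectional classes `R` and `AL`,
and reversal is done code point by code point, both of which may differ
from what wordlist2dawg actually does. The function names, and the policy
value 0 standing for "no -r option given", are hypothetical.

```python
import unicodedata

# Unicode bidi classes of right-to-left letters (Hebrew-like and Arabic-like).
RTL_CLASSES = {"R", "AL"}


def has_rtl(word):
    """Return True if any character of `word` has an RTL bidi class."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in word)


def apply_reverse_policy(word, policy):
    """Apply a reversal policy modelled on the -r option.

    policy 0: leave the word unchanged (assumed behaviour without -r)
    policy 1: reverse the word only if it contains an RTL character
    policy 2: reverse every word
    """
    if policy == 2 or (policy == 1 and has_rtl(word)):
        return word[::-1]  # naive code point reversal
    return word


hebrew = "\u05e9\u05dc\u05d5\u05dd"  # a four-letter Hebrew word
print(apply_reverse_policy("abc", 1))                # unchanged: no RTL character
print(apply_reverse_policy("abc", 2))                # reversed
print(apply_reverse_policy(hebrew, 1) == hebrew[::-1])
```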

ARGUMENTS
---------

'WORDLIST'
	A plain text file in UTF-8, one word per line.

'DAWG'
	The output DAWG to write.

'lang.unicharset'
	The unicharset of the language. This is the unicharset
	generated by mftraining(1).
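
As a hedged illustration of how the three arguments fit together, the
Python sketch below writes a wordlist in the expected form (UTF-8, one
word per line) and assembles the command lines for conversion and for
verification with -t. The file names `eng.wordlist`, `eng.word-dawg`, and
`eng.unicharset` are placeholders, the helper names are hypothetical, and
dropping blank entries is a choice made by this sketch rather than a
documented requirement.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def write_wordlist(path, words):
    """Write `words` as UTF-8, one word per line, skipping blank entries."""
    cleaned = [w.strip() for w in words if w.strip()]
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned


def build_command(wordlist, dawg, unicharset, verify=False):
    """Assemble the argument vector shown in the SYNOPSIS section."""
    cmd = ["wordlist2dawg"]
    if verify:
        cmd.append("-t")  # check an existing DAWG against the wordlist
    cmd += [str(wordlist), str(dawg), str(unicharset)]
    return cmd


def run(cmd):
    """Run `cmd` if wordlist2dawg is installed; return its exit code or None."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd).returncode


# Demonstrate the wordlist format in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "eng.wordlist"
    written = write_wordlist(demo_path, ["cat", "dog", "", "bird "])
    file_text = demo_path.read_text(encoding="utf-8")

convert_cmd = build_command("eng.wordlist", "eng.word-dawg", "eng.unicharset")
verify_cmd = build_command("eng.wordlist", "eng.word-dawg", "eng.unicharset",
                           verify=True)
print(" ".join(convert_cmd))
print(" ".join(verify_cmd))
```

Calling `run(convert_cmd)` followed by `run(verify_cmd)` in a directory
that contains the real input files would perform the conversion and then
check the result; the sketch itself only prints the commands.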

SEE ALSO
--------
tesseract(1), combine_tessdata(1), dawg2wordlist(1)

<https://tesseract-ocr.github.io/tessdoc/Training-Tesseract.html>

COPYING
-------
Copyright \(C) 2006 Google, Inc.
Licensed under the Apache License, Version 2.0

AUTHOR
------
The Tesseract OCR engine was written by Ray Smith and his research groups
at Hewlett Packard (1985-1995) and Google (2006-present).