# PyOCR

PyOCR is an optical character recognition (OCR) tool wrapper for Python.
That is, it helps you use various OCR tools from a Python program.

It has been tested only on GNU/Linux systems. It should also work on similar
systems (*BSD, etc.). It may or may not work on Windows, macOS, etc.

## Supported OCR tools

* Libtesseract (Python bindings for the C API)
* Tesseract (wrapper: fork + exec)
* Cuneiform (wrapper: fork + exec)

## Features

* Supports all the image formats supported by [Pillow](https://github.com/python-imaging/Pillow),
  including jpeg, png, gif, bmp, tiff and others
* Various output types: text only, bounding boxes, etc.
* Orientation detection (Tesseract and libtesseract only)
* Can focus on digits only (Tesseract and libtesseract only)
* Can save and reload boxes in hOCR format
* PDF generation (libtesseract only)

## Limitations

* hOCR: Only a subset of the specification is supported. For instance, pages and
  paragraph positions are not stored.

## Installation

```sh
sudo pip3 install pyocr  # Python 3.X
```

or the manual way:
```sh
mkdir -p ~/git ; cd ~/git
git clone https://gitlab.gnome.org/World/OpenPaperwork/pyocr.git
cd pyocr
make install  # will run 'python ./setup.py install'
```

## Usage

### Initialization

```Python
from PIL import Image
import sys

import pyocr
import pyocr.builders

tools = pyocr.get_available_tools()
if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)
# The tools are returned in the recommended order of usage
tool = tools[0]
print("Will use tool '%s'" % (tool.get_name()))
# Ex: Will use tool 'libtesseract'

langs = tool.get_available_languages()
print("Available languages: %s" % ", ".join(langs))
lang = langs[0]
print("Will use lang '%s'" % (lang))
# Ex: Will use lang 'fra'
# Note that languages are NOT sorted in any way. Please refer
# to the system locale settings for the default language
# to use.
```
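
Since the languages are not sorted, taking `langs[0]` picks an arbitrary one. A minimal sketch of one alternative is to try a preference list first. `pick_language` and the default preference order are hypothetical and not part of PyOCR; the hard-coded list stands in for the result of `tool.get_available_languages()`.

```python
def pick_language(available, preferred=("eng", "fra")):
    """Return the first preferred language that is available.

    Falls back to the first available language, or None if the list is empty.
    Hypothetical helper, not part of PyOCR.
    """
    for lang in preferred:
        if lang in available:
            return lang
    return available[0] if available else None


# Hard-coded list standing in for tool.get_available_languages()
print(pick_language(["osd", "fra", "deu"]))  # prints: fra
```

The preference tuple could also be derived from the system locale, as the comment above suggests.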

### Image to text

```Python
txt = tool.image_to_string(
    Image.open('test.png'),
    lang=lang,
    builder=pyocr.builders.TextBuilder()
)
# txt is a Python string

word_boxes = tool.image_to_string(
    Image.open('test.png'),
    lang="eng",
    builder=pyocr.builders.WordBoxBuilder()
)
# list of box objects. For each box object:
#   box.content is the word in the box
#   box.position is its position on the page (in pixels)
#
# Beware that some OCR tools (Tesseract for instance)
# may return empty boxes

line_and_word_boxes = tool.image_to_string(
    Image.open('test.png'), lang="fra",
    builder=pyocr.builders.LineBoxBuilder()
)
# list of line objects. For each line object:
#   line.word_boxes is a list of word boxes (the individual words in the line)
#   line.content is the whole text of the line
#   line.position is the position of the whole line on the page (in pixels)
#
# Each word box object has an attribute 'confidence' giving the confidence
# score provided by the OCR tool. Confidence score depends entirely on
# the OCR tool. Only supported with Tesseract and Libtesseract (always 0
# with Cuneiform).
#
# Beware that some OCR tools (Tesseract for instance) may return boxes
# with an empty content.

# Digits - Only Tesseract (not 'libtesseract' yet!)
digits = tool.image_to_string(
    Image.open('test-digits.png'),
    lang=lang,
    builder=pyocr.tesseract.DigitBuilder()
)
# digits is a Python string
```

Argument 'lang' is optional. The default value depends on
the tool used.

Argument 'builder' is optional. The default value is
builders.TextBuilder().

If the OCR fails, an exception ```pyocr.PyocrException```
will be raised.

An exception MAY be raised if the input image contains no
text at all (depending on the behavior of the OCR tool).
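
Since some tools may return boxes with empty content, it can help to filter the word boxes before using them. The following is only a sketch: `keep_confident_words` is a hypothetical helper, the threshold of 50 is arbitrary (confidence scales depend on the OCR tool), and the `SimpleNamespace` objects merely stand in for real box objects with `content` and `confidence` attributes.

```python
from types import SimpleNamespace


def keep_confident_words(word_boxes, min_confidence=0):
    """Drop boxes with empty content or a confidence below min_confidence.

    Hypothetical helper, not part of PyOCR. Boxes without a 'confidence'
    attribute are treated as having a confidence of 0.
    """
    return [
        box for box in word_boxes
        if box.content.strip()
        and getattr(box, "confidence", 0) >= min_confidence
    ]


# Stand-in objects mimicking the 'content' and 'confidence' attributes
boxes = [
    SimpleNamespace(content="Hello", confidence=92),
    SimpleNamespace(content="", confidence=95),
    SimpleNamespace(content="w0rld", confidence=31),
]
kept = keep_confident_words(boxes, min_confidence=50)
print([box.content for box in kept])  # prints: ['Hello']
```

With Cuneiform, where the confidence is always 0, the default `min_confidence=0` keeps every non-empty box.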

### Orientation detection

Currently only available with Tesseract or Libtesseract.

```Python
if tool.can_detect_orientation():
    try:
        orientation = tool.detect_orientation(
            Image.open('test.png'),
            lang='fra'
        )
    except pyocr.PyocrException as exc:
        print("Orientation detection failed: {}".format(exc))
    else:
        print("Orientation: {}".format(orientation))
# Ex: Orientation: {
#   'angle': 90,
#   'confidence': 123.4,
# }
```

Angles are given in degrees (range: [0, 360)). The exact possible
values depend on the tool used. Tesseract only returns the angles
0, 90, 180 and 270.

Confidence is a score arbitrarily defined by the tool. It MAY not
be returned.

detect_orientation() MAY raise an exception if there is no text
detected in the image.
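
Because the confidence may be missing and its scale is defined by the tool, code that consumes the result has to handle both cases. A minimal sketch, assuming the result is a dict shaped like the example above; `usable_angle` and its `min_confidence` parameter are hypothetical, not part of PyOCR.

```python
def usable_angle(orientation, min_confidence=None):
    """Return the detected angle in [0, 360), or None if it is not trusted.

    'orientation' is a dict with an 'angle' key and an optional 'confidence'
    key. When the confidence is absent, the angle is accepted as-is.
    Hypothetical helper, not part of PyOCR.
    """
    confidence = orientation.get("confidence")
    if (min_confidence is not None and confidence is not None
            and confidence < min_confidence):
        return None
    return orientation["angle"] % 360


print(usable_angle({"angle": 90, "confidence": 123.4}))  # prints: 90
print(usable_angle({"angle": 270}, min_confidence=10))   # prints: 270
```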

### Writing and reading text files

Writing:

```Python
import codecs
from PIL import Image
import pyocr
import pyocr.builders

tool = pyocr.get_available_tools()[0]
lang = tool.get_available_languages()[0]
builder = pyocr.builders.TextBuilder()

txt = tool.image_to_string(
    Image.open('test.png'),
    lang=lang,
    builder=builder
)
# txt is a Python string

with codecs.open("toto.txt", 'w', encoding='utf-8') as file_descriptor:
    builder.write_file(file_descriptor, txt)
# toto.txt is a simple text file, encoded in utf-8
```

Reading:

```Python
import codecs
import pyocr.builders

builder = pyocr.builders.TextBuilder()
with codecs.open("toto.txt", 'r', encoding='utf-8') as file_descriptor:
    txt = builder.read_file(file_descriptor)
# txt is a Python string
```

### Writing and reading hOCR files

Writing:

```Python
import codecs
from PIL import Image
import pyocr
import pyocr.builders

tool = pyocr.get_available_tools()[0]
lang = tool.get_available_languages()[0]
builder = pyocr.builders.LineBoxBuilder()

line_boxes = tool.image_to_string(
    Image.open('test.png'),
    lang=lang,
    builder=builder
)
# list of LineBox (each box points to a list of word boxes)

with codecs.open("toto.html", 'w', encoding='utf-8') as file_descriptor:
    builder.write_file(file_descriptor, line_boxes)
# toto.html is a valid XHTML file
```

Reading:

```Python
import codecs
import pyocr.builders

builder = pyocr.builders.LineBoxBuilder()
with codecs.open("toto.html", 'r', encoding='utf-8') as file_descriptor:
    line_boxes = builder.read_file(file_descriptor)
# list of LineBox (each box points to a list of word boxes)
```
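
Once line boxes have been reloaded, the plain text can be rebuilt from the `content` attribute of each line. This is only a sketch: `lines_to_text` is a hypothetical helper, and the `SimpleNamespace` objects stand in for the line objects returned by `read_file()`.

```python
from types import SimpleNamespace


def lines_to_text(line_boxes):
    """Join the content of every non-empty line with newlines.

    Hypothetical helper, not part of PyOCR.
    """
    return "\n".join(
        line.content for line in line_boxes if line.content.strip()
    )


# Stand-in objects mimicking the 'content' and 'word_boxes' attributes
lines = [
    SimpleNamespace(content="First line", word_boxes=[]),
    SimpleNamespace(content="", word_boxes=[]),
    SimpleNamespace(content="Second line", word_boxes=[]),
]
print(lines_to_text(lines))
```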

### Generating PDF file from an image

With libtesseract >= 4, it's possible to generate a PDF from an image:

```Python
import PIL.Image
import pyocr

image = PIL.Image.open("image.jpg")

builder = pyocr.libtesseract.LibtesseractPdfBuilder()
builder.add_image(image)    # multiple images are added as separate pages
builder.set_lang("deu")     # optional
builder.set_output_file("output_filename")  # .pdf will be appended
builder.build()
```

#### Add text layer to PDF

```Python
import pyocr
import pdf2image

images = pdf2image.convert_from_path("file.pdf", dpi=200, fmt='jpg')

builder = pyocr.libtesseract.LibtesseractPdfBuilder()
for image in images:
    builder.add_image(image)
builder.set_output_file("output")  # .pdf will be appended
builder.build()
```

Beware this code hasn't been adapted to libtesseract 3 yet.

## Dependencies

* PyOCR requires Python 3.4 or later.
* You will need [Pillow](https://github.com/python-imaging/Pillow)
  or Python Imaging Library (PIL). Under Debian/Ubuntu, Pillow is in
  the package ```python-pil``` (```python3-pil``` for the Python 3
  version).
* Install an OCR tool:
  * [libtesseract](http://code.google.com/p/tesseract-ocr/)
    ('libtesseract3' + 'tesseract-ocr-<lang>' in Debian).
  * or [tesseract-ocr](http://code.google.com/p/tesseract-ocr/)
    ('tesseract-ocr' + 'tesseract-ocr-<lang>' in Debian).
    You must be able to invoke the tesseract command as "tesseract".
    PyOCR is tested with Tesseract >= 3.01 only.
  * or Cuneiform
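
To check whether the command-line tools can be invoked by name, a short standard-library sketch can look them up on the `PATH`. `find_ocr_commands` is a hypothetical helper, and the command name `cuneiform` is an assumption. This does not detect the libtesseract shared library, only executables.

```python
import shutil


def find_ocr_commands(commands=("tesseract", "cuneiform")):
    """Map each command name to its full path, or None if it is not found.

    Hypothetical helper, not part of PyOCR.
    """
    return {name: shutil.which(name) for name in commands}


for name, path in find_ocr_commands().items():
    print("{}: {}".format(name, path or "not found"))
```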

## Tests

```sh
make check  # requires pyflake8
make test  # requires tox, pytest and python3
```

Tests are made to be run without external dependencies (no Tesseract or Cuneiform needed).

## OCR on natural scenes

If you want to run OCR on natural scenes (photos, etc.), you will have to filter
the image first. Many algorithms can do that. One of those
that gives the best results is [Stroke Width
Transform](https://gitlab.gnome.org/World/OpenPaperwork/libpillowfight#stroke-width-transformation).

## Contact

* [Forum](https://forum.openpaper.work/)
* [Bug tracker](https://gitlab.gnome.org/World/OpenPaperwork/pyocr/issues)

## Applications that use PyOCR

* [Mayan EDMS](http://mayan-edms.com/)
* [Paperless](https://github.com/danielquinn/paperless#readme)
* [Paperwork](https://gitlab.gnome.org/World/OpenPaperwork/paperwork#readme)

If you know of any other applications that use PyOCR, please
[tell us](https://forum.openpaper.work/) :-)

## Copyright

PyOCR is released under the GPL v3+.
Copyright belongs to the authors of each piece of code
(see the file AUTHORS for the contributors list, and
```git blame``` to know which lines belong to which author).

https://gitlab.gnome.org/World/OpenPaperwork/pyocr