Name          Date         Size       #Lines  LOC
..            03-May-2022  -
bin/          03-May-2022  -          674     471
doc/          03-May-2022  -          395     301
examples/     03-May-2022  -          665     497
include/      20-Sep-2018  -          79      38
lib/          24-Sep-2010  -
man/          28-Sep-2018  -          230     197
src/          03-May-2022  -          24,729  18,349
AUTHORS       23-Feb-2004  243        8       5
BUGS          20-Sep-2018  2.3 KiB    57      48
CREDITS       24-Nov-2000  784        18      15
HISTORY       15-Oct-2018  15.1 KiB   298     278
INSTALL       15-Aug-2009  2.4 KiB    90      57
Makefile      06-Oct-2018  5.9 KiB    199     114
Makefile.in   28-Sep-2018  6 KiB      199     115
README        25-Sep-2018  7.1 KiB    168     140
REVIEW        05-Jan-2004  20.9 KiB   539     472
TODO          05-Mar-2013  3.6 KiB    80      74
configure     15-Oct-2018  141.4 KiB  4,922   4,050
configure.in  20-Sep-2018  2.6 KiB    72      62
gocr.spec     20-Sep-2018  4.1 KiB    144     61
install-sh    19-Nov-2000  68         4       2
make.bat      31-Oct-2006  1.7 KiB    58      57

README

                         GOCR (JOCR at SF.net)

GOCR is an optical character recognition program, released under the
GNU General Public License. It reads images in many formats and outputs
a text file. Supported image formats are pnm, pbm, pgm, ppm, and some pcx and
tga image files. Other formats such as pnm.gz, pnm.bz2, png, jpg, tiff, gif and
bmp are converted automatically using the netpbm-progs, gzip and bzip2
via a unix pipe.
A simple graphical frontend written in tcl/tk and some
sample files are included.
Gocr is also able to recognize and translate barcodes.
You do not have to train the program or store large font bases.
Simply call gocr from the command line and get your results.

For installation instructions, see the INSTALL file.

How to start? (QUICK START)
---------------------------
Some examples of how you can use gocr:

  gocr -h                                 # help
  gocr file.pbm                           # minimum options
  gocr -v 1 file.pbm >out.txt 2>out.log   # generate text and log files
  djpeg -pnm -gray text.jpg | gocr -      # using JPEG files
  gzip -cd text.pbm.gz | gocr -           # using gzipped PBM files
  giftopnm text.gif | gocr -              # using GIF files
  gocr -v 1 -v 32 -m 4 file.pbm           # zoning and out30.png output
  xloadimage -geometry 400x400 out30.png  # see details using image viewer
  gocr -f XML -i file.pgm -o file.xml     # output simple XML format
  wish gocr.tcl                           # X11 tcl/tk frontend (development version)
  # see manual pages for more details


How to get image files?
-----------------------
Scan text pages and save them as PGM/PBM/PNM files. Use a program such as
The GIMP or Sane. You can also use netpbm-progs to convert several image
formats into PGM/PBM/PNM. The tool djpeg can be used to convert jpeg into pgm.
If you have a POSIX-compatible system such as Linux, and the PNM tools, gzip and
bzip2 are installed, gocr will do the conversion
from [.pnm.gz, .pnm.bz2, .jpg, .jpeg, .bmp, .tiff, .png, .ps, .eps]
to [.pgm] for you. This list can easily be extended by editing src/pnm.c.

Gocr also comes with some examples; try: make examples.

Memory limitations
------------------
WARNING!!!

If you use a 300dpi scan of an A4 page, the image is about 2500x3500 pixels and
gocr requires 8.75MB (one byte per pixel) to store the picture in memory. Not
only that, but gocr may create a second copy, using a total of about 17.5MB.
This is independent of whether the image is b/w or gray-scale. Be sure that you
have enough RAM installed in your machine! Alternatively you can cut the
picture into smaller pieces.
You can use pnmcut, from the netpbm package, to cut the file. Example:

pnmcut -left 0 -right 2500 -top 0 -height 1000 bigfile.pnm > smallfile.pnm

Then run gocr on the cropped image as usual. Take care: if you cut through the
characters, gocr won't be able to recognize that line.
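
The arithmetic and the cutting advice above can be sketched in a few lines of
Python. This is only an illustration and not part of gocr: the helper names
image_bytes and plan_strips, the strip height of 1000 rows and the overlap of
100 rows are assumptions chosen for the example. Overlapping the strips means
a text line that is cut at the bottom of one strip appears whole in the next.

```python
def image_bytes(width_px, height_px, bytes_per_pixel=1):
    """Bytes needed for one in-memory copy of the image (one byte per pixel)."""
    return width_px * height_px * bytes_per_pixel


def plan_strips(height_px, strip_height, overlap):
    """Return (top, height) pairs that cover the page from top to bottom.

    Consecutive strips share `overlap` rows (hypothetical helper, not a
    gocr or netpbm function).
    """
    if not 0 <= overlap < strip_height:
        raise ValueError("overlap must be smaller than strip_height")
    strips = []
    top = 0
    while True:
        height = min(strip_height, height_px - top)
        strips.append((top, height))
        if top + height >= height_px:
            break
        top += strip_height - overlap
    return strips


one_copy = image_bytes(2500, 3500)
print(one_copy / 1e6)        # MB for one copy of a 300dpi A4 page
print(2 * one_copy / 1e6)    # MB if gocr keeps a second copy

# Print one pnmcut call per strip; the output file names are made up.
for top, height in plan_strips(3500, 1000, 100):
    print(f"pnmcut -left 0 -width 2500 -top {top} -height {height} "
          f"bigfile.pnm > strip_{top}.pnm")
```

With overlapping strips the same text line may be recognized twice, so the
results of neighbouring strips have to be merged by hand or by script.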

Future versions will take care of this issue automatically.

Limitations
-----------
gocr is still in its early stages. Your images should meet these requirements
if you want good output:

- good scans (all chars well separated, one column, no tables etc, 12pt 300dpi)
  should work well
- fonts 20-60 pixels ( 5pt * 1in/72pt * 300 dpi = 20 dots )
- output of image file for controlling detection
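
The font size rule above is a plain unit conversion (1 point = 1/72 inch).
The following sketch, with hypothetical helper names pt_to_px and px_to_pt,
shows how to check whether a given point size and scan resolution fall into
the 20-60 pixel range. It treats the nominal point size as the pixel height,
which is only an approximation of the real glyph height.

```python
def pt_to_px(points, dpi):
    """Convert a font size in points to pixels at the given scan resolution."""
    return points * dpi / 72.0


def px_to_pt(pixels, dpi):
    """Convert a height in pixels back to points at the given resolution."""
    return pixels * 72.0 / dpi


print(round(pt_to_px(5, 300), 1))   # lower bound from the list above
print(pt_to_px(12, 300))            # the 12pt 300dpi case
print(px_to_pt(60, 300))            # point size that reaches 60 pixels

# Check a few point sizes at 150dpi against the 20-60 pixel range.
for points in (8, 10, 12):
    px = pt_to_px(points, 150)
    print(points, "pt at 150dpi:", round(px, 1), "px, in range:", 20 <= px <= 60)
```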

Also note that recognition is very slow (this will change once recognition
works well)
  12pt 300dpi 1700x950 16lines 700chars 22x28 P90=40s..90s v0.2.3 (gcc -O0)

You can try to optimize the results:
- make good scans / preprocess the image
- try to change the critical gray level (option -l <n>)
- check the result in out10.png, out30.png (option -v 32)
  example: ./gocr -v 32 -m 4 -m 256 -m 56 ~/aac.jpg # only check layout
- increase option -d <n> for high resolution images which are noisy
- try different combinations for option -m <n>
- for thousands of documents with the same font
  you can use/create a database (-m 2/-m 130)
- use options -d 0 -m 8 on screen shots (font8x12)
- use filter option -C to throw out wrongly recognized chars (ex: gothic)

What does >> NOT << work at the moment:
- complex layouts (try option -m 4)
- bad scans, noisy/snowy images, FAX-quality images
- serif fonts, italic fonts, slanted fonts
- handwritten texts (this will remain true for the next ten years, I guess)
   the existence of autotrace can shorten this
- rotated images (but slightly rotated images should be no problem)
- small fonts (fax-like) or a mix of different font sizes
- colored images (use gray or black/white)
- Chinese, Arabic, Egyptian, Cyrillic or Klingon fonts

How it works or how it should work?
- put the entire file into RAM (300dpi grayscale recommended)
- remove dust and snow
- detect small angle (lines which are not horizontal)
- detect text boxes (option -m 4)
- detect text-lines
- detect characters
- first step recognition (every character has its own empirical procedure)
  - no neural network or similar general algorithms
- analyze unrecognized chars by comparing them with recognized ones
- try to divide overlapping chars
- experimental: compare all letters (like compression of pictures)
- for more details, see the gocr.html documentation
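
To make two of the steps above more concrete (applying a gray level threshold
and detecting text-lines), here is a minimal sketch of a generic textbook
approach: binarize the image, then treat every run of rows that contain ink
as one text line (a horizontal projection). This is NOT gocr's actual code.
The function names binarize and find_text_lines, the tiny sample page and the
assumption that pixels darker than the threshold count as ink are all made up
for illustration.

```python
def binarize(gray_rows, threshold):
    """Mark pixels darker than `threshold` as ink (True)."""
    return [[value < threshold for value in row] for row in gray_rows]


def find_text_lines(ink_rows, min_ink=1):
    """Return (first_row, last_row) for each run of rows that contain ink."""
    lines = []
    start = None
    for y, row in enumerate(ink_rows):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y                      # a new text line begins
        elif not has_ink and start is not None:
            lines.append((start, y - 1))   # the line ended on the previous row
            start = None
    if start is not None:                  # line runs to the bottom edge
        lines.append((start, len(ink_rows) - 1))
    return lines


# 6x4 gray image: 255 is white paper, small values are dark ink.
page = [
    [255, 255, 255, 255],
    [255,  10,  20, 255],
    [255,  15, 255, 255],
    [255, 255, 255, 255],
    [255, 255, 255, 255],
    [ 30, 255, 255,  40],
]
ink = binarize(page, threshold=128)
print(find_text_lines(ink))  # [(1, 2), (5, 5)]
```

Raising min_ink makes the sketch ignore rows with only a few dark pixels,
which is a crude stand-in for the dust removal step.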

Why are the results of the new version worse than those of the old version?
- the algorithms of gocr sometimes evolve: a fine-tuned old
  algorithm will be replaced by a completely new algorithm which is more
  general but a bit worse for your problem.
  Please send your sample and give the new algorithm the chance to become
  better than the old one.

Security
--------
Because gocr only reads and writes files, it is fairly safe, except for the
popen function, which allows you to call gocr with non-pnm image formats
directly. The popen function can be misused to start other, possibly
dangerous programs.
If you take care of the conversion to pnm format yourself, you can safely
disable the popen function by removing "#define HAVE_POPEN 1" from config.h
before compiling the gocr package.
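
If popen is disabled, the conversion has to happen outside gocr. The sketch
below shows one possible wrapper that runs the converter itself and feeds the
result to "gocr -" on stdin. The converter commands are the ones from the
quick start examples; the extension table, the function names build_commands
and run_ocr, and the rejection of file names starting with "-" are
assumptions of this sketch, not something gocr provides. Passing argument
lists to subprocess avoids starting a shell at all.

```python
import subprocess

# Converter commands as used in the quick start section; the mapping from
# file extension to tool is an assumption made for this example.
CONVERTERS = {
    ".jpg": ["djpeg", "-pnm", "-gray"],
    ".jpeg": ["djpeg", "-pnm", "-gray"],
    ".gif": ["giftopnm"],
    ".gz": ["gzip", "-cd"],
}


def build_commands(path):
    """Return (converter_argv or None, gocr_argv) for the given file."""
    if path.startswith("-"):
        # A leading dash could be parsed as an option by the tools.
        raise ValueError("file name must not start with '-'")
    lower = path.lower()
    for ext, tool in CONVERTERS.items():
        if lower.endswith(ext):
            return tool + [path], ["gocr", "-"]
    return None, ["gocr", path]   # already PNM: gocr reads the file directly


def run_ocr(path):
    """Run the conversion pipeline and return gocr's output as bytes."""
    converter, gocr = build_commands(path)
    if converter is None:
        return subprocess.run(gocr, capture_output=True, check=True).stdout
    with subprocess.Popen(converter, stdout=subprocess.PIPE) as conv:
        result = subprocess.run(gocr, stdin=conv.stdout,
                                capture_output=True, check=True)
    return result.stdout


print(build_commands("text.jpg"))
print(build_commands("file.pbm"))
```

run_ocr needs gocr and the converter tools to be installed and on the PATH;
build_commands can be tried without them.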


How can you help gocr?
----------------------
- Send comments, ideas and patches (diff -ru gocr_original/ gocr_changed/).
- If you find a bug, e.g. clear characters not recognized,
  crop the area around the problem to about 3 text lines and
  send it as png, or as jpeg for photos. Small images are easier
  to debug and will go into my check database.
- I always need small example files (.pbm.gz, png or jpeg)
  of at most 100kB for testing
  the behavior of the ocr engine under different conditions,
  because scanning takes a lot of time which I do not have.
  Please only use free image formats: .pbm.gz (b/w), png (screenshots,
  converted pdfs) or jpeg (optical scans, photos).
  That will help to make gocr the world's best open source OCR program. :) Thanks!
- Please don't send captchas. GOCR is mainly intended to make books
  and knowledge more easily accessible.
- If you have a good idea for how to handle some OCR tasks, tell me!
- If you have a lot of money, spend a bit (paypal). Ok, paypal has changed,
  so please forget about it. Also, the important problem is now the lack of
  spare time for coding.


After all, is it gocr or jocr?
------------------------------
The original name of this project is gocr, from GPLed Optical Character
Recognition. However, another project is using the same name, so the
name was changed to jocr. If you have a good idea for a name, please
send it.


Latest news
-----------
  http://www-e.uni-magdeburg.de/jschulen/ocr/

Authors: (see AUTHORS)
