Questions tagged [tesseract]

Tesseract is an OCR (Optical Character Recognition) engine originally developed at HP Labs and now available as an open source library with development sponsored by Google.

Tesseract is an open source, multilingual OCR (Optical Character Recognition) engine originally developed at HP Labs. It is now sponsored by Google and licensed under the Apache License 2.0. It currently recognizes 107 languages. Tesseract is written primarily in C++ and C. The project is hosted at https://github.com/tesseract-ocr/tesseract and its support forum is at http://groups.google.com/group/tesseract-ocr.

4350 questions
39
votes
5 answers

Using tesseract to recognize license plates

I'm developing an app which can recognize license plates (ANPR). The first step is to extract the license plates from the image. I am using OpenCV to detect the plates based on width/height ratio and this works pretty well: But as you can see,…
unicorn80
38
votes
6 answers

How do I segment a document using Tesseract then output the resulting bounding boxes and labels

I'm trying to get Tesseract to output a file with labelled bounding boxes that result from page segmentation (pre OCR). I know it must be capable of doing this 'out of the box' because of the results shown at the ICDAR competitions where contestants…
James Owers
36
votes
9 answers

What is the ideal font for OCR?

Does anybody have any experience with different fonts for OCR? I am generating an ID then trying to scan it with tesseract. At the moment I am just T&E'n different fonts, but this seems pretty inefficient. I've tried the OCR* family of fonts, and…
Chris Lloyd
36
votes
6 answers

Preprocessing image for Tesseract OCR with OpenCV

I'm trying to develop an app that uses Tesseract to recognize text from documents taken by a phone's camera. I'm using OpenCV to preprocess the image for better recognition, applying a Gaussian blur and a threshold method for binarization, but the…
Mauricio
35
votes
6 answers

Recognize a number from an image

I'm trying to write an application to find the numbers inside an image and add them up. How can I identify the written number in an image? There are many boxes in the image; I need to get the numbers on the left side and sum them to give a total. How…
Hash
33
votes
9 answers

Tesseract OCR simple example

Hi, can anyone give me a simple example of testing Tesseract OCR, preferably in C#? I tried the demo found here. I downloaded the English dataset, unzipped it to the C drive, and modified the code as follows: string path =…
Will Robinson
33
votes
6 answers

Using Tesseract from Java

I'm trying to build a sample application in Java that will read an image file and just output the text extracted from the image. I found the Tesseract project, which seems promising; however, it's in C++. In order to use it, should I simply run it as…
Omnipresent
32
votes
5 answers

OCR with the Tesseract interface

How do you OCR a TIFF file using Tesseract's interface in C#? Currently I only know how to do it using the executable.
toh yen cheng
32
votes
2 answers

Which OCR Engine is better: Tesseract or OCRopus?

I have tried Tesseract on the iPhone and assessed its accuracy to be 70% without image preprocessing. I also noticed that it might be poor at extracting digits. I have heard about the OCRopus OCR engine: which is better, Tesseract or OCRopus, in terms of…
Ahmed Hussein
30
votes
2 answers

What OCR options exist beyond Tesseract?

I've used Tesseract a bit and its results leave much to be desired. I'm currently detecting very small images (35x15, without border, but have tried adding one with ImageMagick with no OCR advantage); they range from 2 chars to 5 and are a pretty…
ylluminate
30
votes
3 answers

Tesseract training for a new font

I'm still new to Tesseract OCR and, after using it in my script, noticed it had a relatively high error rate for the images I was trying to extract text from. I came across Tesseract training, which supposedly would be able to decrease the error rate for a…
user19235
30
votes
2 answers

How can I run tesseract with multiple languages at one time?

I have to analyze an image containing both English and Japanese text. When I run tesseract with the default (-l eng), some Japanese characters are lost. Otherwise, if I run tesseract with Japanese (-l jpn), some English characters are lost (e.g. Email). How…
pars
29
votes
5 answers

How to install a language in Tesseract OCR

I have installed Tesseract OCR and it has only 'eng' and 'osd' in the language list. I need the German language. I tried the following command, brew install tesseract-ocr-deu, but I am getting an error. Error: No available formula with the name…
Lama Madan
29
votes
4 answers

Removing horizontal underlines

I am attempting to pull text from a few hundred JPGs that contain information on capital punishment records; the JPGs are hosted by the Texas Department of Criminal Justice (TDCJ). Below is an example snippet with personally identifiable…
Brad Solomon
29
votes
2 answers

Can `tesseract-ocr` put the result to STDOUT?

Using tesseract-ocr #3.02.02. The basic usage of tesseract is tesseract sourc.png result, and result.txt is generated. To get the result text, I have to cat this file. Is there any option to dump the result to stdout?
otiai10