Questions tagged [tesseract]

Tesseract is an open source, multilingual OCR (Optical Character Recognition) engine originally developed at HP Labs. Its development is now sponsored by Google, and it is licensed under the Apache License 2.0. It currently recognizes 107 languages. Tesseract is written primarily in C++ and C. The project is hosted at https://github.com/tesseract-ocr/tesseract and its support forum is at http://groups.google.com/group/tesseract-ocr.

4350 questions
15 votes, 1 answer

Explicitly set the font to be used for recognition by Tesseract-OCR

I have documents which use only one font throughout the document. Different documents might have different fonts, but I know which document uses which font. Is there an option to explicitly tell Tesseract-OCR which font to use during recognition for…
sashoalm
14 votes, 1 answer

Custom Dictionary for Tesseract

I am currently working on a project for Android using Tesseract OCR. I was hoping to fine-tune the results given to the user by adding a dictionary. According to the Tesseract OCR wiki, the best way to go about this would be to Replace…
TomSelleck
14 votes, 2 answers

7-Segment Display OCR

I'm building an iOS application (take a picture and run OCR on it) using Tesseract (an OCR library), and it works very well with well-written numbers and characters in common fonts. The problem I am having is that if I try it on a 7-Segment…
Karim
14 votes, 4 answers

Page layout analysis using Tesseract?

Tesseract 3 is able to perform page layout analysis. However, I couldn't find any sample code or documentation on how to use the library for such purposes. I hope someone here can explain how to perform layout analysis on an image and how to parse…
Pedro
14 votes, 3 answers

Image preprocessing with OpenCV before doing character recognition (tesseract)

I'm trying to develop a simple PC application for license plate recognition (Java + OpenCV + Tess4j). The images aren't very good (they will be better in the future). I want to preprocess the image for Tesseract, and I'm stuck on detection of license plate…
14 votes, 5 answers

How does one install Tesseract-OCR 3.03 in Ubuntu/Linux distributions?

A friend and I are interested in training the tesseract-OCR engine for a CV project. We tried using some wrappers such as PyTesser and pyocr, but the results are currently not as accurate as we need them to be. As such, we want to try training the…
greenteawarrior
14 votes, 2 answers

Open-CV - Not loading correctly

I'm using Ubuntu 14.04 and I'm trying to compile this code, but I get these errors no matter what. I believe it has something to do with including the OpenCV library, but I'm not sure. Could anyone help me out? Errors: main.cc:66:37: error:…
Bernardo Meurer
14 votes, 1 answer

Convert scanned pdf to .txt files using tesseract

I have to convert a .pdf file containing scanned images into .txt files. Tesseract OCR converts only images to .txt, so I first need to extract the .tif images and then convert them. Can anyone help me with this?
Ganesh Nannaware
14 votes, 2 answers

"Adding" new fonts to Tesseract eng.traineddata

As far as I know, Tesseract 3.x comes with 6 English fonts (correct me if I'm wrong). I need to train Tesseract for 5 more fonts. I need only capital letters and digits (no special characters or symbols). I followed various processes for…
md1hunox
14 votes, 4 answers

Python error when importing image_to_string from tesseract

I recently used Tesseract OCR with Python and kept getting an error when trying to import image_to_string from tesseract. Code causing the problem: # Perform OCR using tesseract-ocr library from tesseract import image_to_string image =…
digital_alchemy
14 votes, 3 answers

OCR: Image to text?

Before marking this as a duplicate, please read the whole question first. What I am able to do at present is as follows: Get the image and crop the desired part for OCR. Process the image using Tesseract and Leptonica. When the applied document is…
The iOSDev
13 votes, 1 answer

python-tesseract OCR: get digits only

I'm using Tesseract OCR with python-tesseract. In the Tesseract FAQ, regarding digits, we have: Use TessBaseAPI::SetVariable("tessedit_char_whitelist", "0123456789"); BEFORE calling an Init function or put this in a text file called …
jpimentel
13 votes, 1 answer

Is there any way to improve tesseract OCR with small fonts?

I'm trying to use Tesseract OCR via python-tesseract to read a low-resolution font that looks like this: Unfortunately, that image returns ZIJZHZI. I think the resolution is too low and that is causing problems. I've tried magnifying the image,…
Riazm
13 votes, 3 answers

Tesseract installation in Windows

I am currently working on an optical character recognition project using Python 2.7 and OpenCV on Windows. I learned that this task can be done using Tesseract. But it cannot be installed on Windows. I…
zeeshan
13 votes, 1 answer

Increase Accuracy of text recognition through pytesseract & PIL

I am trying to extract text from an image. Because the quality and size of the image are poor, the results are inaccurate. I tried a few enhancements and other things with PIL, but that only worsens the image quality. Can someone suggest…
sprksh