Questions tagged [tesseract]

Tesseract is an OCR (Optical Character Recognition) engine originally developed at HP Labs and now available as an open source library with development sponsored by Google.

Tesseract is an open source, multilingual OCR (Optical Character Recognition) engine originally developed at HP Labs. It is now sponsored by Google and licensed under the Apache License 2.0. It currently recognizes 107 languages. Tesseract is primarily written in C++ and C. The project is hosted at https://github.com/tesseract-ocr/tesseract and its support forum is at http://groups.google.com/group/tesseract-ocr.

4350 questions
28
votes
4 answers

Tesseract OCR: PDF as input

I am building an OCR project and I am using a .NET wrapper for Tesseract. The samples that come with the wrapper don't show how to deal with a PDF as input. Using a PDF as input, how do I produce a searchable PDF using C#? I have used the Ghostscript library…
acrab
  • 319
  • 1
  • 3
  • 5
27
votes
3 answers

How to find parameters supported in Tesseract OCR config file

I want to know what parameters the config file used by Tesseract OCR accepts, how to write a config file, etc. I can't find any documentation about this on their site. How can I determine what parameters are supported, and what they mean?
sashoalm
  • 75,001
  • 122
  • 434
  • 781
26
votes
1 answer

How do I train Tesseract 4 with image data instead of a font file?

I'm trying to train Tesseract 4 with images instead of fonts. The docs explain only the approach with fonts, not with images. I know how it works with a prior version of Tesseract, but I don't understand how to use the box/tiff…
claim
  • 506
  • 6
  • 13
26
votes
3 answers

How can I use async to increase WinForms performance?

I was doing a processor-heavy task, and every time I start executing that command my WinForms window freezes; I can't even move it around until the task is completed. I used the same procedure from Microsoft, but nothing seems to have changed. My working…
Serak Shiferaw
  • 993
  • 2
  • 11
  • 32
25
votes
5 answers

Best way to recognize characters in screenshot?

What would you recommend for recognizing all characters from a screenshot? The screenshot is perfectly clear (only black text on a white background), and I can choose any standard font for the text (installed on Windows). I have tried some OCR ways…
Tomek
  • 251
  • 1
  • 3
  • 4
25
votes
2 answers

Tesseract traineddata not working in Swift 3.0 project using version 4.0

I'm attempting to use Tesseract-OCR-iOS in a new Swift 3.0 project. I'm using Xcode Version 8.1 (8B62). CocoaPods is version 1.1.1. When I attempt to use tesseract.recognize(), my app crashes and I get the following output in the…
Adrian
  • 16,233
  • 18
  • 112
  • 180
25
votes
5 answers

How to preserve document structure in Tesseract

I am using Tesseract OCR to extract text from an image. Preserving the structure of the document is very important to me. Currently Tesseract does not preserve the structure; in fact, it changes the order of the text. My input is the image below. and the…
Sar009
  • 2,166
  • 5
  • 29
  • 48
25
votes
4 answers

Tesseract Trained data

I am trying to extract data from receipts and bills using Tesseract version 3.02. I am using only the English data, yet the output accuracy is about 60%. Is there any trained data available that I can simply drop into the tessdata folder?
nicky
  • 3,810
  • 9
  • 35
  • 44
24
votes
5 answers

Why can't the Tesseract OCR library (iOS) recognize text at all?

I'm trying to use the Tesseract OCR library in my iOS application. I downloaded the tesseract-ios library from GitHub, and when I tried to recognize a simple text image I got garbage instead. Here is an image of what I tried to recognize: I got unreadable…
MainstreamDeveloper00
  • 8,436
  • 15
  • 56
  • 102
23
votes
4 answers

Tesseract does not recognize single characters

How to reproduce: (1) Create a new image in Paint (any size). (2) Add the letter A to this image. (3) Try to recognize it: Tesseract will not find any letters. (4) Copy-paste this letter 5-6 times into the image. (5) Try to recognize it again: Tesseract will find all the letters. Why?
artem
  • 16,382
  • 34
  • 113
  • 189
23
votes
5 answers

Tesseract OCR on AWS Lambda via virtualenv

I have spent all week attempting this, so this is a bit of a Hail Mary. I am attempting to package Tesseract OCR for AWS Lambda running on Python (I am also using Pillow for image pre-processing, hence the choice of Python). I understand how to…
Andy G
  • 820
  • 1
  • 8
  • 11
23
votes
2 answers

Tesseract receipt scanning advice needed

I have struggled off and on with Tesseract for various OCR projects. Today I found a use case that I thought would be a slam dunk for it, but after many hours I am still coming away unsatisfied. I wanted to pose the problem here and see…
Jim Sanders
  • 521
  • 1
  • 4
  • 10
23
votes
2 answers

How to avoid "Permission denied" while installing a Python package without sudo

I am trying to install the Tesseract wrapper for Python as user mike so that I can import tesseract. I'm following the guide here: https://code.google.com/p/python-tesseract/wiki/HowToCompilePythonTesseractForCentos However, when I execute python…
Anthony
  • 33,838
  • 42
  • 169
  • 278
23
votes
2 answers

Improving OCR performance on multi-paragraph scans

I'm working on a project that involves extracting text from scientific papers stored in PDF format. For most papers, this is accomplished quite easily using PDFMiner, but some older papers store their text as large images. In essence, a paper is…
Louis Thibault
  • 20,240
  • 25
  • 83
  • 152
22
votes
2 answers

Tesseract OCR confuses slashed 0 as 8

I have trained Tesseract on the Terminus font, but no matter what, I can't get it to recognize the 0s. I am using jTessBoxEditor to create the training TIFF and box files. Even when validating, it reads all 0s as 8s. Is there anything I am missing? Here…
Vilsol
  • 722
  • 1
  • 7
  • 17