Questions tagged [tesseract]

Tesseract is an OCR (Optical Character Recognition) engine originally developed at HP Labs and now available as an open source library with development sponsored by Google.

Tesseract is an open source, multilingual OCR (Optical Character Recognition) engine originally developed at HP Labs. Its development is now sponsored by Google, and it is licensed under the Apache License 2.0. It currently recognizes 107 languages. Tesseract is primarily written in C++ and C. The project is hosted at https://github.com/tesseract-ocr/tesseract and its support forum is at http://groups.google.com/group/tesseract-ocr.

4350 questions
17
votes
2 answers

Can I use OCR to detect font style (bold, italic)?

I am interested in using OCR to extract bold and italic words from a simple text. For example, if I input a clear image with text like so: "The quick brown fox jumps over the lazy dog." I would like to get an output like so: bold("brown", "jumps"),…
vamin
17
votes
4 answers

pytesseract using tesseract 4.0 numbers only not working

Has anyone managed to get digits-only output when calling the latest version of Tesseract (4.0) from Python? The code below worked in 3.05 but still returns letters in 4.0. I tried removing all config files except the digits file and it still didn't work; any help would be…
CuriousGeorge
17
votes
1 answer

Pytesseract set character whitelist

Does anyone know how to set the character whitelist for Pytesseract? I want it to only output A-z and 0-9. Is this possible? I have the following: img = Image.open('test.jpg') result = pytesseract.image_to_string(img, config='-psm 6') I'm getting…
Minato10
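
For the whitelist question above, a minimal sketch of how such a config string could be assembled. `build_tesseract_config` is a hypothetical helper written for this illustration, not part of pytesseract; it only builds text and never calls Tesseract. It assumes the double-dash `--psm` spelling used by Tesseract 4 (the `-psm` form in the question is the 3.x spelling) and the `tessedit_char_whitelist` variable set through `-c`. Whether the whitelist is honored depends on the Tesseract version and engine: the LSTM engine in 4.0 reportedly ignores it, while the legacy engine (`--oem 0`) respects it.

```python
import shlex
import string


def build_tesseract_config(psm=None, whitelist=None, oem=None):
    """Assemble a Tesseract option string (hypothetical helper, illustration only)."""
    parts = []
    if oem is not None:
        # OCR engine mode, e.g. 0 for the legacy engine
        parts += ["--oem", str(oem)]
    if psm is not None:
        # Page segmentation mode, e.g. 6 for a single uniform block of text
        parts += ["--psm", str(psm)]
    if whitelist:
        # Restrict recognition to the given characters
        parts += ["-c", f"tessedit_char_whitelist={whitelist}"]
    # Quote each token so characters such as spaces survive shell-style splitting
    return " ".join(shlex.quote(p) for p in parts)


if __name__ == "__main__":
    allowed = string.ascii_letters + string.digits
    print(build_tesseract_config(psm=6, whitelist=allowed))
```

The resulting string would be passed as the `config=` argument of `pytesseract.image_to_string`, in place of the `'-psm 6'` value shown in the question.
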
17
votes
1 answer

How to extract text from a directory of PDF files efficiently with OCR?

I have a large directory of PDF files (scanned images). How can I efficiently extract the text from all the files in the directory? So far I tried: import multiprocessing import textract def extract_txt(file_path): text =…
john doe
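
One way the batch extraction asked about above could be structured, as a sketch only. It swaps the `multiprocessing` approach from the question for a thread pool, on the assumption that the per-file extraction spends most of its time waiting on an external OCR process. `extract_all` and its parameters are hypothetical names for this illustration, and `extract` is any caller-supplied function that takes a path and returns text.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def extract_all(directory, extract, pattern="*.pdf", workers=4):
    """Apply `extract` to every matching file; return {filename: text or None}."""
    paths = sorted(Path(directory).glob(pattern))

    def safe_extract(path):
        try:
            return extract(path)
        except Exception:
            # A single unreadable file should not abort the whole batch
            return None

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with `paths`
        texts = list(pool.map(safe_extract, paths))
    return {path.name: text for path, text in zip(paths, texts)}
```

In the setting of the question, `extract` might wrap a call such as `textract.process(str(path))`; that wiring is an assumption and is not shown in the truncated excerpt. Files that fail are recorded as `None` so they can be retried separately.
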
17
votes
1 answer

Tesseract user-patterns

Does anyone know how to use user patterns (user_patterns_suffix) in Tesseract? Could you advise me on how to set them up and how to test that they work? I tried to follow the Tesseract guide (Tesseract user-patterns), but I didn't see it affect the result at…
kha nguyen
17
votes
2 answers

Suggestions for digit recognition

I'm writing an Android app to extract a Sudoku puzzle from a picture. For each cell in the 9x9 Sudoku grid, I need to determine whether it contains one of the digits 1 through 9 or is blank. I start off with a Sudoku like this: I pre-process the…
1''
16
votes
2 answers

Alternative to Tesseract OCR Training?

For the past 3 months I've been trying to train Tesseract to identify a collection of images I have. Due to a real lack of proper documentation and a very high level of complexity, I'm starting to give up on Tesseract as a solution. I'm…
Asaf
16
votes
7 answers

Tesseract OCR Library - Learning Font

Well, I'm using a compiled .NET version of this OCR engine (available at http://www.pixel-technology.com/freeware/tessnet2/). I have it working; however, the aim is to read license plates, and sadly the engine really doesn't accurately…
Ash
16
votes
2 answers

Doing OCR with R

I have been trying to do OCR within R (reading PDF data stored as scanned images). I have been reading a very good post about this at http://electricarchaeology.ca/2014/07/15/doing-ocr-within-r/ which effectively describes 3 steps: convert pdf to ppm (an…
anshuk_pal
16
votes
1 answer

Improve Tesseract OCR results with blurred text

I am working on OCR recognition of printed text. In particular I am focusing on the preprocessing step to improve the results of the Tesseract engine. I have already obtained good results with adaptive thresholding, noise removal, text deskew,…
Marco Ancona
16
votes
2 answers

iOS Tesseract OCR Image Preparation

I would like to implement an OCR application that recognizes text from photos. I succeeded in compiling and integrating the Tesseract engine in iOS, and I get reasonable detection when photographing clear documents (or a photoshot…
alandalusi
16
votes
1 answer

Image processing for OCR with leptonica (inverse color text)

I am trying to process the following image with leptonica to extract text with tesseract. Original Image: Tesseract on the original image yields this: i s l D2J1FiiE-l191x1iitmwii9 uhiaiislz-2 Q ~37 Bottom linez With a little time! you can learn…
jasonlfunk
15
votes
1 answer

Difference between Tesseract 3 and Tesseract 4?

What are the major differences between Tesseract 3 and Tesseract 4? And why should I choose one over the other?
F.Lin
15
votes
1 answer

Where is the default tesseract installation folder on a mac?

I've just installed Tesseract through Homebrew. I need to put some files inside the tessdata folder, but I can't find it anywhere on my Mac. I searched for "tesseract" in Finder and the search returned nothing, and I couldn't find anything on Google…
Barbara
15
votes
1 answer

Why am I getting "tiff page 1 not found" Leptonica warning in Tesseract?

I just started using Tesseract. I am following the instructions described here. I have created a test image like this: training/text2image --text=test.txt --outputbase=eng.Arial.exp0 --font='Arial' --fonts_dir=/usr/share/fonts Now I want to train…