Questions tagged [python-tesseract]

Python-tesseract is a wrapper for Tesseract OCR that can read any conventional image file (JPG, GIF, PNG, TIFF, etc.) and return its text, data about that text, or a PDF conversion of it.

Python-tesseract is a wrapper class for OCR that allows conventional image files (JPG, GIF, PNG, TIFF, etc.) to be read and decoded into usable text.

Tesseract is advertised as the most accurate open-source OCR engine available. It was developed at HP Labs between 1985 and 1995 and then remained dormant until 2006, when Google revived the project.

For more information, please see the Python-tesseract page or the Tesseract page.

1664 questions
8 votes, 3 answers

Getting an error when using the image_to_osd method with pytesseract

Here's my code: import pytesseract import cv2 from PIL import Image pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe" def main(): original = cv2.imread('D_Testing.png', 0) # binary thresh it at…
Bob Stoops
  • 151
  • 2
  • 12
8 votes, 1 answer

Preserving Spaces in Tesseract

I have an image file that contains some text separated by tabs (2 spaces). But when I extract text from this image file, I always get a single space between two columns. A sample example: IMAGE: col-a col-b col-c Desired output: col-a …
raghu
  • 384
  • 7
  • 10
8 votes, 3 answers

error while trying to install tesserocr

I keep getting the same error when I try to install (env) vagrant@vagrant:~$ pip install tesserocr Collecting tesserocr Using cached tesserocr-2.1.3.tar.gz Building wheels for collected packages: tesserocr Running setup.py bdist_wheel for…
KSar
  • 101
  • 1
  • 2
  • 5
7 votes, 2 answers

Tesseract - unable to recognize Greek letters at all

I am trying to automatically extract a scale (scale bar + a number + unit) from an image. Here is an example: It is used to map pixels to real world measurement. I am using PyTesseract (installed through Anaconda3). Here is my code: import…
rbaleksandar
  • 8,713
  • 7
  • 76
  • 161
7 votes, 3 answers

Applying user patterns in pytesseract

I'm using pytesseract to try to detect certain string patterns in images. As far as I understand, the correct use of user patterns will help pytesseract scan better for a given string pattern. However, I can't figure out how to put…
aabujamra
  • 4,494
  • 13
  • 51
  • 101
7 votes, 3 answers

How to convert PDF into image readable by opencv-python?

I am using the following code to draw a rectangle around image text matching a date pattern, and it's working fine. import re import cv2 import pytesseract from PIL import Image from pytesseract import Output img = cv2.imread('invoice-sample.jpg') d =…
P.Natu
  • 131
  • 1
  • 3
  • 12
7 votes, 2 answers

Does anyone know the meaning of the output of the image_to_data and image_to_osd methods of pytesseract?

I'm trying to extract data from an image using pytesseract. This module has image_to_data and image_to_osd methods. These two methods provide a lot of info (TextLineOrder, WritingDirection, ScriptDetection, Orientation, etc.) as output. The image below is…
Eswar RDS
  • 351
  • 1
  • 3
  • 11
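
For readers unsure how to interpret the plain-text result of image_to_osd, here is a minimal sketch that splits "Key: value" lines into a dictionary. The sample string is illustrative only: it assumes the usual layout of Tesseract's orientation and script detection report and is not the output for the asker's image. The helper name `parse_osd` is hypothetical and not part of pytesseract.

```python
def parse_osd(osd_text):
    """Parse 'Key: value' lines (assumed OSD layout) into a dict.

    Numeric values become int or float; everything else stays a string.
    """
    result = {}
    for line in osd_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not 'Key: value' pairs
        key, value = key.strip(), value.strip()
        try:
            result[key] = int(value)
        except ValueError:
            try:
                result[key] = float(value)
            except ValueError:
                result[key] = value
    return result


# Illustrative sample (an assumption, not real output from any image).
SAMPLE_OSD = """Page number: 0
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 21.27
Script: Latin
Script confidence: 4.14
"""

info = parse_osd(SAMPLE_OSD)
print(info["Rotate"], info["Script"])
```

In this layout, "Orientation in degrees" is the detected orientation of the page, "Rotate" is the clockwise rotation that would bring it upright, and the two confidence fields indicate how strongly Tesseract trusts the orientation and script guesses.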
7 votes, 2 answers

tesseract 5.0 bazaar + user-words config doesn't work

I tried to force Tesseract to use only my word list when performing OCR. First, I copied the bazaar file to /usr/share/tesseract-ocr/5/tessdata/configs/. This is my bazaar file: load_system_dawg F load_freq_dawg F user_words_suffix user-words Then, I…
voxter
  • 853
  • 2
  • 14
  • 30
7 votes, 2 answers

Why can't I get a string with PIL and pytesseract?

It is a simple Optical Character Recognition (OCR) program in Python 3 to get a string. I have uploaded the target GIF file here; please download it and save it as /tmp/target.gif. try: from PIL import Image except ImportError: import…
showkey
  • 482
  • 42
  • 140
  • 295
7 votes, 1 answer

converting pdf to image but after zooming in

This link shows how PDFs can be converted to images. Is there a way to zoom my PDFs before converting them to images? In my project, I am converting PDFs to PNGs and then using the Python-tesseract library to extract text. I noticed that if I zoom PDFs and…
user2543622
  • 5,760
  • 25
  • 91
  • 159
7 votes, 1 answer

How to Create Traineddata file For Tesseract 4.1.0

I want to recognise the characters on a number plate. How do I train tesseract-ocr for number plates on Ubuntu 16.04? I'm not familiar with training, so please help me create a 'traineddata' file for recognizing number plates. I have…
7 votes, 1 answer

Empty string with Tesseract

I'm trying to read different cropped images from a big file and I manage to read most of them but there are some of them which return an empty string when I try to read them with tesseract. The code is just this…
Alberto Carmona
  • 467
  • 1
  • 9
  • 23
7 votes, 3 answers

Get orientation pytesseract Python3

I want to get the orientation of a scanned document. I saw this post Pytesseract OCR multiple config options and I tried to use --psm 0 to get the orientation. target = pytesseract.image_to_string(text, lang='eng', boxes=False, \ config='--psm 0…
lads
  • 1,125
  • 3
  • 15
  • 29
7 votes, 1 answer

Image Preprocessing for OCR - Tesseract

Obviously this image is pretty tough as it is low clarity and is not a real word. However, with this code, I'm detecting nothing close: import pytesseract from PIL import Image, ImageEnhance, ImageFilter image_name = 'NedNoodleArms.jpg' im =…
7 votes, 2 answers

Extract text from image using OCR in python

I want to extract text from a specific area of an image, such as the name and ID number from an identity card. The ID card I want to extract text from is in Chinese (a Chinese ID card). I have tried this code but it just extracts the…
Tehseen
  • 115
  • 2
  • 14