Questions tagged [python-tesseract]

Python-tesseract is a wrapper for the Tesseract OCR engine that reads conventional image files (JPG, GIF, PNG, TIFF, etc.) and returns their text, returns data about the recognized text, or converts them to PDF.

Tesseract is advertised as the most accurate open source OCR engine available. It was developed at HP Labs between 1985 and 1995 and then remained dormant until 2006 when Google revived the project.

For more information, please see the Python-tesseract page or the Tesseract page.

1664 questions
5
votes
1 answer

How to get character wise confidence in tesseract using command line?

I can get word-level confidence scores using Tesseract 4.0 through the command line. I would like to know whether there is a way to get character-level confidence too. For word-level confidence I used the command below: tesseract [Image name] outputbase…
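The word-level confidences mentioned in this excerpt can be read from Tesseract's TSV output, which has a `conf` column and marks word rows with `level` 5. The following is a minimal sketch of that parsing step only. The sample data and the `word_confidences` helper are made up for illustration, and the sketch does not address character-level confidence.

```python
import csv
import io

# Made-up sample in the column layout of Tesseract's TSV output.
# The values are illustrative only, not real OCR results.
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num"
    "\tleft\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t640\t480\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t36\t92\t60\t20\t96\tHello\n"
    "5\t1\t1\t1\t1\t2\t110\t92\t70\t20\t91\tworld\n"
)


def word_confidences(tsv_text):
    """Return (word, confidence) pairs from Tesseract-style TSV text."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        # Level 5 rows describe words; other levels (page, block, line) carry conf -1.
        if row["level"] != "5":
            continue
        text = (row["text"] or "").strip()
        if not text:
            continue
        # conf may be written as an integer or a float depending on the version.
        words.append((text, float(row["conf"])))
    return words


print(word_confidences(SAMPLE_TSV))
```

In practice the TSV text would come from a file written by Tesseract rather than from a hard-coded string.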
5
votes
2 answers

No module named pytesseract error

I am trying to use pytesseract for OCR on a Raspberry Pi running Raspbian. I have read several questions on this topic but can't find an answer that works; they usually say to install pytesseract with pip, and I did. My code is very simple: import…
droledenom
5
votes
2 answers

Extracting text out of images

I am working on extracting text out of images. Initially the images are colored, with the text in white. After further processing, the text is shown in black and the other pixels are white (with some noise). Here is a sample: Now when I try OCR…
5
votes
4 answers

Pytesseract: Error opening data file \\Program Files (x86)\\Tesseract-OCR\\en.traineddata

I am trying to use pytesseract in Jupyter Notebook. Environment: Windows 10 x64; Jupyter Notebook (Anaconda3, Python 3.6.1) running with administrative privileges; the working directory containing the TIFF file is on a different drive (Z:). When I run the following…
Henry
5
votes
4 answers

How to reduce wand memory usage?

I am using wand and pytesseract to get the text of PDFs uploaded to a Django website, like so: image_pdf = Image(blob=read_pdf_file, resolution=300) image_png = image_pdf.convert('png') req_image = [] final_text = [] for img in image_png.sequence: …
5
votes
1 answer

Using multiple languages in Pytesser

I have started to use Pytesser, which works great with both English and Chinese, but is there a way to have both languages work at the same time? Would I have to make my own traineddata file? My code is: import Image from pytesser import * print…
Dave Lin
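For context on the multi-language question above: the Tesseract command line accepts several languages joined with `+` in its `-l` option (for example `eng+chi_sim`), and pytesseract's `lang` argument takes the same string. The sketch below only builds such a command. The `tesseract_command` helper and the file name are hypothetical, and the sketch assumes the matching traineddata files are installed. It does not use Pytesser itself.

```python
def tesseract_command(image_path, languages, output_base="stdout"):
    """Build a Tesseract CLI argument list for one or more languages.

    Hypothetical helper for illustration; it assumes the traineddata
    file for every listed language is already installed.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    # Tesseract expects multiple languages as a single '+'-joined value.
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


# Example: English plus Simplified Chinese on a hypothetical file.
print(tesseract_command("page.png", ["eng", "chi_sim"]))
```

The resulting list could be passed to `subprocess.run` on a machine where Tesseract is installed.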
5
votes
4 answers

How to get Hocr output using python-tesseract

I had been getting really good results using pytesseract, but it is not able to preserve double spaces, and they are really important for me. So I decided to retrieve hOCR output rather than pure text. But there doesn't appear to be any way of…
Anurag
5
votes
2 answers

Pytesseract: UnicodeDecodeError: 'charmap' codec can't decode byte

I'm running a large number of OCRs on screenshots with Pytesseract. This is working well in most cases, but a small number is causing this error: pytesseract.image_to_string(image,None, False, "-psm 6") Pytesseract: UnicodeDecodeError: 'charmap'…
Nickpick
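One common way a `'charmap'` decode error arises is when UTF-8 bytes are decoded with a Windows code page such as cp1252, which has no mapping for a few byte values. Whether that is the cause in the question above is an assumption, since the excerpt is truncated. The sketch below reproduces the failure with plain Python and shows explicit UTF-8 decoding; the `decode_ocr_output` helper and the sample bytes are made up for illustration.

```python
# Sample bytes standing in for OCR output; the curly closing quote
# encodes to e2 80 9d, and 0x9d has no mapping in cp1252.
raw = "He said \u201chello\u201d".encode("utf-8")


def decode_ocr_output(raw_bytes):
    """Decode bytes as UTF-8, replacing any invalid sequences (hypothetical helper)."""
    return raw_bytes.decode("utf-8", errors="replace")


try:
    raw.decode("cp1252")
    cp1252_failed = False
except UnicodeDecodeError as exc:
    # The message names the 'charmap' codec, as in the question title.
    cp1252_failed = True
    print(exc)

print(decode_ocr_output(raw))
```

Decoding explicitly as UTF-8 avoids depending on the platform's default encoding.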
5
votes
1 answer

tesseract reading values from a table

My question follows this post about extracting data from a table in an image using OCR. I'm using tesseract to convert a table image to text. This works well except that the format of the table is not preserved. One solution is to replace the…
DJJ
5
votes
1 answer

Tesseract OCR: Parameter for Font Size (Single Character)

I want to use Tesseract to recognize a single noiseless character in a typical font (e.g., Times New Roman or Arial; no unusual fonts). The input image contains only the character, so the input image size is equivalent to the font size. I already…
4
votes
0 answers

pytesseract does not extract text from image

I have the following image and am trying to extract the text using pytesseract, but it always returns some unknown character. This is the code I am using: import pytesseract as pt from PIL import Image #Converting image to text img =…
JAMSHAID
4
votes
3 answers

Adjusting pytesseract parameters

Note: I am migrating this question from Data Science Stack Exchange, where it received little exposure. I am trying to implement an OCR solution to identify the numbers read from the picture of a screen. I am adapting this pyimagesearch tutorial to…
Sheldon
4
votes
1 answer

pytesseract improving OCR accuracy for blurred numbers on an image

Example of numbers. I am using the standard pytesseract image-to-text conversion. I have tried the digits-only option; 90% of the time it is perfect, but above is an example where it goes horribly wrong! This example produced no characters at all. As you can see…
4
votes
2 answers

How to find numbers in images and read them?

I have this picture: and this is my Region of Interest: which is a number that I would like to recognize and "read". I don't know why I can't detect it using pytesseract. Even though I preprocess it and get this image free of noise: Here is the…
Alexandre Tavares
4
votes
1 answer

Pytesseract doesnt recognize simple text in image

I want to recognize an image like this: I am using the following config: config="--psm 6 --oem 3 -c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ,." but when I try to convert that, I get the following: 1581 1 W I think that the…
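The config string quoted in this excerpt combines a page segmentation mode (`--psm`), an OCR engine mode (`--oem`), and a character whitelist set through `-c`. As a small sketch, such a string can be assembled from its parts so that each option is easy to vary while experimenting. The `build_config` helper is hypothetical and not part of pytesseract.

```python
def build_config(psm, oem, whitelist=None):
    """Assemble a Tesseract config string (hypothetical helper).

    psm: page segmentation mode number
    oem: OCR engine mode number
    whitelist: optional string of characters to restrict recognition to
    """
    parts = [f"--psm {psm}", f"--oem {oem}"]
    if whitelist:
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


# Rebuilds the config string quoted in the question.
config = build_config(6, 3, "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ,.")
print(config)
```

The resulting string would be passed as the `config` argument of a pytesseract call such as `image_to_string`.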