1

I am extracting data from a scanned PDF with Tesseract OCR. The extraction works, but the accuracy is poor: in many places the output is wrong. Can I get the data with 100% accuracy using Python?

First I convert the PDF to JPG format, then I extract the text from the image using the pytesseract module.

from PIL import Image
import pytesseract

text=(pytesseract.image_to_string(Image.open(r"C:\Users\sumesh\Desktop\ip\ip\pdf11.jpg")))
text=repr(text)
text=text.replace(r"\n","")
print(text)

I expected the exact text from the PDF, but I am getting different characters. For example, z is shown as 2, 5 as s, and 1 as I.

Lord Elrond
Sumesh Kumar

2 Answers

-1

Hope the below small changes will help you.

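from PIL import Image
import pytesseract

# Pass the language explicitly; image_to_string already returns a str
text = pytesseract.image_to_string(Image.open(r"C:\Users\sumesh\Desktop\ip\ip\pdf11.jpg"), lang='eng')

# Remove real newline characters directly instead of calling repr() first
text = text.replace("\n", "")

print(text)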
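
If the language setting alone does not help, cleaning up the image before OCR sometimes does. The following is only a rough sketch, not a guaranteed fix: the function names, the threshold of 150, and the page segmentation mode 6 are assumptions chosen for illustration and would need tuning for your scan. No setting can promise 100% accuracy.

```python
def binarize_pixel(value, threshold=150):
    """Map a grayscale value (0-255) to pure black or white.

    The default threshold of 150 is an arbitrary starting point.
    """
    return 255 if value > threshold else 0


def ocr_with_preprocessing(image_path, lang="eng", psm=6, threshold=150):
    """Grayscale, contrast-stretch and binarize an image, then run Tesseract.

    Hypothetical helper: requires Pillow, pytesseract and a Tesseract install.
    """
    # Imported here so binarize_pixel can be used without these packages
    from PIL import Image, ImageOps
    import pytesseract

    image = Image.open(image_path)
    gray = ImageOps.grayscale(image)      # drop colour information
    gray = ImageOps.autocontrast(gray)    # stretch contrast to the full range
    # Apply the threshold to every pixel to get a black-and-white image
    binary = gray.point(lambda p: binarize_pixel(p, threshold))

    # --psm 6 tells Tesseract to assume a single uniform block of text
    return pytesseract.image_to_string(binary, lang=lang, config=f"--psm {psm}")
```

You would call it as ocr_with_preprocessing(r"C:\Users\sumesh\Desktop\ip\ip\pdf11.jpg") and compare the result with the output of the plain call.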

-1

Please pass "DPI=500" after your file path so the page is processed at a higher resolution; it might help. For more info, you can follow my answer posted here: How to convert .png images to searchable PDF/word using Python
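
As a sketch of what a higher DPI could look like in practice: the answer does not name a library, so the code below assumes pdf2image (which needs Poppler installed) for the PDF-to-image step. Both function names are hypothetical. The first helper only computes the pixel size for resampling a JPG you already have, which cannot add detail that the scan lacks.

```python
def upscaled_size(size, current_dpi, target_dpi):
    """Return the (width, height) in pixels an image needs at target_dpi."""
    width, height = size
    scale = target_dpi / current_dpi
    return (round(width * scale), round(height * scale))


def ocr_pdf_at_dpi(pdf_path, dpi=500, lang="eng"):
    """Render each PDF page at the given DPI and OCR it.

    Assumes pdf2image, Poppler, pytesseract and Tesseract are installed.
    """
    # Imported here so upscaled_size can be used without these packages
    from pdf2image import convert_from_path
    import pytesseract

    # convert_from_path returns one PIL image per page
    pages = convert_from_path(pdf_path, dpi=dpi)
    return "\n".join(pytesseract.image_to_string(page, lang=lang) for page in pages)
```

Rendering straight from the PDF also skips the intermediate JPG, so no extra compression artifacts are introduced before OCR.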

Deepak