PDF data extraction from scanned PDF using Python

Question

I am extracting data from a scanned PDF with Tesseract OCR. The extraction works, but the accuracy is poor: in many places it shows wrong data. Can I get the data with 100% accuracy using Python?

First I convert the PDF to JPG format, then I extract the data from the image using the pytesseract module.

from PIL import Image
import pytesseract

text=(pytesseract.image_to_string(Image.open(r"C:\Users\sumesh\Desktop\ip\ip\pdf11.jpg")))
text=repr(text)
text=text.replace(r"\n","")
print(text)

I expected the correct data from the PDF, but I am getting wrong characters, e.g. z is shown as 2, 5 as s, 1 as I, etc.

score -1 · Answer 1 · answered Nov 22 '19 at 02:06

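I hope the small changes below help: they pass lang='eng' explicitly, and they strip real newline characters from the string instead of going through repr().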

from PIL import Image
import pytesseract

text=str(pytesseract.image_to_string(Image.open(r"C:\Users\sumesh\Desktop\ip\ip\pdf11.jpg"),lang='eng'))

text=text.replace("\n","")

print(text)
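
One side effect of replace("\n","") is that the last word of each line gets glued to the first word of the next. A minimal sketch of an alternative is shown here, joining the lines with single spaces instead. The helper name normalize_ocr_text and the sample string are made up for illustration; the sample only stands in for what pytesseract.image_to_string() would return.

```python
def normalize_ocr_text(text: str) -> str:
    """Collapse every run of whitespace (newlines included) into one space."""
    # str.split() with no arguments splits on any whitespace and drops
    # empty pieces, so blank lines and trailing newlines disappear too.
    return " ".join(text.split())


# Hypothetical stand-in for the string returned by the OCR call.
sample = "Invoice No 25\nTotal due\n\n1500\n"

print(sample.replace("\n", ""))    # Invoice No 25Total due1500
print(normalize_ocr_text(sample))  # Invoice No 25 Total due 1500
```

This only tidies the output. It does not change which characters Tesseract recognises, so misreads such as 5 read as s are unaffected.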

Arun vignesh.M

This may be the correct answer, but it would be more helpful if it also included *why* it's the correct answer :) – MyStackRunnethOver Nov 22 '19 at 02:17

score -1 · Answer 2 · answered Dec 11 '19 at 16:24

Please use "DPI=500" after your file path; it might help. For more information, see my answer to "How to convert .png images to searchable PDF/word using Python".
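
The answer does not say which function takes the DPI setting. As a rough sketch, assuming the PDF-to-image step is done with the pdf2image package (an assumption, not something stated above), the dpi argument of convert_from_path controls the rendering resolution. The function name ocr_scanned_pdf is hypothetical, and the sketch also assumes Poppler and Tesseract are installed.

```python
from typing import List


def ocr_scanned_pdf(pdf_path: str, dpi: int = 500, lang: str = "eng") -> List[str]:
    """Render each PDF page at `dpi` and return the OCR text, one string per page."""
    if dpi <= 0:
        raise ValueError("dpi must be a positive integer")

    # Imported here so the function can be defined without the OCR stack.
    # Assumption: pdf2image (needs Poppler) and pytesseract (needs Tesseract).
    from pdf2image import convert_from_path
    import pytesseract

    # Render the pages straight from the PDF at the requested resolution,
    # skipping the intermediate JPG file and its compression artifacts.
    pages = convert_from_path(pdf_path, dpi=dpi)
    return [pytesseract.image_to_string(page, lang=lang) for page in pages]


# Example call (the path is a placeholder):
# page_texts = ocr_scanned_pdf("scan.pdf", dpi=500)
```

A higher resolution gives Tesseract more pixels per character, which often reduces confusions like 5 and s, but no setting guarantees 100% accuracy on scanned input.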

Deepak
