Text missing while reading a PDF in Python

Asked Oct 21 '22 at 11:14

Active Oct 21 '22 at 11:14

Viewed 22 times

Hi, I am trying to read a PDF file in Python.
One piece of text, shown in the picture below, is being read as "METER READING DATES: 04 8 2TO05 7 2".
Below is my code:

import pdf2image
import pytesseract
from pytesseract import Output

# pdf_path is assumed to be defined earlier as the path to the input PDF
poppler_path = r'C:\poppler-0.68.0\bin'
images = pdf2image.convert_from_path(pdf_path, poppler_path=poppler_path)
pil_im = images[0]  # assuming that we're interested in the first page only
ocr_dict = pytesseract.image_to_data(pil_im, lang='eng', output_type=Output.DICT)
text = " ".join(ocr_dict['text'])  # join the recognized words into one string

Now my question is: how can I read the text correctly, as it appears in the picture? Thanks in advance.


san1

The issue seems to be the "/2". You can try applying a morphological opening operator to the image before passing it to the OCR. This should separate the "/" and the "2" so that, hopefully, they are recognized correctly. Take a look at https://docs.opencv.org/4.x/d9/d61/tutorial_py_morphological_ops.html – Oct 21 '22 at 11:22
