
1.) The PDF data can't be read directly. Why is that?

2.) I saved each page as an image and then used Tesseract to recognize the text.

3.) The text can't be recognized properly because of the watermark in the background.

4.) How can the watermark be removed? (A general solution is required.)

The PDF is: https://drive.google.com/open?id=1pXJSdvYoIVfdTog14sOhDUmxAKJTBYWd

1.) To read the PDF directly, I used PyPDF2 and pdftotext, but both of them returned an empty list.

2.) I converted each PDF page to an image and passed that image to Tesseract to recognize the text, but the watermark interferes with the recognition.

from pdf2image import convert_from_path
import cv2
import pytesseract

# Render every page of the PDF as an image (500 DPI)
pages = convert_from_path('sample.pdf', 500)

# Counter used to name the image saved for each page
image_counter = 1

for page in pages:
    filename = "page_" + str(image_counter) + ".jpg"
    # Save the image of the page to disk
    page.save(filename, 'JPEG')

    # Increment the image counter
    image_counter += 1


output_file = open("output.txt", 'a')

for i in range(1, image_counter):
    filename = "page_" + str(i) + ".jpg"
    img = cv2.imread(filename)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Binarize the page with a Gaussian adaptive threshold
    gray = cv2.adaptiveThreshold(gray, 255,
                                 cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                 cv2.THRESH_BINARY, 115, 1)
    cv2.imshow("Processed image", gray)
    cv2.waitKey(0)
    cv2.destroyAllWindows()
    # Run OCR on the processed page and append the text to the output file
    text = str(pytesseract.image_to_string(gray))
    print(text)
    output_file.write(text)

output_file.close()
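
One possible direction for the watermark problem, sketched under a strong assumption: the watermark is noticeably lighter than the printed text. If that holds, a single global cutoff can push the watermark to white before OCR, whereas the adaptive threshold above tends to keep its edges. The function name `suppress_light_watermark` and the default `cutoff=150` are hypothetical and would need tuning per document. This is not a general watermark remover.

```python
import numpy as np


def suppress_light_watermark(gray, cutoff=150):
    """Make pixels brighter than `cutoff` white and all other pixels black.

    `gray` is a 2-D uint8 array, e.g. the result of
    cv2.cvtColor(img, cv2.COLOR_BGR2GRAY). `cutoff` is an assumed value
    that sits between the text brightness and the watermark brightness.
    """
    gray = np.asarray(gray, dtype=np.uint8)
    return np.where(gray > cutoff, 255, 0).astype(np.uint8)


# Tiny synthetic "page": 0 = black text, 200 = light gray watermark, 255 = paper
page = np.array([[0, 200, 255],
                 [200, 0, 200]], dtype=np.uint8)
cleaned = suppress_light_watermark(page)
print(cleaned.tolist())  # [[0, 255, 255], [255, 0, 255]]
```

With OpenCV, `cv2.threshold(gray, cutoff, 255, cv2.THRESH_BINARY)` performs the same operation on a grayscale image. If the watermark is as dark as the text, this approach will not separate them.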

I need to output the country, address, zip code, and mortgagee bank name; in general, I need to extract specific details from the recognized text. The issues I am facing are:

1.) The PDF cannot be read directly; it has to be converted to images first, and those images are then fed to Tesseract to recognize the text.

2.) The watermark prevents the text from being recognized properly.

3.) What is the preferable way to extract specific fields from the document? Should I use regex, search the whole document for each heading and take the detail that follows it, or use some other method?
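To make the regex option in point 3 concrete, here is a minimal sketch that searches the OCR text for a heading and captures the rest of that line. The labels (`Country`, `Address`, `Zip code`, `Mortgagee`), the `label: value` layout, and the sample text are all assumptions for illustration; the real headings have to be taken from the actual OCR output.

```python
import re


def label_pattern(label):
    # Match "<label>: value" or "<label> - value" on a single line, case-insensitively
    return re.compile(rf"^[ \t]*{label}[ \t]*[:\-][ \t]*(.+)$",
                      re.IGNORECASE | re.MULTILINE)


# Hypothetical headings; adjust them to the document being processed
FIELD_PATTERNS = {
    "country": label_pattern("Country"),
    "address": label_pattern("Address"),
    "zipcode": label_pattern(r"Zip[ \t]*code"),
    "mortgagee": label_pattern("Mortgagee"),
}


def extract_fields(text):
    """Return a dict of field name -> captured value, or None if not found."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1).strip() if match else None
    return fields


# Made-up sample text standing in for the OCR output
sample = (
    "Country: India\n"
    "Address: 12 Example Road\n"
    "Zip Code: 110001\n"
    "Mortgagee: Example Bank\n"
)
print(extract_fields(sample))
```

Returning `None` for a missing field makes it easy to spot pages where OCR noise broke a heading, which is likely to happen while the watermark is still present.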

Please help!!!

Aparajit Garg
  • Just for your info, check http://blog.uorz.me/2018/06/19/removeing-watermark-with-PyPDF2.html. and https://github.com/Goshin/Remove-PDF-Watermark – stormzhou Jul 15 '19 at 18:50
  • @stormzhou the first link needs location and I am trying to make a generalized script, the second link reads a pdf document as text but this particular pdf can't be read as text. Anything else you can recommend? – Aparajit Garg Jul 16 '19 at 05:24
  • Nothing in particular, I just did a brief search on google and found out these sources for your information. Good luck with your task. – stormzhou Jul 16 '19 at 14:17
  • Ok thanks for the effort. – Aparajit Garg Jul 17 '19 at 15:14
  • Your PDF merely is a container for some bitmaps - each page consists of a single bitmap image. Thus, all PDF libraries can do for you is allow you to extract those bitmaps. After that you need to go for image analysis / OCR. – mkl Jul 19 '19 at 12:53
  • @mkl OK, but how do I process the areas that have a watermark on them? The text seems to mix with it, and the OCR couldn't extract the text properly. – Aparajit Garg Jul 22 '19 at 04:59
  • As the watermark is not added by separate PDF instructions but instead is content of the bitmap image, you will need appropriate bitmap image analysis tools for that, not PDF tools. I'm not into that topic, so I cannot help. You might want to update your question, though, to better address people who are. Currently your question addresses foremost PDF experts, not bitmap analysis experts. – mkl Jul 22 '19 at 12:53
  • @mkl OK, thank you. I have edited the tags. Hope someone will be able to help now. Thanks. – Aparajit Garg Jul 24 '19 at 06:13

0 Answers