1.) The text of the PDF can't be read directly. Why is that?
2.) I saved each page as an image and then used Tesseract to recognize the text.
3.) The text isn't recognized properly because of the watermark in the background.
4.) I need to remove the watermark (a general solution is required). The PDF is: https://drive.google.com/open?id=1pXJSdvYoIVfdTog14sOhDUmxAKJTBYWd
What I have tried so far:
1.) To read the PDF directly, I used PyPDF2 and pdftotext, but both of them returned an empty list.
2.) I converted each PDF page to an image and passed that image to Tesseract to recognize the text, but the watermark interferes with the recognition.
from pdf2image import convert_from_path
import cv2
import pytesseract

# Store all the pages of the PDF in a variable (rendered at 500 DPI)
pages = convert_from_path('sample.pdf', 500)
# Counter used to name the image saved for each page of the PDF
image_counter = 1
for page in pages:
    filename = "page_" + str(image_counter) + ".jpg"
    # Save the image of the page to disk
    page.save(filename, 'JPEG')
    # Increment the image counter
    image_counter += 1
output_file = open("output.txt", 'a')
for i in range(1, image_counter):
    filename = "page_" + str(i) + ".jpg"
    img = cv2.imread(filename)
    # Convert to grayscale and binarize before running OCR
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    gray = cv2.adaptiveThreshold(gray, 255,
                                 cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                 cv2.THRESH_BINARY, 115, 1)
    cv2.imshow("Processed image", gray)
    cv2.waitKey(0)
    cv2.destroyAllWindows()
    text = str(pytesseract.image_to_string(gray))
    print(text)
    output_file.write(text)
output_file.close()
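One direction I am considering for the watermark, as a rough sketch only: if the watermark is noticeably lighter than the printed text, a single global cutoff might drop it where the adaptive threshold keeps it. The function name `suppress_light_watermark` and the cutoff of 120 are my own assumptions, not values measured from the actual PDF, and the tiny array stands in for a real grayscale page.

```python
import numpy as np

def suppress_light_watermark(gray, cutoff=120):
    """Keep only pixels at or below `cutoff` as black; everything lighter becomes white.

    gray:   2-D uint8 array (0 = black, 255 = white).
    cutoff: assumed value; it has to sit between the text and watermark intensities.
    """
    gray = np.asarray(gray, dtype=np.uint8)
    return np.where(gray > cutoff, 255, 0).astype(np.uint8)

# Synthetic stand-in for a page: white paper (255), gray watermark (180), dark text (30)
page = np.array([[255, 180, 30],
                 [180, 30, 255]], dtype=np.uint8)
cleaned = suppress_light_watermark(page)
print(cleaned)
```

In the loop above this would take the place of the `cv2.adaptiveThreshold` call; `cv2.threshold(gray, 120, 255, cv2.THRESH_BINARY)` should give the same result. It will not help if the watermark is as dark as the text, which is why I am still looking for a general solution.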
I need to output the country, address, zip code, and mortgagee bank name; in general, I need to extract specific details from the recognized text. The issues I am facing are:
1.) The PDF can't be read directly; each page has to be converted to an image first and then fed to Tesseract to recognize the text.
2.) The watermark prevents the text from being read properly.
3.) What is the preferable way to extract specific fields from the document? Should I use regex, search the whole document for each heading and take the detail that follows it, or use some other method?
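To make the third point concrete, this is roughly what I mean by the regex approach. It is only a sketch: the sample text, the label spellings in `FIELD_LABELS`, and the helper `extract_fields` are all made up for illustration, and it assumes each field sits on one line in the form "Label: value", which may not hold for the real document.

```python
import re

# Made-up OCR output; the labels in the real document may be different.
sample_text = """Mortgagee: First Example Bank
Property Address: 12 Sample Street, Springfield
Zip Code: 12345
Country: USA
"""

# Output field name -> regex for the label that precedes the value (assumed labels)
FIELD_LABELS = {
    "mortgagee": r"Mortgagee",
    "address": r"(?:Property\s+)?Address",
    "zipcode": r"Zip\s*Code",
    "country": r"Country",
}

def extract_fields(text):
    """Return a dict of field name -> value, or None when a label is not found."""
    fields = {}
    for name, label in FIELD_LABELS.items():
        # Match "<label> : <value>" on a single line, ignoring case
        pattern = rf"^\s*{label}\s*[:\-]\s*(.+?)\s*$"
        match = re.search(pattern, text, re.IGNORECASE | re.MULTILINE)
        fields[name] = match.group(1) if match else None
    return fields

result = extract_fields(sample_text)
print(result)
```

Searching for headings would amount to the same thing with the labels found by plain string search, so my real question is whether either approach holds up against OCR errors in the labels themselves.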
Please help!!!