I'm trying to extract handwritten information from a scanned account opening form. For this i have use Pytesseract python library for extracting text data. But using this module i'am having a lot of irregularities in the output, as i'am getting uneven characters.
Also the boxes in the form in which user write their personal information like Name, Address, DOB etc is also causing problem as the module pytesseract detecting it as the letter 'I'. So is there any way to deal with these boxes?
Also is there any other way to approach this task? if there is please suggest.This is the Scanned form on which i'm working
Below is the code i have done
import matplotlib.pyplot as plt
import pytesseract
from PIL import Image
from nltk.tokenize import sent_tokenize, word_tokenize
image = Image.open('printer1.jpg')
print(image.info['dpi'])
image.save("new_img.jpg", dpi=(400,400)) # increased the dpi and saved it
new_img = Image.open('new_img.jpg')
width, height = new_img.size
new_size = width*2, height*2
new_img = new_img.resize(new_size, Image.LANCZOS) #sampling
new_img = new_img.convert('L') #converted it to grayscale
new_img = new_img.point(lambda x: 0 if x < 180 else 255, '1')
#evaluatingevery single pixel in the image for binarization
plt.imshow(new_img)
plt.show()
text = pytesseract.image_to_string(new_img)
text_array = word_tokenize(text)
print(text_array)
Name_Data = text_array[text_array.index('Proof')+2 :
text_array.index('FIRST')-1]
print(Name_Data)
Name = ""
for i in Name_Data:
if i == 'I':
pass
else:
Name += i
print(Name)