
I'm trying to extract handwritten information from a scanned account opening form. For this I have used the Pytesseract Python library to extract the text data, but the output has a lot of irregularities and I'm getting uneven characters.

The boxes on the form in which users write their personal information (Name, Address, DOB, etc.) are also causing problems: pytesseract detects their edges as the letter 'I'. So is there any way to deal with these boxes?

Also, is there any other way to approach this task? If there is, please suggest it. This is the scanned form I'm working on.

Below is the code I have written:

import matplotlib.pyplot as plt
import pytesseract
from PIL import Image
from nltk.tokenize import sent_tokenize, word_tokenize

image = Image.open('printer1.jpg')
print(image.info.get('dpi'))  # 'dpi' may be absent from the file's metadata
image.save("new_img.jpg", dpi=(400, 400))  # note: this only changes the DPI metadata, not the pixel data

new_img = Image.open('new_img.jpg')
width, height = new_img.size

new_size = width*2, height*2
new_img = new_img.resize(new_size, Image.LANCZOS)  # upsample 2x with Lanczos resampling
new_img = new_img.convert('L')  # convert to grayscale

# binarize: evaluate every pixel against a fixed threshold
new_img = new_img.point(lambda x: 0 if x < 180 else 255, '1')

plt.imshow(new_img)
plt.show()

text = pytesseract.image_to_string(new_img)
text_array = word_tokenize(text)
print(text_array)

# assumes the OCR output contains 'Proof' and 'FIRST' as anchor tokens
Name_Data = text_array[text_array.index('Proof')+2 :
                       text_array.index('FIRST')-1]
print(Name_Data)

# drop the stray 'I' tokens that Tesseract produces for the box edges
Name = ""
for i in Name_Data:
    if i != 'I':
        Name += i

print(Name)
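On the token-cleaning side, the loop above also deletes a legitimate initial 'I'. Filtering only tokens that consist entirely of box-noise characters is slightly less brittle. A sketch; the character set `I`, `l`, `|` is an assumption about what Tesseract emits for the box edges, and the helper name is mine:

```python
import re

# Tokens made up of nothing but box-edge noise characters are dropped;
# real words that merely contain those characters survive.
NOISE = re.compile(r'^[Il|]+$')

def strip_box_noise(tokens):
    return [t for t in tokens if not NOISE.match(t)]
```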
  • Welcome to Stack Overflow! It is preferred if you can post separate questions instead of combining your questions into one. That way, it helps the people answering your question and also others hunting for at least one of your questions. Recognizing handwriting is a very complex topic; recognizing the "filled in" parts of your form is a second complex topic; image preprocessing is a big topic. Your question is far too broad: it is not really clear _where_ your problem is. Asking "How to detect handwriting in a scan of a form and parse it to correct data" is _far_ too broad to be covered here. – Patrick Artner Jan 02 '19 at 11:54
  • As an avenue of approach: if it is the same form all over, try to get a scan of it without writing, and (after aligning it with the scan) subtract it from the scanned image to get rid of all text/lines that are "fixed". Segment the areas of the scan into what inputs belong to what datafields and go from there. Viable? No idea. HTH. – Patrick Artner Jan 02 '19 at 11:58

0 Answers