2

I have some PDFs that are scanned copies of a structured feedback form. The form has checkboxes and spaces for handwritten notes. I am trying to extract the data from these PDFs and save it to an unstructured CSV file. Using pytesseract, I am able to grab the printed text (by first converting the PDF to an image), but I am not able to capture the handwritten content. Is there any way of doing it? I am enclosing a sample form for reference.

Sample form: https://i.stack.imgur.com/NoNMt.jpg

PranavM
  • Extracting the text is one problem; recognizing it and saving it to a CSV is another, bigger one. With a bit of work you can extract the handwritten regions, but I don't know whether recognizing them is possible. You can try a vision API from Google, Amazon, or Microsoft to see whether the results are acceptable. If they aren't, I don't think recognizing the handwritten data is possible. To be honest, the text in the birthday and anniversary fields is difficult even for a human to read. – lucians Sep 28 '20 at 14:04

1 Answer

1

pytesseract is a Python wrapper around the Tesseract OCR engine, which is trained for printed text, not handwriting. So you have two options: 1) Retrain Tesseract on handwriting samples (this would be quite time-consuming and complicated, though). 2) Use a service or library that is actually meant for recognizing handwriting rather than printed text, like this one: https://learn.microsoft.com/en-us/azure/cognitive-services/computer-vision/quickstarts/python-hand-text
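To give an idea of what option 2 could look like, here is a minimal sketch using only the standard library. It assumes the service exposes its Read REST endpoint at `/vision/v3.2/read/analyze` (the version may differ, so check the linked quickstart) and that you have your own endpoint URL and key. The function names and the CSV layout are hypothetical, chosen only for illustration; this is not the official SDK sample.

```python
import csv
import json
import time
import urllib.request


def submit_image(endpoint, key, image_bytes):
    """Send one page image to the Read endpoint and return the polling URL.

    Assumption: the v3.2 path below matches your resource's API version.
    """
    request = urllib.request.Request(
        url=endpoint.rstrip("/") + "/vision/v3.2/read/analyze",
        data=image_bytes,
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/octet-stream",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        # The service answers asynchronously and returns the URL to poll.
        return response.headers["Operation-Location"]


def wait_for_result(operation_url, key, delay=1.0, max_polls=30):
    """Poll the operation URL until the service reports success or failure."""
    request = urllib.request.Request(
        operation_url, headers={"Ocp-Apim-Subscription-Key": key}
    )
    for _ in range(max_polls):
        with urllib.request.urlopen(request) as response:
            result = json.load(response)
        if result.get("status") in ("succeeded", "failed"):
            return result
        time.sleep(delay)
    raise TimeoutError("The Read operation did not finish in time")


def lines_from_result(result):
    """Collect the recognized text lines from a finished Read result."""
    lines = []
    for page in result.get("analyzeResult", {}).get("readResults", []):
        for line in page.get("lines", []):
            lines.append(line["text"])
    return lines


def write_lines_to_csv(source_name, lines, csv_path):
    """Write one CSV row per recognized line (hypothetical layout)."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["source", "line_number", "text"])
        for number, text in enumerate(lines, start=1):
            writer.writerow([source_name, number, text])
```

You would read the page image you already produce for pytesseract as bytes, pass it to `submit_image`, hand the returned URL to `wait_for_result`, and then feed `lines_from_result(...)` into `write_lines_to_csv`. How well this reads the handwriting on your form is something you would have to check on your own scans.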

  • It's difficult to extract handwritten text from an image using a pre-trained library, because everyone's handwriting is different. In that case, you need to train your own model and use it to extract the text. Reference link: https://towardsdatascience.com/build-a-handwritten-text-recognition-system-using-tensorflow-2326a3487cd5 – vishal yadav Dec 26 '19 at 07:02