I have written a program that ask for the user to upload the DMV license picture and reads the DMV License details from the uploaded picture using tesseract ocr. I've got the tesseract part work which does well to some extent. I have got a raw string and now I need to parse that string to fetch user details. The problem is that some fields on DMV license have no lables. Like name, address etc. I need to fetch these details. I can't come up with idea (may be I can use regex but don't know how to get it work?). If someone already did that I'd love to take a look. . Any suggestion would be welcome.
CODE
Here is the code to read the uploaded file and get text from image using tesseract.
from django.http import HttpResponse
from django.views.decorators.csrf import csrf_exempt
try:
from PIL import Image
except ImportError:
import Image
import pytesseract
# replace the path with the path to tesseract installation directory on server.
pytesseract.pytesseract.tesseract_cmd = "C:\Program Files (x86)\Tesseract-OCR\\tesseract.exe"
@csrf_exempt
def upload_dmv(request):
if request.method == "POST":
dmv = request.FILES['dmv']
extracted_data = pytesseract.image_to_string(Image.open(dmv))
print(extracted_data)
return HttpResponse(b'OK')
Output
W YORK STALE
DRIVER Ee C{ENeSae
876 071652
BOGADO
PETER,GIOVANNI.
9520 93RD ST FL 2
OZONE PARK, NY 114116
SexM_ Height6'-02" Eyes BRO:
00806/06/1992
Expires 06/06/2018
ENONE
RB
Issued 03/09/2017
Usa
~e-h.
Crecutive Deputy Comminsioner of Motor
Class E