
I have a few thousand PDF files containing 1-bit B&W images of digitized paper forms. I'm trying to OCR some fields, but sometimes the handwriting is too faint:

[image: scan of the form field with faint handwriting]
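
For reference, the page images can be pulled out of the PDFs with something like this (a sketch assuming the `pdf2image` package, which wraps Poppler; the file name is a placeholder):

import numpy as np
from pdf2image import convert_from_path

pages = convert_from_path("form.pdf", dpi=300)
im = np.array(pages[0].convert("L"))  # 8-bit grayscale array for OpenCV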

I've just learned about morphological transforms. They are really cool!!! I feel like I'm abusing them (like I did with regular expressions when I learned Perl).

I'm only interested in the date, 07-06-2017:

import cv2
from matplotlib import pyplot as plt

im = cv2.blur(im, (5, 5))
plt.imshow(im, 'gray')

[image: blurred field]

ret, thresh = cv2.threshold(im, 250, 255, cv2.THRESH_BINARY)
plt.imshow(~thresh, 'gray')

[image: inverted threshold result]

People filling in this form seem to have some disregard for the grid, so I tried to get rid of it. I can isolate the horizontal line with this transform:

# an opening with a wide, flat kernel keeps only long horizontal runs
horizontal = cv2.morphologyEx(
    ~thresh,
    cv2.MORPH_OPEN,
    cv2.getStructuringElement(cv2.MORPH_RECT, (100, 1)),
)
plt.imshow(horizontal, 'gray')

[image: isolated horizontal grid line]

I can get the vertical lines as well:

plt.imshow(horizontal ^ ~thresh, 'gray')

# `roi` is the date field cropped from the page (the cropping code is not shown)
ret, thresh2 = cv2.threshold(roi, 127, 255, cv2.THRESH_BINARY)
# an opening with a tall, narrow kernel keeps only long vertical runs
vertical = cv2.morphologyEx(
    ~thresh2,
    cv2.MORPH_OPEN,
    cv2.getStructuringElement(cv2.MORPH_RECT, (2, 15)),
    iterations=2
)
# invert and erode both line masks so they can be ANDed with the text later
vertical = cv2.morphologyEx(
    ~vertical,
    cv2.MORPH_ERODE,
    cv2.getStructuringElement(cv2.MORPH_RECT, (9, 9))
)
horizontal = cv2.morphologyEx(
    ~horizontal,
    cv2.MORPH_ERODE,
    cv2.getStructuringElement(cv2.MORPH_RECT, (7, 7))
)
plt.imshow(vertical & horizontal, 'gray')

[image: combined grid mask]

Now I can get rid of the grid:

# AND the inverted, eroded line masks with the text to blank out the grid
plt.imshow(horizontal & vertical & ~thresh, 'gray')

[image: date digits with the grid removed]

The best I got was this, but the 4 is still split into two pieces:

# `im2` holds the de-gridded image from the previous step
plt.imshow(cv2.morphologyEx(im2, cv2.MORPH_CLOSE,
    cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))), 'gray')

[image: closed result; the 4 is still in two pieces]
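
One more idea for question 3 below: a closing kernel elongated along the gap might join the pieces where the symmetric ellipse fails. If the two halves of the "4" sit side by side, a wider-than-tall kernel is one guess (the size needs tuning, and too wide a kernel starts gluing neighboring digits together):

# hypothetical directional closing: wide and short, to bridge a
# horizontal gap without thickening the strokes vertically
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (9, 3))
plt.imshow(cv2.morphologyEx(im2, cv2.MORPH_CLOSE, kernel), 'gray')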

Probably at this point it is better to use cv2.findContours and some heuristics to locate each digit (a rough sketch follows the questions below), but I was wondering:

  1. should I give up and demand that all documents be rescanned in grayscale?
  2. are there better methods to isolate and locate the faint digits?
  3. is there a morphological transform that would join cases like the "4"?
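
For the contour route, something along these lines could be a first pass. This is just a sketch: `cleaned` stands for the closed image from the last step, and the size limits are guesses that would need tuning.

# locate digit candidates as external contours of the cleaned image;
# [-2] keeps this working on both OpenCV 3.x (3 return values) and 4.x (2)
contours = cv2.findContours(cleaned, cv2.RETR_EXTERNAL,
                            cv2.CHAIN_APPROX_SIMPLE)[-2]
boxes = [cv2.boundingRect(c) for c in contours]
# keep plausibly digit-sized blobs, ordered left to right
digits = sorted((b for b in boxes if 5 < b[2] < 50 and 10 < b[3] < 50),
                key=lambda b: b[0])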

[update]

DarkCygnus asked in the comments:

> Is rescanning the documents too demanding? If it is no great trouble, I believe it is better to get higher-quality inputs than to train and refine your model to withstand noisy and atypical data.

A bit of context: I'm a nobody working at a public agency in Brazil. Prices for ICR solutions start in the six digits, so nobody believes a single guy can write an ICR solution in-house. I'm naive enough to believe I can prove them wrong. Those PDF documents were sitting on an FTP server (about 100K files) and were scanned just to get rid of the dead-tree version. I could probably get the original forms and scan them again myself, but I would have to ask for some official support - since this is the public sector, I would like to keep this project underground as much as I can. What I have now is an error rate of 50%, but if this approach is a dead end there is no point trying to improve it.

  • Is rescanning the documents too demanding? If it is no great trouble, I believe it is better to get higher-quality inputs than to train and refine your model to withstand noisy and atypical data. – DarkCygnus Jul 10 '17 at 21:09
  • @DarkCygnus: I would have to cross an ocean of bureaucracy and inertia, but it is possible. I would probably have to do all the manual work myself. – Paulo Scardine Jul 10 '17 at 21:36
  • May I also suggest you take a look at this [tutorial](http://www.pyimagesearch.com/2017/07/10/using-tesseract-ocr-python/) (from the same source as the one I linked in my answer to your previous question), where they introduce Tesseract (a wrapper of Google's OCR engine) as a great tool for doing OCR. Also, I found [this paper](http://worldcomp-proceedings.com/proc/p2016/ICA3674.pdf) that explains how to improve character recognition using K-nearest neighbors with a Euclidean distance metric. Good luck traversing that ocean :) – DarkCygnus Jul 10 '17 at 21:46
  • By the way, I'm already using `pytesseract` with success for grabbing the printed form number. I've successfully linked 70,000 images with the corresponding records in a database fed by professional human typists. This is already useful, as I found lots of documents that should be in the database but aren't. Politically it is a gamble: I would make some enemies by writing a system that uncovers their screw-ups, so I was hoping to show something else. – Paulo Scardine Jul 10 '17 at 21:57
  • Handwriting recognition is something neural networks are supposed to be great at, and there are plenty of free .NET implementations; they often come with character-recognition sample sets as their go-to examples. – PhillipH Aug 03 '17 at 09:20
  • @PhillipH You are right. That makes recognition the easiest part; the hard work is preprocessing the image to locate the digits and normalize the samples (a normalization sketch follows these comments). For example, when a digit is split like the "4" in the image above, sometimes the algorithm reads it as "11" or "14". – Paulo Scardine Aug 03 '17 at 14:15
  • @PauloScardine - I have no background in neural networks, but you shouldn't need to preprocess it; given enough samples and training it should do the 'preprocessing' step for you. Or at least that's what my book says... I don't suppose this is getting you nearer your answer though :0) – PhillipH Aug 03 '17 at 17:02
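
On the normalization point in the last comments: most off-the-shelf digit classifiers expect fixed-size inputs (MNIST samples are 28x28 pixels), so each located blob has to be cropped, centered on a square canvas, and resized. A minimal sketch of that step (the function name and defaults are made up):

def normalize_digit(crop, size=28):
    """Center a digit crop on a square canvas, then resize (MNIST-style)."""
    h, w = crop.shape
    side = max(h, w)
    canvas = np.zeros((side, side), dtype=crop.dtype)
    y, x = (side - h) // 2, (side - w) // 2
    canvas[y:y + h, x:x + w] = crop
    return cv2.resize(canvas, (size, size), interpolation=cv2.INTER_AREA)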

1 Answer


Maybe with an active contour model of some sort? For example, I found this library: https://github.com/pmneila/morphsnakes

I took your final "4" digit:

[image: the split "4"]

After some quick tweaking (without really understanding the parameters, so a better result may well be possible), I got this:

[image: the "4" recovered as one piece]

with the following code (I also hacked morphsnakes.py a little to save the images):

import morphsnakes

import numpy as np
# note: scipy.misc.imread was removed from newer SciPy; imageio.imread works as a replacement
from scipy.misc import imread
from matplotlib import pyplot as ppl

def circle_levelset(shape, center, sqradius):
    """Build a binary function with a circle as the 0.5-levelset."""
    grid = np.mgrid[list(map(slice, shape))].T - center
    phi = sqradius - np.sqrt(np.sum((grid.T)**2, 0))
    return np.float_(phi > 0)

img = imread("four.png")[..., 0] / 255.0

# g(I): edge indicator function, small where the image gradient is strong
gI = morphsnakes.gborders(img, alpha=900, sigma=3.5)

# Morphological GAC; balloon=-1 applies a contraction force, so the circle
# initialized below shrinks onto the digit.
mgac = morphsnakes.MorphGAC(gI, smoothing=1, threshold=0.29, balloon=-1)
mgac.levelset = circle_levelset(img.shape, (39, 39), 39)

# Visual evolution.
ppl.figure()
morphsnakes.evolve_visual(mgac, num_iters=50, background=img)
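
If the level set read back after the run matches what the code assigns above (an assumption on my part; I only skimmed the library), the joined digit can be recovered as a binary mask:

# threshold the evolved level set to get the filled "4" as a mask
mask = (mgac.levelset > 0.5).astype(np.uint8) * 255
ppl.imshow(mask, 'gray')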

– Headcrab