I'm attempting to split the handwritten text from a dataset of NIST forms into separate lines. Here is a link to the dataset: https://www.nist.gov/srd/nist-special-database-19

[Images: Example Image, Example Form]

The code I'm using is based on a similar question on Stack Overflow, but it doesn't quite work because some of the characters are touching. Here is the code:

import cv2
import numpy as np
#import image
image = cv2.imread('form1.jpg')
#cv2.imshow('orig',image)
#cv2.waitKey(0)

#grayscale
gray = cv2.cvtColor(image,cv2.COLOR_BGR2GRAY)
cv2.imshow('gray',gray)
cv2.waitKey(0)

#binary
ret,thresh = cv2.threshold(gray,127,255,cv2.THRESH_BINARY_INV)
cv2.imshow('second',thresh)
cv2.waitKey(0)

#dilation
kernel = np.ones((5,100), np.uint8)
img_dilation = cv2.dilate(thresh, kernel, iterations=1)
cv2.imshow('dilated',img_dilation)
cv2.waitKey(0)

#find contours
# findContours returns 3 values in OpenCV 3.x but 2 in 4.x; [-2:] works with both
ctrs, hier = cv2.findContours(img_dilation.copy(), cv2.RETR_EXTERNAL,
                              cv2.CHAIN_APPROX_SIMPLE)[-2:]

#sort contours top to bottom (by y) so the lines come out in reading order
sorted_ctrs = sorted(ctrs, key=lambda ctr: cv2.boundingRect(ctr)[1])

for i, ctr in enumerate(sorted_ctrs):
    # Get bounding box
    x, y, w, h = cv2.boundingRect(ctr)

    # Getting ROI
    roi = image[y:y+h, x:x+w]

    # show ROI
    cv2.imshow('segment no:'+str(i),roi)
    cv2.rectangle(image,(x,y),( x + w, y + h ),(90,0,255),2)
    cv2.waitKey(0)

cv2.imshow('marked areas',image)
cv2.waitKey(0)        

How can I get it to split the lines properly even when some of the characters are overlapping?

HAL

1 Answer

I am working on a similar problem, and your sample is of quite good quality.

From the code you posted, I can see that you use contour detection. You may want to experiment with an aspect-ratio restriction on the detected contours in order to omit the connected components, that is, the contours in which touching characters have merged neighbouring lines. Here you can find insights for your project, as well as a description of how to apply an aspect-ratio restriction.
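As a rough illustration of the aspect-ratio idea, the sketch below separates line-like bounding boxes from boxes that are too tall for their width, which is what a contour spanning two merged lines tends to look like. The function name `split_by_aspect_ratio` and the cut-off of 8.0 are assumptions made for this example, not values from the question; the cut-off would need tuning on the actual forms.

```python
def split_by_aspect_ratio(boxes, min_ratio=8.0):
    """Separate boxes that look like one text line from suspicious ones.

    boxes: iterable of (x, y, w, h) tuples, e.g. from cv2.boundingRect(ctr).
    min_ratio: hypothetical cut-off for width / height; tune it on your data.
    """
    line_like, suspicious = [], []
    for x, y, w, h in boxes:
        ratio = w / float(h) if h else 0.0
        if ratio >= min_ratio:
            line_like.append((x, y, w, h))
        else:
            # Too tall for its width: possibly two lines joined by touching strokes
            suspicious.append((x, y, w, h))
    return line_like, suspicious


# Made-up boxes: the second one is about twice as tall as the others
boxes = [(10, 10, 600, 40), (10, 60, 600, 95), (10, 170, 580, 38)]
line_like, suspicious = split_by_aspect_ratio(boxes)
print(line_like)
print(suspicious)
```

In the question's loop, the boxes would come from `cv2.boundingRect(ctr)` for each contour, and the suspicious ones could either be dropped or passed on to a post-processing step.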

If you still want to keep them, you will need to perform post-processing. That will involve either a morphological alteration of those elements or some sort of machine/deep learning.
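One simple post-processing step, sketched here as an assumption rather than as the method used in my project, is to cut a merged region at the row that contains the least ink (a horizontal projection profile). Note that this is a substitute for a true morphological operation, and the names `ink_profile` and `find_split_row` and the `margin` parameter are hypothetical.

```python
def ink_profile(binary_roi):
    """Count non-zero (ink) pixels in each row of a binary image region.

    Works on nested lists and on NumPy arrays; with NumPy,
    (binary_roi > 0).sum(axis=1) gives the same counts faster.
    """
    return [sum(1 for v in row if v) for row in binary_roi]


def find_split_row(profile, margin=0.25):
    """Return the index of the row with the least ink, ignoring the top and
    bottom `margin` fraction of rows, or None if no rows remain."""
    n = len(profile)
    lo = int(n * margin)
    hi = n - lo
    if hi <= lo:
        return None
    return min(range(lo, hi), key=lambda r: profile[r])


# Tiny made-up region: two "lines" joined by a single stroke in row 3
roi = [
    [0, 255, 255, 255, 0, 255],
    [255, 255, 255, 255, 255, 255],
    [0, 255, 255, 0, 255, 255],
    [0, 0, 0, 255, 0, 0],
    [255, 255, 0, 255, 255, 255],
    [255, 255, 255, 255, 255, 255],
    [0, 255, 255, 255, 255, 0],
    [0, 255, 0, 255, 255, 0],
]
profile = ink_profile(roi)
split = find_split_row(profile)
upper, lower = roi[:split], roi[split:]
print(profile, split)
```

On the real forms, `roi` would be the thresholded (not dilated) crop of a suspicious bounding box, so that the gap between the two lines is still visible in the profile.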

That is the most complicated part, and it has many possible solutions. For my project I used a convolutional neural network with Keras to train a model that classifies letters and puts outliers in a separate class. As you might have guessed, I added a couple more classes, with their own training data (thousands of training examples will be needed), to the existing dataset of MNIST-like letters.
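To give an idea of what such a classifier could look like, here is a minimal sketch and not my actual model. It assumes 28x28 grayscale inputs, 26 letter classes and one extra outlier class; the layer sizes, the constants and the name `build_classifier` are illustrative assumptions.

```python
NUM_LETTER_CLASSES = 26             # assumption: one class per letter
OUTLIER_LABEL = NUM_LETTER_CLASSES  # extra class for touching/garbage segments
NUM_CLASSES = NUM_LETTER_CLASSES + 1


def build_classifier(num_classes=NUM_CLASSES, input_shape=(28, 28, 1)):
    """Build a small CNN whose last class is reserved for outliers."""
    # Imported here so the constants above are usable without TensorFlow
    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    # Integer labels 0..25 for letters, 26 for outliers
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Training would then be a call such as `build_classifier().fit(x_train, y_train, epochs=...)`, where `x_train` and `y_train` are placeholders for the letter dataset extended with the labelled outlier examples.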

Good luck!

Bolat Tleubayev