
I have a grayscale image of printed text. I want to extract every individual character from the image so that I can save them as discrete images. I don't want to recognise what the character is, I just want each glyph as a separate file.

I'm using cv2, for example:

# Find contours to isolate individual letters
contours, _ = cv2.findContours(binary_image, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

That works perfectly for contiguous characters - that is, where the shape of the glyph has no breaks.

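But it doesn't work on characters made of more than one piece, like i, j, :, and ;. The dots are separate shapes, so they come back as their own contours and are not included with the rest of the glyph.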

Is there a way to use cv2 to detect these characters as single glyphs? I know the document uses only Latin letters, numbers, and punctuation.

The document uses a fairly archaic typeface and doesn't work well with Tesseract or other traditional OCR engines - which is why I want to detect the individual letters, rather than try to recognise them.

Terence Eden
  • Thanks @ChristophRackwitz that's helpful. Given that I want to extract individual characters as images, which OCR engine would you recommend that I use? – Terence Eden Jul 26 '23 at 07:35
  • Thanks again @ChristophRackwitz when you say "train a simple classifier" could you point me to some resources you personally recommend? – Terence Eden Jul 26 '23 at 07:37

1 Answer


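I used OpenCV's erode function (cv2.erode) with a vertical kernel to smear the text vertically. Erosion takes the local minimum under the kernel, so dark ink on a light background grows along the kernel's direction.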

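import cv2
import numpy as np

# Vertical 3x1 structuring element centred in a 5x5 kernel
kernel = np.array([[0, 0, 0, 0, 0],
                   [0, 0, 1, 0, 0],
                   [0, 0, 1, 0, 0],
                   [0, 0, 1, 0, 0],
                   [0, 0, 0, 0, 0]], dtype=np.uint8)

# image is the grayscale page; each pass extends the ink by one pixel up and down
erode = cv2.erode(image, kernel, iterations=6)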

That transformed this:

Old printed text

Into this:

Text which has been vertically deformed

That joined the dots on the i and ? characters while leaving enough horizontal space to make detection possible.

I did the detection on the eroded image, but applied the cropping to the original image.

Terence Eden