I have a grayscale image of printed text. I want to extract every individual character from the image so that I can save them as discrete images. I don't want to recognise what the character is, I just want each glyph as a separate file.
I'm using `cv2`, for example:
# Find contours to isolate individual letters
contours, _ = cv2.findContours(binary_image, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
That works perfectly for contiguous characters - that is, where the shape of the glyph has no breaks.
But it doesn't work on characters like `i`, `j`, `:`, and `;` - the dots on top are not included.
Is there a way to use `cv2` to detect these characters as single glyphs, dots included? I know the document uses only Latin letters, numbers, and punctuation.
The document uses a fairly archaic typeface and doesn't work well with Tesseract or other traditional OCR engines - which is why I want to detect the individual letters, rather than try to recognise them.