How to extract the pixels of a specific color for OCR?

Question

I want to run some small images/sprites through OCR (Tesseract, probably) and extract numbers or words from them. I know these numbers/words will be of a specific color (let's say white on a noisy/colored background).

While reading about pre-processing images for OCR, I thought it would be really beneficial to just remove everything that's not white from the image.

I'm using both ImageMagick and vips, but I have no idea where to start, which operations to use, or what to search for.

Mark Setchell · Accepted Answer · 2020-08-07T11:05:24.440

If we make a sample image like this:

magick -size 300x100 xc: +noise random -gravity center -fill white -pointsize 48 -annotate 0 "Hello" captcha.png

You can then fill with black anything that is not white:

magick captcha.png -fill black +opaque white result.png

If you want to accept colours close to white as being white, you can include some "fuzz":

magick captcha.png -fuzz 10% -fill black +opaque white result.png

score 2 · Answer 2 · answered Aug 07 '20 at 11:41

There was a discussion on the libvips tracker a few months ago about techniques for background removal:

https://github.com/libvips/libvips/issues/1567

Here's the filter:

#!/usr/bin/python3

import sys
import pyvips

image = pyvips.Image.new_from_file(sys.argv[1], access="sequential")

# aim for 250 for paper with low freq. removal
# ink seems to be slightly blueish
paper = 250
ink = [150, 160, 170]

# remove low frequencies .. don't need huge accuracy
low_freq = image.gaussblur(20, precision="integer")
image = image - low_freq + paper

# pull the ink down
ink_target = 30
scale = [(paper - ink_target) / (paper - i) for i in ink]
offset = [ink_target - i * s for i, s in zip(ink, scale)]
image = image * scale + offset

# find distance to white of each pixel ... small distances go to white
white = [100, 0, 0]
image = image.colourspace("lab")
d = image.dE76(white)
image = (d < 12).ifthenelse(white, image)

# boost saturation (scale ab)
image = image * [1, 2, 2]

image.write_to_file(sys.argv[2])

It removes low frequencies (i.e. paper folds etc.), stretches the contrast range, finds pixels close to white in CIELAB and moves them to white, and boosts saturation.
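
To make the contrast-stretch step easier to tune, here is a small standalone check of the arithmetic, reusing the paper, ink and ink_target values from the filter. It uses plain Python lists instead of pyvips images, and the stretch helper is a hypothetical name introduced only for this illustration. It shows that the per-channel linear map sends the ink colour to ink_target while leaving the paper level unchanged:

```python
# Values copied from the filter above.
paper = 250
ink = [150, 160, 170]
ink_target = 30

# Per-channel linear map: value * scale + offset.
scale = [(paper - ink_target) / (paper - i) for i in ink]
offset = [ink_target - i * s for i, s in zip(ink, scale)]


def stretch(pixel):
    """Apply the linear map to one RGB pixel (hypothetical helper)."""
    return [v * s + o for v, s, o in zip(pixel, scale, offset)]


ink_out = stretch(ink)            # each channel lands on ink_target
paper_out = stretch([paper] * 3)  # each channel stays at paper

print(ink_out)
print(paper_out)
```

If your ink or paper levels differ, changing those three values is probably the first thing to try.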

You'd probably need to tune it a bit for your use-case. Post some sample images if you need more advice.

score 1 · Answer 3 · answered Aug 07 '20 at 03:16

I'm no expert in this area, but maybe try changing all pixels with RGB values below a certain threshold to black, or delete them? As I mentioned before, I'm not very knowledgeable in any of this, but I don't see why this wouldn't work.

Isaac

This makes sense, but that wouldn't work for other colors, right? I'd like to avoid manually going through each pixel and filtering them out based on RGB values – Thiago Belem Aug 07 '20 at 03:25

I think it would work for other colors, you'd probably have to establish an upper threshold too, though, and I can't think of a way to remove the noise/other colors without doing it pixel by pixel, although there are probably tools in imagemagick, vips, and whatever language you're using to do it a different way that I haven't thought or learned of. – Isaac Aug 07 '20 at 03:27

score 0 · Answer 4 · answered Aug 07 '20 at 08:28

If the images are synthetic and uncompressed, you can test for strict equality of the RGB values. Otherwise, use a threshold on the distance between the RGB triples (Euclidean or Manhattan for instance).
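
As a rough sketch of that idea, the snippet below works on a nested list of RGB tuples rather than a real image, so it stays independent of any imaging library. The names keep_colour, euclidean and manhattan, as well as the tolerance values, are assumptions made for illustration. A tolerance of 0 gives the strict-equality test; a larger one accepts near matches:

```python
def euclidean(a, b):
    """Straight-line distance between two RGB triples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def manhattan(a, b):
    """Sum of absolute per-channel differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def keep_colour(pixels, target, tolerance, distance=euclidean):
    """Set every pixel farther than `tolerance` from `target` to black."""
    black = (0, 0, 0)
    return [
        [p if distance(p, target) <= tolerance else black for p in row]
        for row in pixels
    ]


image = [
    [(255, 255, 255), (250, 248, 252), (200, 30, 40)],
    [(12, 200, 90), (255, 255, 255), (128, 128, 128)],
]

white = (255, 255, 255)
exact = keep_colour(image, white, 0)                       # strict equality
fuzzy = keep_colour(image, white, 20, distance=manhattan)  # near-white kept
```

On real images you would want a vectorised version of the same comparison, since looping over pixels in Python is slow.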

If you want to allow variations in the lightness but not in the color, you can convert to HLS and compare HS.
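
A minimal sketch of the hue/saturation comparison, using Python's standard colorsys module. The helper names and the two tolerances are hypothetical and would need tuning. Hue is circular, so the difference is wrapped around before comparing:

```python
import colorsys


def hue_sat(rgb):
    """Return (hue, saturation) in 0..1 for an 8-bit RGB triple."""
    r, g, b = (c / 255 for c in rgb)
    h, _lightness, s = colorsys.rgb_to_hls(r, g, b)
    return h, s


def same_colour(rgb, target, hue_tol=0.03, sat_tol=0.15):
    """Compare hue and saturation only, ignoring lightness.

    hue_tol and sat_tol are illustrative values, not recommendations.
    """
    h1, s1 = hue_sat(rgb)
    h2, s2 = hue_sat(target)
    dh = abs(h1 - h2)
    dh = min(dh, 1 - dh)  # hue wraps around at 1.0
    return dh <= hue_tol and abs(s1 - s2) <= sat_tol


red = (200, 30, 40)
darker_red = (100, 15, 20)  # same hue and saturation, lower lightness
blue = (30, 40, 200)

print(same_colour(darker_red, red))
print(same_colour(blue, red))
```

Note that this only helps for saturated target colours. White and grey have zero saturation and no meaningful hue, so for white text a lightness test is still needed.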
