
I want to run some small images/sprites through OCR (probably Tesseract) and extract numbers or words from them. I know the text will always be a specific color (say, white on a noisy or colored background).

While reading about pre-processing images for OCR, I thought it would be really beneficial to just remove everything that's not white from the image.

I'm using both ImageMagick and vips, but I have no idea where to start, which operations to use, or what to search for.

Thiago Belem

4 Answers

If we make a sample image like this:

magick -size 300x100 xc: +noise random -gravity center -fill white -pointsize 48 -annotate 0 "Hello" captcha.png

You can then fill with black anything that is not white:

magick captcha.png -fill black +opaque white result.png

If you want to accept colours close to white as being white, you can include some "fuzz":

magick captcha.png -fuzz 10% -fill black +opaque white result.png

Mark Setchell

There was a discussion on the libvips tracker a few months ago about techniques for background removal:

https://github.com/libvips/libvips/issues/1567

Here's the filter:

#!/usr/bin/python3

import sys 
import pyvips

image = pyvips.Image.new_from_file(sys.argv[1], access="sequential")

# aim for 250 for paper with low freq. removal
# ink seems to be slightly blueish
paper = 250
ink = [150, 160, 170]

# remove low frequencies .. don't need huge accuracy
low_freq = image.gaussblur(20, precision="integer")
image = image - low_freq + paper

# pull the ink down
ink_target = 30
scale = [(paper - ink_target) / (paper - i) for i in ink]
offset = [ink_target - i * s for i, s in zip(ink, scale)]
image = image * scale + offset

# find distance to white of each pixel ... small distances go to white
white = [100, 0, 0]
image = image.colourspace("lab")
d = image.dE76(white)
image = (d < 12).ifthenelse(white, image)

# boost saturation (scale ab)
image = image * [1, 2, 2]

image.write_to_file(sys.argv[2])

It removes low frequencies (i.e. paper folds etc.), stretches the contrast range, finds pixels close to white in CIELAB and moves them to white, and boosts saturation.
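The contrast stretch in the script is a per-channel linear map, chosen so that the ink colour lands on `ink_target` while the paper level stays where it is. Here is a quick check of that arithmetic in plain Python, reusing the script's constants. The helper name `stretch` is made up for illustration and is not part of pyvips:

```python
# Constants copied from the filter above.
paper = 250
ink = [150, 160, 170]
ink_target = 30

# Per-channel gain and offset of the linear map v -> v * scale + offset.
scale = [(paper - ink_target) / (paper - i) for i in ink]
offset = [ink_target - i * s for i, s in zip(ink, scale)]


def stretch(pixel):
    """Apply the per-channel linear map to one RGB pixel (hypothetical helper)."""
    return [v * s + o for v, s, o in zip(pixel, scale, offset)]


ink_out = stretch(ink)            # every channel should land on ink_target
paper_out = stretch([paper] * 3)  # the paper level should be unchanged
print([round(v, 6) for v in ink_out], [round(v, 6) for v in paper_out])
```

If your text is white rather than dark ink, the same map should still apply with different anchor values, but that is untested here.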

You'd probably need to tune it a bit for your use-case. Post some sample images if you need more advice.

jcupitt

I'm no expert in this area, but you could try setting every pixel whose RGB values fall below a certain threshold to black, or deleting those pixels. I don't see why this wouldn't work.
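As a rough sketch of that idea in plain Python, assuming "below a threshold" means that any one of the three channels is below it, and treating an image as a flat list of RGB tuples. The function name `keep_bright` and the threshold of 200 are made up for illustration:

```python
def keep_bright(pixels, threshold=200):
    """Turn every pixel that has any channel below `threshold` black."""
    black = (0, 0, 0)
    return [p if min(p) >= threshold else black for p in pixels]


# Two near-white pixels, one saturated red, one bluish background pixel.
pixels = [(255, 255, 255), (250, 248, 251), (255, 40, 40), (90, 120, 200)]
cleaned = keep_bright(pixels)
print(cleaned)
```

In practice you would let ImageMagick or vips do this over the whole image at once rather than loop over pixels yourself.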

Isaac
    This makes sense, but that wouldn't work for other colors, right? I'd like to avoid manually going through each pixel and filtering them out based on RGB values – Thiago Belem Aug 07 '20 at 03:25
    I think it would work for other colors, you'd probably have to establish an upper threshold too, though, and I can't think of a way to remove the noise/other colors without doing it pixel by pixel, although there are probably tools in imagemagick, vips, and whatever language you're using to do it a different way that I haven't thought or learned of. – Isaac Aug 07 '20 at 03:27

If the images are synthetic and uncompressed, you can test for strict equality of the RGB values. Otherwise, use a threshold on the distance between the RGB triples (Euclidean or Manhattan for instance).
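A minimal sketch of both tests in plain Python, treating an image as a flat list of RGB tuples. The names `isolate`, `euclidean` and `manhattan` and the tolerance of 30 are assumptions for illustration; a tolerance of 0 gives the strict-equality case:

```python
import math


def euclidean(a, b):
    """Straight-line distance between two RGB triples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute per-channel differences between two RGB triples."""
    return sum(abs(x - y) for x, y in zip(a, b))


def isolate(pixels, target, max_dist, dist=euclidean):
    """Keep pixels within `max_dist` of `target`; turn the rest black."""
    black = (0, 0, 0)
    return [p if dist(p, target) <= max_dist else black for p in pixels]


white = (255, 255, 255)
pixels = [(255, 255, 255), (245, 250, 242), (200, 200, 200), (255, 0, 0)]

strict = isolate(pixels, white, 0)                     # exact matches only
loose = isolate(pixels, white, 30)                     # Euclidean tolerance
loose_l1 = isolate(pixels, white, 30, dist=manhattan)  # Manhattan tolerance
print(strict, loose, loose_l1, sep="\n")
```

The right tolerance depends on how much compression or anti-aliasing has smeared the text color, so expect to tune it against real samples.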

If you want to allow variations in the lightness but not in the color, you can convert to HLS and compare HS.
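A sketch of the HS comparison using Python's standard `colorsys` module. The function names and both tolerances are assumptions for illustration. The example targets yellow rather than white, because white has zero saturation and no meaningful hue, so an HS test against white would match every shade of gray:

```python
import colorsys


def hue_sat(rgb):
    """Return (hue, saturation) in 0..1 for an 8-bit RGB triple."""
    r, g, b = (c / 255 for c in rgb)
    h, _lightness, s = colorsys.rgb_to_hls(r, g, b)
    return h, s


def same_color(rgb, target, hue_tol=0.03, sat_tol=0.15):
    """True if hue and saturation match the target; lightness is ignored."""
    h1, s1 = hue_sat(rgb)
    h2, s2 = hue_sat(target)
    dh = abs(h1 - h2)
    dh = min(dh, 1.0 - dh)  # hue is circular, so 0.99 and 0.01 are close
    return dh <= hue_tol and abs(s1 - s2) <= sat_tol


yellow = (255, 255, 0)
print(same_color((200, 200, 0), yellow))    # darker yellow
print(same_color((255, 255, 128), yellow))  # lighter yellow
print(same_color((255, 0, 0), yellow))      # red
print(same_color((128, 128, 128), yellow))  # gray
```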