
I am dealing with a kind of captcha that contains noisy stripes. The stripes are straight and drawn in random directions. The colors of the digits and the stripes are random.

[Three sample captcha images]

The code below is able to recognize digits from some captchas with the help of tesseract.

from pytesser.pytesser import *
from PIL import Image, ImageFilter, ImageEnhance

im = Image.open("test.tiff")
im = im.filter(ImageFilter.MedianFilter()) # blur the image, the stripes will be erased
im = ImageEnhance.Contrast(im).enhance(2)  # increase the contrast (to make image clear?)
im = im.convert('1')                       # convert to black-white image
text = image_to_string(im)
print "text={}".format(text)

The approach to removing the stripes is to blur the image first and then sharpen it again. The digits are recognized correctly in most cases, but I wonder whether there are other approaches that remove the stripes without blurring the digits.

Any hints are highly appreciated.

stanleyxu2005

3 Answers


Why not try to leverage how thin the stripes are? I'd guess they're at most 5px. So why not do something like (rough pseudocode):

  1. Convert your image to a numpy array
  2. For direction in UP, DOWN, LEFT, RIGHT
    1. Make a new numpy array shifted 5px in direction, cropping off the edge.
    2. AND together your new array and old array.
    3. Check the bottom left corner. If it's white, you're done and your image is denoised. If not, try the next direction.

Given that the numbers are much thicker than the stripes, my guess would be that clearing out the stripes from the image would outweigh any distortion introduced from the AND.
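The shift-and-AND step above can be sketched with numpy as follows. This is only an illustration under assumptions that are not in the answer: the image has already been thresholded into a boolean mask where `True` means "ink" (digit or stripe), and the helper name `shift_and` is made up for this example.

```python
import numpy as np

def shift_and(mask, dy, dx):
    """AND a boolean foreground mask with a copy of itself shifted by (dy, dx)."""
    h, w = mask.shape
    shifted = np.zeros_like(mask)
    # Pixels shifted in from outside the image stay False (background).
    src_y = slice(max(0, -dy), min(h, h - dy))
    dst_y = slice(max(0, dy), min(h, h + dy))
    src_x = slice(max(0, -dx), min(w, w - dx))
    dst_x = slice(max(0, dx), min(w, w + dx))
    shifted[dst_y, dst_x] = mask[src_y, src_x]
    # Ink survives only where both the original and the shifted copy have ink.
    return mask & shifted

# Synthetic example: a 2px-thick horizontal stripe and a 10x10 "digit" block.
mask = np.zeros((24, 24), dtype=bool)
mask[0:2, :] = True        # stripe
mask[8:18, 6:16] = True    # thick block standing in for a digit
cleaned = shift_and(mask, 5, 0)  # shift down by 5px, across the stripe
```

Shifting across the stripe removes it while part of the thick block survives; shifting along the stripe (here `shift_and(mask, 0, 5)`) leaves most of it in place, which is why the pseudocode tries several directions. For a real captcha, something like `np.array(im.convert('L')) < 128` could produce the mask, but that threshold is a guess.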

Patrick Collins
  • Hm, this looks more or less the same as the blur-unblur code that you just posted. My other guess would be to train up an algorithm to recognize individual numbers and run that on individual number-sized chunks. [There's some really good machine learning research on the topic](http://yann.lecun.com/exdb/mnist/), although there's no noise in those samples. That would move you away from `tesseract`, though. – Patrick Collins Jun 18 '14 at 08:31
  • Does training mean providing a lot of captchas with correct answers? As I lack background knowledge in machine learning and am on a tight schedule, I did not invest time in this direction. I've tried compositing the blur-unblur image with the original to remove the stripes. It works, but the digits in the composite image look very light, which affects the OCR result more than the stripes do. Currently I have improved the accuracy with this approach: (1) collect the colors of the four corners, (2) replace these colors with the background color image-wide, (3) run the original image processing on the new image. – stanleyxu2005 Jun 19 '14 at 03:53
  • @stanleyxu2005 You wouldn't need whole captchas with correct answers, just single digits. But yeah, you'd need a decent-sized training set. I did a project for school that involved recognizing digits from the database linked above. The basic idea is that your algorithm keeps a mental "image" of what each number looks like, and since the digit parts are consistently placed, they get "burned in" at a particular location, and the noise is ignored. Since the digits in your image always look identical, I would expect it to work well. But that involves re-implementing your OCR. – Patrick Collins Jun 19 '14 at 04:26

The second sample is very easy: scan the edges to identify the color of the stripes and turn this color to white. (These colored lines are not a robust captcha feature.)
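A minimal sketch of this edge scan, assuming a white background, an RGB image held in a numpy array, and digits that never touch the image border. The function name `erase_border_colors` and the `tol` parameter are invented for this example; `tol` allows for RGB values that vary slightly after lossy compression.

```python
import numpy as np

def erase_border_colors(rgb, background=(255, 255, 255), tol=0):
    """Paint over every pixel whose color matches a non-background border color."""
    rgb = np.asarray(rgb, dtype=np.int16)   # signed type so differences can go negative
    bg = np.array(background, dtype=np.int16)
    # Collect the pixels on the four edges of the image.
    border = np.concatenate([rgb[0], rgb[-1], rgb[:, 0], rgb[:, -1]])
    # Distinct border colors that differ from the background are taken as stripe colors.
    stripe_colors = [c for c in np.unique(border, axis=0)
                     if np.abs(c - bg).max() > tol]
    out = rgb.copy()
    for c in stripe_colors:
        # A pixel matches when every channel is within `tol` of the stripe color.
        match = np.abs(rgb - c).max(axis=2) <= tol
        out[match] = bg
    return out.astype(np.uint8)
```

With PIL this could be called as `erase_border_colors(np.array(im.convert('RGB')))` and converted back with `Image.fromarray`. Digits that share a color with a stripe are erased too, so this only helps for samples like the second one.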

The first and third raise a more serious issue because the stripes have the same color as some characters. You can deal with that by erasing only those stripe-colored pixels that have few neighbors. Even better is to analyze the outline of the image to identify the direction of the stripes and determine which neighborhood configurations correspond to a stripe pixel.

Technically speaking, you will perform an erosion operation with a suitable structuring element shape.
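The "few neighbors" rule could look roughly like the sketch below. It is a neighbor-count filter rather than a strict erosion, and it assumes a boolean mask that is `True` at stripe-colored pixels; the name `erase_sparse_pixels` and the default threshold are made up for the example.

```python
import numpy as np

def erase_sparse_pixels(mask, min_neighbors=3):
    """Keep only mask pixels that have at least `min_neighbors` set 8-neighbors."""
    h, w = mask.shape
    # Pad with background so border pixels simply have fewer neighbors.
    padded = np.pad(mask.astype(np.uint8), 1, mode="constant")
    count = np.zeros((h, w), dtype=np.uint8)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy or dx:  # skip the pixel itself
                count += padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
    return mask & (count >= min_neighbors)
```

A pixel on a 1px-wide line has only two set neighbors, while even the corner of a solid stroke has three, so the default threshold removes such lines and keeps thick strokes. Thicker or anti-aliased stripes would need a different threshold or a direction-aware structuring element.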

  • As I know the dimensions of a captcha, my naive idea is to scan 5px around the border and guess the color of the stripes. After that I might be able to remove the stripes. But I think some magic PIL methods could help to remove the stripes perfectly cleanly. – stanleyxu2005 Jun 18 '14 at 08:08
  • You will never achieve perfect results at places where a character stroke looks like a stripe. Anyway, to start with, you can extract a binary image where the original image is of the same color as the stripes (it will contain just the stripes and a few characters, in black). Then pass a max filter to erase the stripes, followed by a min filter to (approximately) recover the shape of the characters. Where the binary image and the min of max differ, you can erase. –  Jun 18 '14 at 08:19
  • I have taken a closer look. Due to the small resolution, the standard morphological operation will damage the characters too much. And because the characters are anti-aliased, it works poorly. I now find it much better just to erase the color of the stripes. But in the given samples there was probably some lossy compression, so that after decompression the RGB values are no longer constant. If this is the case in the original captchas, you need to detect the stripe color with some tolerance on the RGB values. –  Jun 18 '14 at 08:32
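For reference, the max-then-min filtering from the first comment could be sketched with PIL as follows. It assumes a mode "L" image with stripe-colored pixels in black (0) on white (255); the helper name `stripe_mask` is invented here. As the last comment notes, this tends to damage small, anti-aliased characters.

```python
from PIL import Image, ImageChops, ImageFilter

def stripe_mask(binary, size=3):
    """Return an "L" image that is 255 where thin dark structures were removed."""
    # Max filter: white spreads and swallows dark lines thinner than `size`.
    filtered = binary.filter(ImageFilter.MaxFilter(size))
    # Min filter: dark shapes that survived grow back to roughly their old size.
    filtered = filtered.filter(ImageFilter.MinFilter(size))
    # Pixels that changed are the candidates for erasing in the original.
    return ImageChops.difference(binary, filtered)
```

The nonzero pixels of the returned image mark where "the binary image and the min of max differ", i.e. the pixels that can be erased.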

There is a class of math problems in image processing called "inpainting".

You first need to obtain a mask of the stripes somehow before any of these methods can be applied.

Here is my library of articles: http://dpaste.com/0CZ25FT . All of the modern publications are there.

A couple of algorithms are implemented in OpenCV, "Navier-Stokes" and "Telea", but they aren't good for inpainting large regions.

You can also find some references to inpainting in SciKit, but no finished algorithms there.

Also, if the stripes are always 1px wide, they can easily be removed via dilation followed by erosion. See Gonzalez and Woods, "Digital Image Processing", for more information.

soupault