I have a captcha image that looks like this:
Using a utility called TesserCap from McAfee, I could apply a "chopping" filter to the image. (Before running it, I made sure there were only two colors in the image, white and black.) I was very impressed with the results of using that filter with a value of 2 in the text box. It accurately removed most of the noise but kept the main text, resulting in this:
I wanted to implement something like this in one of my own scripts, so I tried to find out which image processing library TesserCap uses. I couldn't find one; it turns out it uses its own code to process the image. I then read this whitepaper, which explains exactly how the program works. It gives the following description of what the chopping filter does:
If the contiguous number of pixels for given grayscale values are less than the number provided in the numeric box, the chopping filter replaces these sequences with 0 (black) or 255 (white) as per user choice. The CAPTCHA is analyzed in both horizontal and vertical directions and corresponding changes are made.
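Read literally, that description seems to amount to something like the sketch below. This is only my interpretation, not TesserCap's actual code: the names `chop_line`, `chop`, and `chop_image` are made up, I am assuming that only dark runs get replaced (with white), that "less than" is a strict comparison, and that the vertical pass runs on the output of the horizontal pass.

```python
def chop_line(line, threshold, dark=0, fill=255):
    """Return a copy of `line` where every run of `dark` pixels that is
    shorter than `threshold` is replaced by `fill`.
    Assumption: strict "less than", as in the whitepaper's wording."""
    out = list(line)
    i, n = 0, len(out)
    while i < n:
        if out[i] != dark:
            i += 1
            continue
        # Find the end of the current run of dark pixels.
        j = i
        while j < n and out[j] == dark:
            j += 1
        if j - i < threshold:
            out[i:j] = [fill] * (j - i)
        i = j
    return out


def chop(grid, threshold, dark=0, fill=255):
    """Apply chop_line to every row, then to every column of the result.
    `grid` is a list of rows holding only 0 (black) and 255 (white)."""
    rows = [chop_line(row, threshold, dark, fill) for row in grid]
    cols = [chop_line(col, threshold, dark, fill) for col in zip(*rows)]
    # Transpose back so the result is a list of rows again.
    return [list(row) for row in zip(*cols)]


def chop_image(src_path, dst_path, threshold):
    """Hypothetical wrapper: load an image with PIL, chop it, save it."""
    from PIL import Image  # imported here so chop() works without PIL

    img = Image.open(src_path).convert("L")
    width, height = img.size
    px = img.load()
    # Force the image to pure black and white before filtering.
    grid = [[0 if px[x, y] < 128 else 255 for x in range(width)]
            for y in range(height)]
    result = chop(grid, threshold)
    for y in range(height):
        for x in range(width):
            px[x, y] = result[y][x]
    img.save(dst_path)
```

With a strict comparison, a value of 2 would only remove runs that are a single pixel long. If TesserCap's comparison is actually inclusive, `j - i < threshold` would need to become `j - i <= threshold`.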
I am not sure I understand what it is doing. My script is in Python, so I tried using PIL to manipulate the pixels roughly as the quote describes. It sounds simple, but my attempt failed, probably because I didn't know exactly what the filter was doing:
(This is made from a slightly different captcha that uses a circular pattern.)
I also tried to see whether it could easily be done with ImageMagick's convert.exe. Its -chop option does something completely different. Using -median along with some -morphology commands helped reduce some of the noise, but nasty dots appeared and the letters became very distorted. It wasn't nearly as simple as applying the chopping filter in TesserCap.
So, my question is as follows: how do I implement the chopping filter of TesserCap in Python, be it using PIL or ImageMagick? That chopping filter works much better than any of the alternatives I've tried, but I can't seem to replicate it. I've been working on this for hours and haven't figured anything out yet.