This is a fully coded Python solution based on the direction provided by @eldesgraciado.
This code assumes that you are already working with a properly binarized, white-on-black image (e.g. after grayscale conversion, black-hat morphology and Otsu's thresholding). The OpenCV documentation recommends a white foreground on a black background when applying morphological operations and other binary processing.
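For completeness, a minimal sketch of how such a thresh_image might be produced is shown below; the file name, kernel size, and the exact preprocessing chain are assumptions on my part, so adjust them to your material:
import cv2

# hypothetical input path; assumes a dark-text-on-light-background scan
gray = cv2.imread('scan.png', cv2.IMREAD_GRAYSCALE)

# black-hat emphasizes dark text against a lighter background
rect_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 15))
blackhat = cv2.morphologyEx(gray, cv2.MORPH_BLACKHAT, rect_kernel)

# Otsu's thresholding yields the white-on-black binary image
_, thresh_image = cv2.threshold(blackhat, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)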
import cv2
import numpy as np

num_comps, labeled_pixels, comp_stats, comp_centroids = \
    cv2.connectedComponentsWithStats(thresh_image, connectivity=4)

min_comp_area = 10  # minimum component area in pixels
# get the labels of the components to keep based on the area stat
# (skip the background component at label 0)
remaining_comp_labels = [i for i in range(1, num_comps)
                         if comp_stats[i, cv2.CC_STAT_AREA] >= min_comp_area]
# keep only the pixels whose label survived the filter,
# set their intensity to 255 (uint8) and everything else to 0
clean_img = np.where(np.isin(labeled_pixels, remaining_comp_labels), 255, 0).astype('uint8')
The advantage of this solution is that it filters out the noise without further degrading characters that may already be compromised.
I work with dirty scans that suffer from defects like merged characters and character erosion, and I learned the hard way that there is no free lunch: even a seemingly harmless opening with a 3x3 kernel and a single iteration causes some character degradation, despite being very effective at removing the noise around the characters.
So if the character quality allows it, blunt cleanup operations on the entire image (e.g. blurring, opening, closing) are fine; if not, the per-component filtering above should be done first.
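For reference, the kind of whole-image opening I am describing is sketched below (it assumes the thresh_image from the snippet above; kernel size and iteration count match the case I mentioned):
import cv2
import numpy as np

# a single opening pass with a 3x3 kernel - effective against speckle noise,
# but it also thins strokes, which is the character degradation mentioned above
kernel = np.ones((3, 3), np.uint8)
opened = cv2.morphologyEx(thresh_image, cv2.MORPH_OPEN, kernel, iterations=1)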
P.S. One more thing: do not use a lossy format like JPEG when working with text images; use a lossless format like PNG instead.
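With OpenCV the output format is picked from the file extension, so saving the cleaned image losslessly is a one-liner (the file name here is just a placeholder):
cv2.imwrite('clean_text.png', clean_img)  # PNG is lossless; JPEG would reintroduce compression artifacts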