How can I remove the noise in the given image so that the OCR output is perfect?

Question

I applied Otsu thresholding to this Bengali text image and ran Tesseract OCR on it, but the output is very bad. What preprocessing should I apply to remove the noise? I also want to deskew the image, as it is slightly skewed. My code is given below:

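import codecs

import cv2
import tesserocr
from PIL import Image

# Read the image as grayscale (flag 0).
image = cv2.imread("crop2.bmp", 0)
# With THRESH_OTSU the threshold is computed automatically; the 128 passed here is ignored.
(thresh, bw_img) = cv2.threshold(image, 128, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)

img = Image.fromarray(bw_img)
text = tesserocr.image_to_text(img, lang='ben')
# Write the recognized text as UTF-8; the file is closed automatically.
with codecs.open("output_text", "w", "utf-8") as file:
    file.write(text)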

provide your input image... maybe Otsu is not a good choice. — Piglet, Jan 09 '18 at 22:12
Is that really the original image? It looks like it has gone through a fax machine! This is a problem much more difficult than what can be answered here, I think. — Cris Luengo, Jan 10 '18 at 02:15
Hi @dojo if this or any answer has solved your question please consider accepting it by clicking the check-mark. This indicates to the wider community that you've found a solution and gives some reputation to both the answerer and yourself. There is no obligation to do this. — Dmitrii Z., Jan 11 '18 at 15:25

score 1 · Answer 1 · answered Jan 10 '18 at 10:25

You can remove the noise by deleting small connected components, which might improve the OCR accuracy. You will also need to tune the size threshold (min_size in the code) that separates noisy components from text.

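import cv2
import numpy as np

# Read as grayscale, then binarize inverted so that text is white on black.
img = cv2.imread(r'D:\Image\st5.png', 0)
ret, bw = cv2.threshold(img, 128, 255, cv2.THRESH_BINARY_INV)

connectivity = 4
nb_components, output, stats, centroids = cv2.connectedComponentsWithStats(
    bw, connectivity=connectivity, ltype=cv2.CV_32S)
# Label 0 is the background, so skip it; CC_STAT_AREA is the pixel count per component.
sizes = stats[1:, cv2.CC_STAT_AREA]
nb_components = nb_components - 1
min_size = 50  # threshold value for small noisy components
img2 = np.zeros(output.shape, np.uint8)

# Keep only the components that have at least min_size pixels.
for i in range(nb_components):
    if sizes[i] >= min_size:
        img2[output == i + 1] = 255

# Invert back to black text on a white background.
res = cv2.bitwise_not(img2)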

Denoised image:
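
One possible way to look for a reasonable min_size is to count how many components survive at several candidate thresholds and see where the count stops dropping sharply. The sketch below is only an illustration: the helper name survivors_per_threshold and the sample areas are hypothetical and not part of the answer. In practice the areas would be the sizes array computed in the code above.

```python
def survivors_per_threshold(areas, candidates):
    """Map each candidate min_size to the number of components kept.

    `areas` is any iterable of component pixel counts, for example the
    `sizes` array from the answer's code. A component is kept when its
    area is at least the candidate threshold, matching the answer's
    `sizes[i] >= min_size` test.
    """
    areas = list(areas)
    return {t: sum(1 for a in areas if a >= t) for t in candidates}


# Made-up areas for illustration: a few tiny specks and a few larger glyphs.
example_areas = [3, 5, 8, 40, 120, 300, 450]
counts = survivors_per_threshold(example_areas, candidates=(10, 50, 100))
for threshold, kept in counts.items():
    print(f"min_size={threshold}: {kept} components kept")
```

A threshold past which the count stays flat is a candidate value, but small legitimate marks such as dots and diacritics can have areas close to the noise, so the result should still be checked visually and against the OCR output.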
