
enter image description here

I applied Otsu thresholding to this Bengali text image and ran Tesseract OCR on it, but the output is very bad. What preprocessing should I apply to remove the noise? I also want to deskew the image, as it is slightly skewed. My code is given below:

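import codecs
import cv2
import tesserocr
from PIL import Image

# Read the image as grayscale (flag 0)
image = cv2.imread("crop2.bmp", 0)
# With THRESH_OTSU the threshold is computed from the histogram, so the value passed here (0) is ignored
(thresh, bw_img) = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)

img = Image.fromarray(bw_img)
text = tesserocr.image_to_text(img, lang='ben')
# Write the recognized text as UTF-8; the file is closed automatically
with codecs.open("output_text", "w", "utf-8") as f:
    f.write(text)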
durjoy
  • provide your input image... maybe Otsu is not a good choice. – Piglet Jan 09 '18 at 22:12
  • I have added input image in the top of the question – durjoy Jan 09 '18 at 22:14
  • Is that really the original image? It looks like it has gone through a fax machine! This is a problem much more difficult than what can be answered here, I think. – Cris Luengo Jan 10 '18 at 02:15

1 Answer


You can remove the noise by deleting small connected components, which might improve the OCR accuracy. You will also need to find a good value for the size threshold below which a component is treated as noise.

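import cv2
import numpy as np

# Load as grayscale and binarize with inverted polarity so the text becomes white (foreground)
img = cv2.imread(r'D:\Image\st5.png', 0)
ret, bw = cv2.threshold(img, 128, 255, cv2.THRESH_BINARY_INV)

# Label the connected components; label 0 is the background
connectivity = 4
nb_components, output, stats, centroids = cv2.connectedComponentsWithStats(bw, connectivity=connectivity, ltype=cv2.CV_32S)
sizes = stats[1:, cv2.CC_STAT_AREA]  # pixel area of each component, background skipped
nb_components = nb_components - 1
min_size = 50  # size threshold for small noisy components; tune it for your image

# Keep only the components that have at least min_size pixels
img2 = np.zeros(output.shape, np.uint8)
for i in range(0, nb_components):
    if sizes[i] >= min_size:
        img2[output == i + 1] = 255

# Invert back to dark text on a light background
res = cv2.bitwise_not(img2)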

Denoised image:

enter image description here

flamelite