I have this image.

[image: cv2 output with boxes drawn around the detected words]

Running pytesseract with Python 3.8 produced the following problems:

  1. The word "Phone" is read as O (the letter O as in Oscar, not zero).
  2. The word "Fax" is read as 2%.
  3. The phone number is read as (56031770

The image under consideration does not contain the boxes. They come from the cv2 output, where boxes were drawn around the detected text regions/words.

The fax number, (866)357-7704, is read without a problem, including the parentheses and the hyphen.

The image size is 23 megapixels (converted from a PDF file). The image has been preprocessed with thresholding in OpenCV to produce a binary image. The image does not contain bold fonts, so I did not use any erosion.

What can I do to properly read the Phone Number? Thank you.

PS: I am using image_to_data (not image_to_string) because I also need to know the locations of the strings on the page.

Edit: here is the relevant part of code:

from PIL import Image
import pytesseract
from pytesseract import Output
import argparse
import cv2
import os
import numpy as np
import math
from pdf2image import convert_from_path 
from scipy.signal import convolve2d
import string

filename = "image.png"
image = cv2.imread(filename)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# estimate noise on image
H, W = gray.shape
M = [[1, -2, 1],
    [-2, 4, -2],
    [1, -2, 1]]

sigma = np.sum(np.sum(np.absolute(convolve2d(gray, M))))
sigma = sigma * math.sqrt(0.5 * math.pi) / (6 * (W-2) * (H-2))

# if the image is too noisy, smooth it with a median blur
if sigma > 10:
    # noisy
    gray = cv2.medianBlur(gray, 3)
    print("noise reduced with median blur")
# otherwise binarize it with Otsu thresholding
else:
    gray = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
    print("threshold applied")


d = pytesseract.image_to_data(gray, output_type=Output.DICT)
for t in d['text'] :
    print(t)
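
The loop above prints only the recognized text. Since the locations and confidences are also of interest, here is a minimal sketch of how the dictionary returned by image_to_data could be unpacked word by word. The helper name words_with_boxes and the sample values (including the confidence numbers) are made up for illustration; only the keys text, conf, left, top, width and height come from pytesseract's output.

```python
def words_with_boxes(data, min_conf=0.0):
    """Pair each non-empty word with its confidence and bounding box.

    `data` is assumed to be shaped like the dict returned by
    pytesseract.image_to_data(..., output_type=Output.DICT).
    """
    words = []
    for i, text in enumerate(data["text"]):
        # conf can arrive as str, int or float; non-word rows carry -1
        conf = float(data["conf"][i])
        if not text.strip() or conf < min_conf:
            continue
        box = (data["left"][i], data["top"][i],
               data["width"][i], data["height"][i])
        words.append({"text": text, "conf": conf, "box": box})
    return words


# Hypothetical sample, NOT real output from the image in question
sample = {
    "text": ["", "O", "(56031770"],
    "conf": ["-1", "35", "52"],
    "left": [0, 120, 210],
    "top": [0, 40, 40],
    "width": [800, 60, 150],
    "height": [600, 22, 22],
}

for w in words_with_boxes(sample):
    print(w["text"], w["conf"], w["box"])
```

Printing the confidence next to each word would show whether the misread "Phone" and the phone number are low-confidence detections or confident mistakes.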

Since no config is passed, this runs with the default page segmentation mode (psm 3).

Versions:

Tesseract: 4.1.1 (retrieved with tesseract --version); pytesseract: 0.3.2 (retrieved with pip3 show pytesseract)

Sean