I am using pytesseract, Pillow, and cv2 to OCR an image and extract the text in it. Since my input is a scanned PDF document, I first convert it to an image (JPEG) and then try to extract the text. I am only halfway there. The input is a table, and the header titles are not being extracted because they have a black background. I also tried cv2.getStructuringElement but could not figure out how to make it work. Here is what I did:

import cv2
import os  
import numpy as np 
import pytesseract
#import pillow 

# Tesseract can't read a scanned PDF directly, so convert each page to JPEG with pdf2image first.
filename = path  # placeholder: path to the scanned PDF
from pdf2image import convert_from_path
pages = convert_from_path(filename, 500)  # 500 = render DPI
for page in pages:
    page.save("dest", 'JPEG')


imgname = "path" 
oriimg = cv2.imread(imgname,cv2.IMREAD_COLOR) 
cv2.imshow("original image", oriimg)
cv2.waitKey(0)


#img = cv2.resize(oriimg,None,fx=0.5,fy=0.5,interpolation=cv2.INTER_CUBIC) 
img = cv2.resize(oriimg,(700,1500),interpolation=cv2.INTER_AREA)
# dsize is (width, height)
cv2.imshow("lol", img) 
cv2.waitKey(0) 
cv2.imwrite("changed_dimensionsimgpath", img)


import PIL.Image  
image = cv2.imread(imgname,cv2.IMREAD_COLOR) 
grayedimg = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
# With THRESH_OTSU the threshold is chosen automatically; the 0 passed here is ignored.
grayedimg = cv2.threshold(grayedimg, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
cv2.imwrite("H://newim.jpg", grayedimg)


pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"


text = pytesseract.image_to_string(PIL.Image.open("path"))
print(text)

My input table looks like the image below. The regions with a black background are not recognized by the OCR, so their text is not extracted.

(image: input table)

developer
  • Have u tried cropping the table into 2 images, calling pytesseract on each to recognize the text, and finally assembling the text? – Lau Real Jan 14 '19 at 06:20
  • @LauReal, Thank you, I will try that. But for the part where the image is dark (black background), how do I read that, especially after I convert it to grayscale? – developer Jan 14 '19 at 06:50
  • Did u mean each table has a different header in width and height? – Lau Real Jan 14 '19 at 07:02
  • @LauReal, No, I meant that the first line of the table (which reads product, unit sales, ...) has a black background. I am unable to detect and read it. This is only a sample image; my real image is different, but it has the same problem: text on a black background is not detected. Some numbers in the rows of the table are also not being detected. – developer Jan 14 '19 at 07:18
  • So the question becomes: How to detect table and the data inside within a scanned image? – Lau Real Jan 14 '19 at 08:02
  • I found an online [converter](https://smallseotools.com/image-to-text-converter/) for u, have a try. I've tried it with ur uploaded image, and even the table header is converted. FYI, u can send an HTTP request with the Python `requests` lib and get the response. – Lau Real Jan 14 '19 at 08:31
  • @LauReal, Yes, as mentioned, I have a scanned pdf and some part of the data isn't being scanned by the tesseract engine. – developer Jan 14 '19 at 08:44
  • @LauReal, I wish to code the engine myself, rather than using API's – developer Jan 14 '19 at 08:45
  • Then this may be a question I cannot offer u help with. Good luck, and I hope u solve it ASAP. – Lau Real Jan 14 '19 at 08:56

1 Answer

I can suggest three possible approaches from an image-analysis perspective:

Splitting: You can split the work into two parts. The first part is just your normal flow (load the image, detect the text in it). In the second flow, you first take the negative of the image (255 - img) and then detect the text.

The two results will need to be merged afterwards.
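As a rough sketch of the splitting idea (not a tested solution for your table), the helper below runs OCR on a grayscale image and on its negative, then merges the two outputs. The function name `ocr_both_polarities` and the naive merge rule (keep each non-empty line once) are my own assumptions; `ocr` can be any callable that turns an image into a string, such as `pytesseract.image_to_string`.

```python
import numpy as np


def ocr_both_polarities(gray, ocr):
    """Run `ocr` on an 8-bit grayscale image and on its negative.

    `ocr` is any callable mapping an image array to a string,
    for example pytesseract.image_to_string.
    """
    negative = 255 - gray  # white-on-black text becomes black-on-white
    merged = []
    for text in (ocr(gray), ocr(negative)):
        for line in text.splitlines():
            line = line.strip()
            # Naive merge (an assumption): keep each non-empty line once.
            if line and line not in merged:
                merged.append(line)
    return "\n".join(merged)
```

With the question's variables this could be called as `ocr_both_polarities(grayedimg, pytesseract.image_to_string)`. Note that this merge does not restore the original row order of the table; lines found only in the negative are simply appended at the end.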

Difference filter: You can first apply a difference filter/edge detection. This will highlight everything with high contrast, BUT it can alter the shape of the letters if applied too strongly or if some letters are much bigger than others.
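A minimal sketch of such a difference filter, using plain NumPy neighbor differences rather than any particular OpenCV filter (that choice, and the name `difference_filter`, are assumptions for illustration). Dark-on-light and light-on-dark strokes produce the same bright edges, so the black header no longer looks different from the rest of the table.

```python
import numpy as np


def difference_filter(gray):
    """Return an 8-bit map of absolute differences between neighboring pixels.

    Black-on-white and white-on-black strokes both give bright edges.
    """
    img = gray.astype(np.int16)  # widen so the subtraction cannot wrap around
    dx = np.zeros_like(img)
    dy = np.zeros_like(img)
    dx[:, 1:] = np.abs(img[:, 1:] - img[:, :-1])  # horizontal differences
    dy[1:, :] = np.abs(img[1:, :] - img[:-1, :])  # vertical differences
    return np.clip(dx + dy, 0, 255).astype(np.uint8)
```

Before passing the result to OCR you would probably want dark strokes on a light page, for example `255 - difference_filter(grayedimg)`. The result contains letter outlines rather than solid letters, which is the shape change mentioned above.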

Contour finding + filling: Again an edge detection, but now a very thin one, followed by contour detection. This will redraw all letters in one color.

Tom Nijhof