4
import pytesseract
from pdf2image import convert_from_path, convert_from_bytes
import cv2
import numpy

def pil_to_cv2(image):
    # Convert a PIL image (RGB) into an OpenCV-style array (BGR)
    open_cv_image = numpy.array(image)
    return open_cv_image[:, :, ::-1].copy()


path = 'OriginalsFile.pdf'
images = convert_from_path(path)
cv_h = [pil_to_cv2(i) for i in images]
img_header = cv_h[0][:160, :]  # top 160 pixel rows of the first page
# print(pytesseract.image_to_string(Image.open('test.png')))  I only found this in tesseract docs

Hello, is there a way to read img_header directly with pytesseract, without saving it to disk first?

pytesseract docs

  • 1
    Where do you save it? And with the commented (and missing) code, it does what you expect it to? So you want **your** code not to use the image filename (note that backend code could still use some temporary files)? – CristiFati May 06 '20 at 11:16

2 Answers

0

pytesseract.image_to_string() input format

As the documentation explains, pytesseract.image_to_string() accepts a PIL image as input, so you can easily convert your OpenCV image into a PIL one like this:

from PIL import Image
... (your code)
print(pytesseract.image_to_string(Image.fromarray(img_header)))
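
A small sketch to go with the conversion above, resting on two assumptions that are not part of the original answer. First, the question's pil_to_cv2() returns arrays in BGR order, while PIL and tesseract expect RGB, so it is probably safer to flip the channels back before OCR. Second, as far as I know, recent pytesseract releases document passing a NumPy array straight to image_to_string() (it is still converted with PIL internally). The names crop_header_rgb and ocr_header are hypothetical helpers, not pytesseract functions.

```python
import numpy as np


def crop_header_rgb(cv_image, height=160):
    """Crop the top `height` rows and reverse the channel axis (BGR -> RGB)."""
    return np.ascontiguousarray(cv_image[:height, :, ::-1])


def ocr_header(cv_image, height=160):
    """OCR the header strip by passing the NumPy array straight to pytesseract."""
    # Imported here so the cropping helper stays usable without pytesseract;
    # the call itself needs pytesseract and the tesseract binary installed.
    import pytesseract

    # Assumption: the installed pytesseract version accepts NumPy arrays.
    return pytesseract.image_to_string(crop_header_rgb(cv_image, height))
```

With the question's variables this would be called as ocr_header(cv_h[0]). If your pytesseract version rejects arrays, wrapping the result of crop_header_rgb() in Image.fromarray() as shown above should behave the same.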

If you really don't want to use PIL:

See: https://github.com/madmaze/pytesseract/blob/master/src/pytesseract.py

pytesseract is a thin wrapper that runs the tesseract command. If you look at the run_and_get_output() function, you'll see that it saves your image to a temporary file and then passes that file's path to tesseract.

Hence, you can do the same with OpenCV: rewrite pytesseract's single .py file so that it saves the image with OpenCV instead. That said, I don't see any performance improvement in doing so.
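
As a rough illustration of that idea, here is a minimal sketch that skips both PIL and pytesseract: it writes the OpenCV array to a temporary PNG with cv2.imwrite() and calls the tesseract command-line tool on it. This is not pytesseract's actual code; build_tesseract_command and ocr_cv2_image are hypothetical names, and the sketch assumes the tesseract binary is on your PATH and Python 3.7+.

```python
import os
import subprocess
import tempfile


def build_tesseract_command(image_path, lang="eng"):
    """Build the CLI call; the 'stdout' output base makes tesseract print the text."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_cv2_image(cv_image, lang="eng"):
    """OCR an OpenCV (NumPy) image by round-tripping through a temporary PNG."""
    import cv2  # imported here so the command builder works without OpenCV

    fd, tmp_path = tempfile.mkstemp(suffix=".png")
    os.close(fd)  # cv2.imwrite opens the path itself
    try:
        if not cv2.imwrite(tmp_path, cv_image):
            raise IOError(f"could not write {tmp_path}")
        result = subprocess.run(
            build_tesseract_command(tmp_path, lang),
            capture_output=True,
            text=True,
            check=True,  # raise if tesseract exits with an error
        )
        return result.stdout
    finally:
        os.remove(tmp_path)  # always clean up the temporary file
```

With the question's variables, ocr_cv2_image(img_header) would return the recognized text. The image still touches the disk, just as it does inside pytesseract, which is why this route is unlikely to be any faster.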

a-sam
  • I was actually hoping to skip the use of PIL altogether –  May 12 '20 at 20:27
  • OK, please check the edit. If you don't know how to change pytesseract's module to work with OpenCV or something else, please comment; I would be happy to help. – a-sam May 13 '20 at 12:56
0

Image.fromarray lets you turn the NumPy array into a PIL image and pass it to tesseract without saving it to disk yourself, but you should also make sure you don't pass a list of PIL images to tesseract. convert_from_path returns a list of PIL images, one per page of the PDF, so you need to send each page to tesseract individually.

import pytesseract
from pdf2image import convert_from_path
from PIL import Image
import cv2, numpy

def pil_to_cv2(image):
    open_cv_image = numpy.array(image)
    return open_cv_image[:, :, ::-1].copy()

path = 'OriginalsFile.pdf'  # same file as in the question
doc = convert_from_path(path)

# OCR the header strip of each page, one page at a time
for page_number, page_data in enumerate(doc):
    cv_h = pil_to_cv2(page_data)
    img_header = cv_h[:160, :]
    print(f"{page_number} - {pytesseract.image_to_string(Image.fromarray(img_header))}")

Lambo