Python - OCR - pytesseract for PDF

Question

I am trying to run the following code:

import cv2
import pytesseract

img = cv2.imread('/Users/user1/Desktop/folder1/pdf1.pdf')
text = pytesseract.image_to_string(img)
print(text)

which gives me the following error:

Traceback (most recent call last):
  File "/Users/user1/PycharmProjects/project1/python_file.py", line 5, in <module>
    text = pytesseract.image_to_string(img)
  File "/Users/user1/PycharmProjects/project1/venv/lib/python3.8/site-packages/pytesseract/pytesseract.py", line 346, in image_to_string
    return {
  File "/Users/user1/PycharmProjects/project1/venv/lib/python3.8/site-packages/pytesseract/pytesseract.py", line 349, in <lambda>
    Output.STRING: lambda: run_and_get_output(*args),
  File "/Users/user1/PycharmProjects/project1/venv/lib/python3.8/site-packages/pytesseract/pytesseract.py", line 249, in run_and_get_output
    with save(image) as (temp_name, input_filename):
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/Users/user1/PycharmProjects/project1/venv/lib/python3.8/site-packages/pytesseract/pytesseract.py", line 172, in save
    image, extension = prepare(image)
  File "/Users/user1/PycharmProjects/project1/venv/lib/python3.8/site-packages/pytesseract/pytesseract.py", line 142, in prepare
    raise TypeError('Unsupported image object')
TypeError: Unsupported image object

How can I make it work for a PDF file?

score 25 · Answer 1 · answered Mar 19 '20 at 10:29

This worked for me:

import os
from pdf2image import convert_from_path
import pytesseract

filePath = '/Users/user1/Desktop/folder1/pdf1.pdf'
# convert_from_path returns one PIL image per page of the PDF
doc = convert_from_path(filePath)
path, fileName = os.path.split(filePath)
fileBaseName, fileExtension = os.path.splitext(fileName)

for page_number, page_data in enumerate(doc):
    # page_data is already a PIL image, so pass it to pytesseract directly
    txt = pytesseract.image_to_string(page_data).encode("utf-8")
    print("Page # {} - {}".format(str(page_number),txt))

answered Mar 19 '20 at 10:29

Lambo

I get the error: a bytes-like object is required, not 'PpmImageFile' – Sheetal Mangesh Pandrekar Apr 23 '21 at 08:41

If you get the error "a bytes-like object is required, not 'PpmImageFile'", change the second-to-last line to: `txt = pytesseract.image_to_string(page_data).encode("utf-8")` – jboi May 24 '21 at 11:03

If anyone gets the error AttributeError: __array_interface__, first convert page_data to a NumPy array – Amarnath Dec 27 '22 at 19:46
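
The approach from the answer and the comments can be wrapped in a small helper. This is only a minimal sketch, assuming pdf2image (with poppler) and pytesseract (with the Tesseract binary) are installed. The name ocr_pdf and its convert/ocr parameters are hypothetical, introduced here so the function can be tried with stand-ins when those packages are missing.

```python
from typing import Any, Callable, Iterable, List, Optional


def ocr_pdf(
    pdf_path: str,
    convert: Optional[Callable[[str], Iterable[Any]]] = None,
    ocr: Optional[Callable[[Any], str]] = None,
) -> List[str]:
    """Return the OCR text of each page of a PDF, in page order.

    `convert` and `ocr` are hypothetical hooks; by default they are
    pdf2image.convert_from_path and pytesseract.image_to_string.
    """
    # Import lazily so the real libraries are only needed when no stand-ins are passed.
    if convert is None:
        from pdf2image import convert_from_path as convert
    if ocr is None:
        from pytesseract import image_to_string as ocr
    # Each converted page is already an image object, so it goes to the OCR call unchanged.
    return [ocr(page) for page in convert(pdf_path)]
```

Calling ocr_pdf('/Users/user1/Desktop/folder1/pdf1.pdf') should return one string per page, which can then be printed in a loop as in the answer.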
