OCR a multipage PDF in Python

Question

I am using pytesseract to run OCR on images. I have statement PDFs that are 3-4 pages long. I need a way to convert them into multiple .jpg/.png images and run OCR on these images one by one. As of now, I convert a single page to an image and then run

text=str(pytesseract.image_to_string(Image.open("imagename.jpg"),lang='eng'))

after which I use regex to extract information and create a dataframe. The regex logic is the same for all pages. If I can read the image files in a loop, the process can be automated for any PDF that arrives in the same format.

liamsuma · Accepted Answer · 2020-06-19T13:42:00.837

PyMuPDF is another option: it lets you loop over the pages of a PDF and render each one as an image. Here is how you can achieve this:

import fitz
from PIL import Image
import pytesseract 

input_file = 'path/to/your/pdf/file'
pdf_file = input_file
fullText = ""

doc = fitz.open(pdf_file) # open pdf files using fitz bindings 
### ---- If you need to scale a scanned image --- ###
zoom = 1.2 # scale your pdf file by 120%
mat = fitz.Matrix(zoom, zoom)
noOfPages = doc.pageCount 

for pageNo in range(noOfPages):
    page = doc.loadPage(pageNo) # load the page at this zero-based index
    pix = page.getPixmap(matrix = mat) # render the page, scaled by the zoom matrix
    output = '/path/to/save/image/files' + str(pageNo) + '.png' # writePNG produces PNG data, so use a .png extension
    pix.writePNG(output) # save the rendered page so it can be opened for OCR below

    text = pytesseract.image_to_string(Image.open(output)) # image_to_string already returns a str
    fullText += text

fullText = fullText.splitlines() # or do something here to extract information using regex

It's very handy, depending on what you want to do with PDF files. For more detailed information about PyMuPDF, these links might be helpful: the PyMuPDF tutorial and the PyMuPDF Git repository.

Hope this helps.

EDIT: A more straightforward way of doing this with PyMuPDF, if your PDF files have a clean format (text rather than scanned images), is to read the extracted text directly. After page = doc.loadPage(pageNo), the following is sufficient:

blocks = page.getText("blocks")
blocks.sort(key=lambda block: block[3])  # sort by 'y1' values

for block in blocks:
    print(block[4])  # print the lines of this block

Disclaimer: The idea of using blocks came from the repo maintainer. More details can be found in the issue discussion on Git.
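
As a rough sketch of the block-sorting idea, the helper below orders blocks top to bottom and then left to right, and drops image blocks. It assumes each block is a tuple of the form (x0, y0, x1, y1, text, block_no, block_type) with block_type 0 for text, which is how the PyMuPDF documentation describes the "blocks" output. The function name is hypothetical, and sorting by (y0, x0) is a variation on the y1 sort shown above, not the maintainer's exact approach.

```python
def blocks_in_reading_order(blocks):
    """Return the text of each text block, top-to-bottom then left-to-right.

    Assumes each block is (x0, y0, x1, y1, text, block_no, block_type).
    """
    # Keep only text blocks; block_type 1 would be an image block.
    text_blocks = [block for block in blocks if block[6] == 0]
    # Sort by the top edge (y0) first, then by the left edge (x0).
    text_blocks.sort(key=lambda block: (block[1], block[0]))
    return [block[4].strip() for block in text_blocks]
```

With a page already loaded, this could be called as `blocks_in_reading_order(page.getText("blocks"))`.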

Thank you!! Will surely try. For me the following worked: imagefile = Image.open("test.tiff"); for frame in range(0, imagefile.n_frames): imagefile.seek(frame); text += str(pytesseract.image_to_string(imagefile)) — Subhojyoti Lahiri, Jun 19 '20 at 12:35
Thanks for accepting the answer. Your solution is also very interesting. I have never used the wand module before but will look into it. The zoom factor is most useful if you have a scanned image. If it's just a clean PDF file, using `page.getText(option)` is sufficient. I will update the answer. — liamsuma, Jun 19 '20 at 13:27
is there a way to do this without saving and loading the image? It seems wasteful — CaptainCodeman, Jan 12 '21 at 14:32
@CaptainCodeman I believe what you are looking for is reading files as a stream, and I think the package provides a way to do that. Please see this link: https://pymupdf.readthedocs.io/en/latest/document.html — liamsuma, Jan 12 '21 at 21:56
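
One possible way to avoid the save-and-reload step raised in the comments is to keep each rendered page in memory as PNG bytes and hand those to Pillow through a BytesIO buffer. This is a minimal sketch, not the answerer's method: the function names are hypothetical, and it assumes a PyMuPDF release that provides `Pixmap.tobytes("png")` and the snake_case `get_pixmap` method.

```python
import io


def render_pages_as_png(pdf_path, zoom=1.2):
    """Yield each page of the PDF as PNG bytes, without writing to disk."""
    import fitz  # PyMuPDF, imported lazily so ocr_pdf can be tried with stand-ins

    matrix = fitz.Matrix(zoom, zoom)
    with fitz.open(pdf_path) as doc:
        for page in doc:
            yield page.get_pixmap(matrix=matrix).tobytes("png")


def ocr_png(png_bytes, lang="eng"):
    """Run Tesseract on PNG bytes held in memory."""
    import pytesseract
    from PIL import Image

    with Image.open(io.BytesIO(png_bytes)) as img:
        return pytesseract.image_to_string(img, lang=lang)


def ocr_pdf(pdf_path, render=render_pages_as_png, ocr=ocr_png):
    """Return a list with one OCR text string per page."""
    return [ocr(png) for png in render(pdf_path)]
```

Calling `ocr_pdf('path/to/your/pdf/file.pdf')` would return the page texts in order. The `render` and `ocr` parameters exist only so either step can be swapped out or tested in isolation.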

score 0 · Answer 2 · edited Jan 25 '23 at 12:47

The answer from liamsuma seems to use deprecated PyMuPDF method names.

This worked for me (Python 3.9):

import fitz
from PIL import Image
import pytesseract  # the Tesseract executable should be on your PATH

input_file = 'path/to/your/pdf/file.pdf'
full_text = ""
zoom = 1.2 

with fitz.open(input_file) as doc:
    mat = fitz.Matrix(zoom, zoom)
    for page in doc:
        pix = page.get_pixmap(matrix=mat)
        output = f'/path/to/save/image/files/{page.number}.jpg'
        pix.save(output)
        res = str(pytesseract.image_to_string(Image.open(output)))
        full_text += res

full_text = full_text.splitlines()
print(full_text)
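
To connect this with the regex step described in the question, one option is to keep the text of each page separate and apply the same pattern to every page, collecting one row per match. The sketch below uses only the standard library. The pattern (a date, a description, and an amount on one line) and the function name are purely hypothetical, since the question does not show the real statement layout.

```python
import re

# Hypothetical line layout: "DD/MM/YYYY  description  amount"
LINE_PATTERN = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4})\s+(?P<description>.+?)\s+(?P<amount>-?\d[\d,]*\.\d{2})$"
)


def extract_rows(page_texts):
    """Apply the same regex to every page and return a list of row dicts."""
    rows = []
    for page_number, text in enumerate(page_texts, start=1):
        for line in text.splitlines():
            match = LINE_PATTERN.match(line.strip())
            if match:
                # Record which page the row came from alongside the captured fields.
                rows.append({"page": page_number, **match.groupdict()})
    return rows
```

The resulting list of dicts can be passed to `pandas.DataFrame(rows)` to build the dataframe. To use it with the loop above, append each `res` to a list instead of concatenating into one string.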

Subhojyoti Lahiri · Answer 3 · 2020-06-17T13:20:16.633

For me the following works:

from wand.api import library
from wand.image import Image
with Image(filename=r"imagepath.pdf", resolution=300) as img:  # read the PDF at 300 DPI
    # step through every page in the image sequence
    library.MagickResetIterator(img.wand)
    for idx in range(library.MagickGetNumberImages(img.wand)):
        library.MagickSetIteratorIndex(img.wand, idx)

    img.save(filename="output.tiff")  # save the pages to a TIFF file

Now the problem is to read each page in the TIFF file, because if I extract the text as

text=str(pytesseract.image_to_string(Image.open("test.tiff"),lang='eng'))

it will OCR only the first page.
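
A minimal sketch of reading every page of a multi-frame TIFF, following the `n_frames` / `seek` approach mentioned in the comments on the accepted answer. The helper name is hypothetical. It assumes the image object behaves like a Pillow image (it exposes `n_frames` and `seek`), and takes the OCR function as a parameter so it is not tied to one library.

```python
def ocr_all_frames(image, ocr):
    """Run `ocr` on every frame of a multi-frame image; return one text per frame.

    `image` is expected to expose `n_frames` and `seek(index)` like a Pillow image.
    `ocr` is any callable that takes the image and returns a string.
    """
    texts = []
    # Single-frame images may not define n_frames, so fall back to one frame.
    for index in range(getattr(image, "n_frames", 1)):
        image.seek(index)  # move to the frame (page) at this index
        texts.append(ocr(image))
    return texts
```

With Pillow and pytesseract installed, a call might look like `ocr_all_frames(Image.open("output.tiff"), pytesseract.image_to_string)`, where `Image` is `PIL.Image` and not the wand `Image` class imported above.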
