Speed performance of pytesseract on PDFs (compared to an existing PDF OCR library in Python)

Question

I am a beginner in OCR projects and am currently looking into different ways to extract OCR'd text from PDFs in Python.

One simple and popular way seems to be the pytesseract library, which requires converting the PDF pages into PNG/JPG images first. I have also tried libraries that provide built-in PDF OCR features, such as PyMuPDF (fitz).

Surprisingly, I found that PyMuPDF is much faster (about 2 times) than pytesseract, even though it also uses the Tesseract engine for the OCR task: Doc. Without a detailed inspection of the library's code (due to its complexity), I am not sure what the main reason for the big difference is. My guess is that it is related to the image input format (since Tesseract uses the Leptonica library to handle input images)?

I prefer using the pytesseract library because it allows preprocessing and confidence-level thresholding, and I believe there should be ways to improve its performance further. Can anyone suggest ways to speed up my pytesseract code?
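To make the confidence-level thresholding concrete, here is a minimal sketch. It assumes the dict-shaped result of pt.image_to_data(..., output_type=pt.Output.DICT), which holds parallel "text" and "conf" lists; the function name, the sample data, and the threshold of 60 are only illustrative.

```python
def filter_by_confidence(data, min_conf=60.0):
    """Return the recognized words whose confidence is at least min_conf."""
    words = []
    for text, conf in zip(data["text"], data["conf"]):
        # conf may be an int, float, or str depending on the pytesseract version;
        # rows with conf == -1 are layout entries (block, paragraph, line), not words.
        if text.strip() and float(conf) >= min_conf:
            words.append(text)
    return words


# Hypothetical sample shaped like the image_to_data dict output.
sample = {"text": ["", "Hello", "w0rld", "OCR"], "conf": ["-1", 96, 41.5, "88"]}
print(filter_by_confidence(sample))
```

This filtering runs on the already-returned data, so it should add very little to the OCR time itself.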

My code for speed testing:

import pdf2image
import io
import time
import cv2
import numpy as np
import pytesseract as pt
import fitz
import os

directory = "../testpdf"
FILE_LIST = [
    os.path.join(directory, f) for f in os.listdir(directory) if f.endswith(".pdf")
]

fitz_time = []
pt_time = []


def fitz_ocr(file_path):
    """OCR every page of the PDF with PyMuPDF's built-in Tesseract support."""
    doc = fitz.open(file_path)

    for page in doc:
        tp = page.get_textpage_ocr(
            flags=0,
            full=True,
            dpi=300,
        )
        # The result is discarded; only the elapsed time matters here.
        page_dict = page.get_text("dict", textpage=tp)


def pt_ocr(file_path):
    """Render every page with pdf2image, then OCR it with pytesseract."""
    pages = pdf2image.convert_from_path(file_path, dpi=300, grayscale=True)
    for page in pages:
        # Round-trip through an in-memory PNG to obtain an OpenCV (BGR) image.
        in_mem_file = io.BytesIO()
        page.save(in_mem_file, format="png")
        in_mem_file.seek(0)
        img_origin = cv2.imdecode(np.frombuffer(in_mem_file.read(), np.uint8), 1)

        data = pt.image_to_data(
            img_origin,
            config=r"-l eng --psm 6",
        )


for file_path in FILE_LIST:
    # Time the PyMuPDF path.
    st = time.time()
    fitz_ocr(file_path)
    done_time = time.time() - st
    print(f"fitz: {done_time}", end=" ")
    fitz_time.append(done_time)

    # Time the pdf2image + pytesseract path on the same file.
    st = time.time()
    pt_ocr(file_path)
    done_time = time.time() - st
    pt_time.append(done_time)

    print(f"pt: {done_time}")

print(f"avg fitz: {sum(fitz_time)/len(fitz_time)}, avg pt: {sum(pt_time)/len(pt_time)}")

and the results on my local machine (times in seconds):

fitz: 1.113755464553833 pt: 2.4535179138183594
fitz: 6.783350229263306 pt: 18.1472225189209
fitz: 1.1973145008087158 pt: 2.1595921516418457
fitz: 1.1768627166748047 pt: 2.162658452987671
fitz: 1.1746160984039307 pt: 2.0023140907287598
fitz: 3.0561563968658447 pt: 6.202923536300659
fitz: 1.1177668571472168 pt: 2.0603621006011963
fitz: 1.3792881965637207 pt: 2.8750576972961426
fitz: 3.2603485584259033 pt: 7.149296760559082
fitz: 0.8049216270446777 pt: 1.6897962093353271

avg fitz: 2.1064380645751952, avg pt: 4.6902741432189945
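To check the "about 2 times faster" figure against these numbers, the timings can be reduced to per-file and overall ratios. This is only a small sketch: the measurements are copied from the results above and rounded to three decimals, and the variable names are illustrative.

```python
# Timings in seconds, copied from the results above (rounded to 3 decimals).
fitz_times = [1.114, 6.783, 1.197, 1.177, 1.175, 3.056, 1.118, 1.379, 3.260, 0.805]
pt_times = [2.454, 18.147, 2.160, 2.163, 2.002, 6.203, 2.060, 2.875, 7.149, 1.690]

# Per-file slowdown of pytesseract relative to PyMuPDF.
ratios = [pt / fz for pt, fz in zip(pt_times, fitz_times)]

# Ratio of the two averages, matching the "avg" line of the benchmark.
overall = (sum(pt_times) / len(pt_times)) / (sum(fitz_times) / len(fitz_times))

print(f"per-file ratio: min {min(ratios):.2f}, max {max(ratios):.2f}")
print(f"ratio of averages: {overall:.2f}")
```

The ratio of averages comes out a little above 2, and the per-file ratios vary noticeably between files.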
