I am a beginner with OCR projects and am currently looking into different ways to get OCR-ed text out of PDF files in Python.
One simple and popular way seems to be the pytesseract library, after first converting the PDF file to PNG/JPG. I have also tried libraries that provide PDF OCR features directly, such as PyMuPDF (fitz).
Surprisingly, I found that PyMuPDF is much faster (about 2 times) than pytesseract, even though according to its documentation it also uses the Tesseract engine for the OCR task. I have not inspected the library's code in detail because of its complexity, so I am not sure what the main cause of this difference is. My guess is that it is related to the image input format (since Tesseract uses the Leptonica library to handle input images)?
I prefer the pytesseract library because it allows preprocessing and confidence-level thresholding, and I believe there should be ways to improve its performance further. Can anyone suggest a way to speed up my pytesseract code?
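To make the confidence-threshold point concrete, here is a minimal sketch of the kind of filtering I mean. It assumes the data comes from `pt.image_to_data(..., output_type=pt.Output.DICT)`, which returns parallel lists under keys such as `"text"` and `"conf"`. The function name `filter_by_confidence`, the threshold of 60, and the sample dictionary are made up for illustration; the sample is hand-written, not real OCR output.

```python
def filter_by_confidence(data, min_conf=60.0):
    """Keep only the words whose confidence is at least min_conf."""
    words = []
    for text, conf in zip(data["text"], data["conf"]):
        # conf may arrive as int, float, or str depending on the
        # pytesseract version; rows that are not words carry -1.
        try:
            conf_value = float(conf)
        except (TypeError, ValueError):
            continue
        if conf_value >= min_conf and text.strip():
            words.append(text)
    return words


# Hand-written sample shaped like the DICT output (hypothetical values).
sample = {
    "text": ["", "Invoice", "t0tal", "42"],
    "conf": ["-1", 96.4, 31.0, 88],
}
kept = filter_by_confidence(sample)
print(kept)
```

As far as I can tell, PyMuPDF's `get_text("dict")` output does not expose a per-word confidence like this, which is why I would like to stay with pytesseract.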
My code for speed testing:
import io
import os
import time

import cv2
import fitz  # PyMuPDF
import numpy as np
import pdf2image
import pytesseract as pt

directory = "../testpdf"
FILE_LIST = [
    os.path.join(directory, f) for f in os.listdir(directory) if f.endswith(".pdf")
]
fitz_time = []
pt_time = []


def fitz_ocr(file_path):
    # OCR every page through PyMuPDF's built-in Tesseract integration.
    doc = fitz.open(file_path)
    for page_index, page in enumerate(doc):
        tp = page.get_textpage_ocr(
            flags=0,
            full=True,
            dpi=300,
        )
        page_dict = page.get_text("dict", textpage=tp)


def pt_ocr(file_path):
    # Render each page with pdf2image, then OCR it with pytesseract.
    pages = pdf2image.convert_from_path(file_path, dpi=300, grayscale=True)
    for page_index, page in enumerate(pages):
        # Round-trip through an in-memory PNG to get an OpenCV (BGR) array.
        in_mem_file = io.BytesIO()
        page.save(in_mem_file, format="png")
        in_mem_file.seek(0)
        img_origin = cv2.imdecode(np.frombuffer(in_mem_file.read(), np.uint8), 1)
        text = pt.image_to_data(
            img_origin,
            config=r"-l eng --psm 6",
        )


# Time both approaches on each file, then print the averages.
for file_path in FILE_LIST:
    st = time.time()
    fitz_ocr(file_path)
    done_time = time.time() - st
    print(f"fitz: {done_time}", end=" ")
    fitz_time.append(done_time)

    st = time.time()
    pt_ocr(file_path)
    done_time = time.time() - st
    pt_time.append(done_time)
    print(f"pt: {done_time}")

print(f"avg fitz: {sum(fitz_time)/len(fitz_time)}, avg pt: {sum(pt_time)/len(pt_time)}")
and the results on my local machine (seconds per file):
fitz: 1.113755464553833 pt: 2.4535179138183594
fitz: 6.783350229263306 pt: 18.1472225189209
fitz: 1.1973145008087158 pt: 2.1595921516418457
fitz: 1.1768627166748047 pt: 2.162658452987671
fitz: 1.1746160984039307 pt: 2.0023140907287598
fitz: 3.0561563968658447 pt: 6.202923536300659
fitz: 1.1177668571472168 pt: 2.0603621006011963
fitz: 1.3792881965637207 pt: 2.8750576972961426
fitz: 3.2603485584259033 pt: 7.149296760559082
fitz: 0.8049216270446777 pt: 1.6897962093353271
avg fitz: 2.1064380645751952, avg pt: 4.6902741432189945
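To put a number on the "about 2 times" claim, the timings above can be turned into per-file speedup ratios. This is only a small arithmetic sketch: the values are copied from the output above, and the variable names (`fitz_times`, `pt_times`, `ratios`, `overall`) are just for illustration.

```python
# Per-file timings in seconds, copied from the benchmark output above.
fitz_times = [
    1.113755464553833, 6.783350229263306, 1.1973145008087158,
    1.1768627166748047, 1.1746160984039307, 3.0561563968658447,
    1.1177668571472168, 1.3792881965637207, 3.2603485584259033,
    0.8049216270446777,
]
pt_times = [
    2.4535179138183594, 18.1472225189209, 2.1595921516418457,
    2.162658452987671, 2.0023140907287598, 6.202923536300659,
    2.0603621006011963, 2.8750576972961426, 7.149296760559082,
    1.6897962093353271,
]

# How many times slower pytesseract was than PyMuPDF on each file.
ratios = [p / f for f, p in zip(fitz_times, pt_times)]

# Ratio of total times, which equals the ratio of the two averages.
overall = sum(pt_times) / sum(fitz_times)

print(f"per-file ratio: min {min(ratios):.2f}, max {max(ratios):.2f}")
print(f"overall ratio: {overall:.2f}")
```

On these ten files the pytesseract path is roughly 1.7 to 2.7 times slower per file, and a little over 2.2 times slower overall.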