I'm comparing OCR tools in Python for converting PDFs to text. I've been using pdf2image together with pytesseract and EasyOCR to write the results to .txt files. Both are slow: pytesseract takes around 3-4 seconds per page and EasyOCR takes about 44 seconds per page. These are my imports and the pytesseract code:
from PIL import Image
import pytesseract
from pdf2image import convert_from_path
import os
import time
import numpy
import easyocr
pytesseract.pytesseract.tesseract_cmd = r'/opt/homebrew/bin/tesseract'
pdf_path = r"example1.pdf"
pages = convert_from_path(pdf_path)
with open('pytesseract.txt', 'w') as f:
    for page in pages:
        f.write(pytesseract.image_to_string(page))
For EasyOCR:
reader = easyocr.Reader(['en'])
for i in range(1):  # only the first page
    result = reader.readtext(numpy.array(pages[i]), detail=0)
    print(result)
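Per-page timings like the ones quoted can be measured with something like the sketch below. It is only a minimal illustration: `time_per_page` and `summarize` are hypothetical helper names, not part of pytesseract or EasyOCR, and the helper assumes the OCR call takes a single page as its argument.

```python
import time
from statistics import mean


def time_per_page(ocr_fn, pages):
    """Return the wall-clock seconds ocr_fn takes on each page.

    ocr_fn is any callable that accepts one page (hypothetical helper,
    not a function from pytesseract or EasyOCR).
    """
    durations = []
    for page in pages:
        start = time.perf_counter()
        ocr_fn(page)  # the OCR result is discarded; only the duration is kept
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Format a one-line summary of the measured durations."""
    if not durations:
        return "0 page(s)"
    return f"{len(durations)} page(s), mean {mean(durations):.2f} s/page"
```

With the variables from the snippets above, this could be called as `time_per_page(pytesseract.image_to_string, pages)` or `time_per_page(lambda p: reader.readtext(numpy.array(p), detail=0), pages[:1])`. Timing the `easyocr.Reader(['en'])` call separately would help show how much of the delay is model loading rather than recognition.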
I'm not sure whether this is a problem with my code, a CPU or GPU issue, or something else. I'm using a MacBook Pro with an M1 Pro chip, in case that's relevant.