I'm comparing OCR tools in Python for converting PDFs to text. I've been using pdf2image together with pytesseract and easyOCR to convert the pages to .txt files. Both take a while: pytesseract around 3-4 seconds per page, and easyOCR about 44 seconds per page. These are my imports:

from PIL import Image
import pytesseract
import easyocr
from pdf2image import convert_from_path
import os
import time
import numpy

# Point pytesseract at the Homebrew-installed Tesseract binary
pytesseract.pytesseract.tesseract_cmd = r'/opt/homebrew/bin/tesseract'

pdf_path = r"example1.pdf"
# Rasterise every page of the PDF into a PIL image
pages = convert_from_path(pdf_path)

with open('pytesseract.txt', 'w') as f:
    for page in pages:
        f.write(pytesseract.image_to_string(page))

For easyOCR:

# easyOCR wants a numpy array rather than a PIL image;
# only the first page is OCRed here, for timing purposes
reader = easyocr.Reader(['en'])
result = reader.readtext(numpy.array(pages[0]), detail=0)
print(result)
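
For reference, here is a minimal sketch of how the per-page timings can be measured with time.perf_counter (the time import above is otherwise unused); it reuses pages and reader from the snippets above:

# Time both engines on every page, reusing `pages` and `reader`
for i, page in enumerate(pages):
    t0 = time.perf_counter()
    pytesseract.image_to_string(page)
    t1 = time.perf_counter()
    reader.readtext(numpy.array(page), detail=0)
    t2 = time.perf_counter()
    print(f"page {i}: pytesseract {t1 - t0:.1f}s, easyOCR {t2 - t1:.1f}s")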

I'm not sure whether this is a code problem, a CPU/GPU problem, or something else. I'm on a MacBook Pro with an M1 Pro, if that helps.

  • You can try lowering the resolution of the input images. That might still give you good results (maybe sacrificing only a bit of recognition quality) while being much faster; see the first sketch after these comments. – dankal444 Aug 03 '23 at 09:56
  • 3-4 seconds does not seem like much, considering that Tesseract was long regarded as a good OCR engine (not so much recently, AFAIK) and that, AFAIK, it only runs on the CPU. Not to mention the PDF first needs to be rasterised (pretty slow, depending on the engine used). A GPU-based OCR should be faster; see the second sketch below. There are certainly faster OCRs, but also less accurate ones (there is a trade-off between the two). Is it fine for you to reduce accuracy for better performance? – Jérôme Richard Aug 03 '23 at 18:42
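
Following dankal444's suggestion, here is a minimal sketch of lowering the rasterisation resolution via pdf2image's dpi parameter (200 is the library default; 150 is just a guess to tune):

# First sketch: rasterise at a lower DPI to speed up both
# rasterisation and OCR; tune 150 against recognition quality
pages_lowres = convert_from_path(pdf_path, dpi=150)
print(pytesseract.image_to_string(pages_lowres[0]))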
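
On Jérôme Richard's GPU point: easyocr.Reader accepts a gpu flag, but whether it can actually use the M1's GPU depends on the installed easyocr/PyTorch builds (MPS support is an assumption here, not something I've verified):

# Second sketch: ask easyOCR for a GPU; it falls back to the CPU
# if no supported device is found (M1/MPS support is an assumption
# that depends on the easyocr and PyTorch versions installed)
reader_gpu = easyocr.Reader(['en'], gpu=True)
print(reader_gpu.readtext(numpy.array(pages[0]), detail=0))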
