Pytesseract image_to_pdf_or_hocr function not working in AWS Lambda

Question

I am using this repository to deploy tesseract as a lambda layer: https://github.com/bweigel/aws-lambda-tesseract-layer

The deployment works well, and other pytesseract functions such as image_to_string and image_to_data also work without any hiccups.

But, when I try to use image_to_pdf_or_hocr like this:

pdf = pytesseract.image_to_pdf_or_hocr(f'/tmp/{file_name}/{page.number}.png', extension='pdf')

it does not work and throws an error like this:

Traceback (most recent call last):
  File "/var/task/helpers/ocr_helper.py", line 36, in save_searchable_pdf
    f'/tmp/{file_name}/{page.number}.png', extension='pdf')
  File "/var/task/pytesseract/pytesseract.py", line 432, in image_to_pdf_or_hocr
    return run_and_get_output(*args)
  File "/var/task/pytesseract/pytesseract.py", line 289, in run_and_get_output
    with open(filename, 'rb') as output_file:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tess_6_hu78b0.pdf'

It says that the file tess_6_hu78b0.pdf does not exist. What does this mean? I never created a file named tess_6_hu78b0 to begin with.
The path that I am passing to image_to_pdf_or_hocr is definitely correct and an image is present there. I have confirmed this, and the same code works on my local machine.

I have tried:

I read somewhere that I also needed to install libtesseract-dev, so I modified my Dockerfile as follows:

FROM lambci/lambda:build-python3.6
RUN sudo apt install tesseract-ocr
RUN sudo apt install libtesseract-dev

Unfortunately, this did not work either.

score 3 · Answer 1 · answered Mar 16 '21 at 13:33

After 18 hours of hard work, I was finally able to figure it out.

It turns out that https://github.com/bweigel/aws-lambda-tesseract-layer does not bundle all the files that pytesseract.image_to_pdf_or_hocr() needs in order to run.

So I built Leptonica and Tesseract from source and generated the following:

configs folder
tessconfigs folder and
pdf.ttf file

These required files are available here: https://github.com/prameshbajra/tessdata

In https://github.com/bweigel/aws-lambda-tesseract-layer, the ready-to-use folder contains a directory named amazonlinux-1, and inside it there is a folder named tesseract/share/tessdata. All you need to do is paste the files listed above into that tessdata directory.

Alternatively, just download the tessdata repo linked above and use it to replace that tessdata folder.
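To confirm that the files actually ended up in the deployed layer, a small check like the one below can be run inside the Lambda before calling pytesseract. This is only a sketch: the helper name missing_pdf_assets is made up for illustration, configs/pdf is assumed from the standard Tesseract tessdata layout (the font file ships there as pdf.ttf), and the fallback path /opt/tesseract/share/tessdata is a guess at where the layer gets mounted.

```python
import os
from pathlib import Path

# Entries that PDF output appears to depend on. "configs/pdf" is an assumption
# based on the usual Tesseract tessdata layout, not something stated above.
REQUIRED_PDF_ASSETS = ("pdf.ttf", "configs/pdf", "tessconfigs")


def missing_pdf_assets(tessdata_dir):
    """Return the required entries that are absent from tessdata_dir."""
    root = Path(tessdata_dir)
    return [name for name in REQUIRED_PDF_ASSETS if not (root / name).exists()]


if __name__ == "__main__":
    # Assumption: TESSDATA_PREFIX, if set, points at the tessdata directory itself.
    # The fallback path is a guess at the layer's mount point under /opt.
    tessdata = os.environ.get("TESSDATA_PREFIX", "/opt/tesseract/share/tessdata")
    print("missing:", missing_pdf_assets(tessdata) or "nothing")
```

If the list is non-empty, the FileNotFoundError from the question is the likely outcome, because pytesseract tries to read an output file that tesseract never wrote.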

Note: This tessdata was built with Tesseract 4.1.1.

I hope this helps future readers. Happy coding.

Thanks to Benjamin Genz (@bweigel) for publishing this repo. You made our lives easier.

After searching a lot for this, I realised that when installing tesseract from Conda in Python, those files aren't installed either, therefore no PDFs are generated. Thanks! — Jose Vega, Sep 19 '21 at 18:27

score 0 · Answer 2 · answered Sep 02 '22 at 08:48

Adding this config argument fixed it for me, inspired by this solution :)

pytesseract.image_to_pdf_or_hocr("Image.png", extension="pdf", config = " -c tessedit_create_pdf=1")
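For completeness, here is a rough sketch of how that call could be wrapped to write the result to disk, since image_to_pdf_or_hocr returns the PDF as bytes. The function name write_pdf and the ocr parameter are made up for illustration (ocr only exists so the function can be exercised without Tesseract installed). My understanding, which I have not verified against the Tesseract source, is that the flag sets the same variable the pdf config file would normally set.

```python
from pathlib import Path


def write_pdf(image_path, pdf_path, ocr=None):
    """OCR image_path into a PDF at pdf_path and return the output Path."""
    if ocr is None:
        # Imported lazily so this module loads even where pytesseract is absent.
        import pytesseract
        ocr = pytesseract.image_to_pdf_or_hocr
    # The config string is the one from this answer; the call returns raw PDF bytes.
    pdf_bytes = ocr(str(image_path), extension="pdf",
                    config="-c tessedit_create_pdf=1")
    out = Path(pdf_path)
    out.write_bytes(pdf_bytes)
    return out


# Example (paths are placeholders; Lambda only allows writes under /tmp):
# write_pdf("/tmp/page.png", "/tmp/page.pdf")
```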

Answered by olly.
