I am using this repository to deploy tesseract as a lambda layer: https://github.com/bweigel/aws-lambda-tesseract-layer
The deployment works well, and other pytesseract functions such as image_to_string and image_to_data work without any hiccups.
But when I try to use image_to_pdf_or_hocr like this:
pdf = pytesseract.image_to_pdf_or_hocr(f'/tmp/{file_name}/{page.number}.png', extension='pdf')
it fails with the following error:
Traceback (most recent call last):
  File "/var/task/helpers/ocr_helper.py", line 36, in save_searchable_pdf
    f'/tmp/{file_name}/{page.number}.png', extension='pdf')
  File "/var/task/pytesseract/pytesseract.py", line 432, in image_to_pdf_or_hocr
    return run_and_get_output(*args)
  File "/var/task/pytesseract/pytesseract.py", line 289, in run_and_get_output
    with open(filename, 'rb') as output_file:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tess_6_hu78b0.pdf'
- It says that the file tess_6_hu78b0.pdf does not exist. What does this mean? I never created a file named tess_6_hu78b0 to begin with.
- The path that I am passing to the image_to_pdf_or_hocr function is 100% correct, and an image is present there. I have confirmed this, and the same code works on my local machine.
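Judging from the traceback, the missing file looks like a temporary output that pytesseract expects tesseract to write (run_and_get_output opens it right after the tesseract call), not one of my inputs. To narrow this down, here is a minimal sketch that calls the tesseract binary directly, bypassing pytesseract, and reports what it printed and which files it actually produced. It assumes the layer exposes a binary named tesseract on the PATH; the helper names build_command, list_outputs and diagnose are hypothetical and only for illustration.

```python
import glob
import subprocess


def build_command(image_path, out_base, tesseract_cmd="tesseract"):
    """Return the CLI call: tesseract <image> <output base> pdf."""
    # "pdf" is the config name that asks tesseract for PDF output.
    return [tesseract_cmd, image_path, out_base, "pdf"]


def list_outputs(out_base):
    """Return every file that exists for this output base, sorted."""
    # glob.escape keeps special characters in the path from being
    # treated as wildcards.
    return sorted(glob.glob(glob.escape(out_base) + ".*"))


def diagnose(image_path, out_base, tesseract_cmd="tesseract"):
    """Run tesseract directly and report what it printed and produced."""
    # stdout/stderr are piped explicitly so this also runs on Python 3.6.
    proc = subprocess.run(
        build_command(image_path, out_base, tesseract_cmd),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    return {
        "returncode": proc.returncode,
        "stderr": proc.stderr.decode("utf-8", errors="replace"),
        "outputs": list_outputs(out_base),
    }
```

Calling diagnose with the image path and an output base such as '/tmp/diag' inside the Lambda, then logging the result, should show whether a .pdf file is written at all. If only some other extension shows up in "outputs", or "stderr" contains a message about a config it could not open, that would suggest the problem is in the tesseract build inside the layer rather than in my image path.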
What I have tried: I read somewhere that I also needed to install libtesseract-dev, so I modified my Dockerfile as follows:
FROM lambci/lambda:build-python3.6
RUN sudo apt install tesseract-ocr
RUN sudo apt install libtesseract-dev
Unfortunately, this did not work either.