Problem:

I have tesseract installed on my local machine at /usr/local/Cellar/tesseract/4.1.1/bin/tesseract. Everything worked perfectly until I containerized the app with Docker, where it fails with this error: pytesseract.pytesseract.TesseractNotFoundError: is not installed or it's not in your PATH

What I've tried:

Based on the error message, this is what I've tried:

1). Added /usr/local under file sharing in the Docker Desktop app and mounted that path from the local machine into the container. Still getting the error message (doesn't work).

2). Moved tesseract.exe from where it resides to the current local working directory. Still getting the error message (of course it doesn't work; what was I even thinking back then?).

3). Modified the Dockerfile to install tesseract with its dependencies. Here is the Dockerfile:

FROM python:3.7-alpine
RUN apk update && apk add --no-cache tesseract-ocr
WORKDIR /app
COPY ./requirements.txt ./ 
RUN pip3 install --upgrade pip
# install dependencies 
RUN pip3 install -r requirements.txt
RUN pip3 install --upgrade PyMuPDF
# bundle app source 
COPY . /app

COPY ./ChaseOCR.py /app
COPY ./BankofAmericaOCR.py /app
COPY ./WellsFargoOCR.py /app

EXPOSE 8080

CMD ["python3", "MainBankClass.py"] 

The requirements.txt file also lists the pytesseract and tesseract dependencies. Still getting the error message (doesn't work). I have been stuck on this issue for the past 2 days and am running out of options. Neither this link nor this link worked in my case. Any help is much appreciated. Thanks in advance.

EDIT:

Thanks for Neo's solution. I am testing it now, but the build is running very slowly. I thought it would be better to share the requirements.txt file here in case the remaining issues are unrelated to tesseract.

requirements.txt:

numpy
pandas
opencv-python
Pillow
Image
pytesseract
tesseract
PyMuPDF
python-levenshtein
tabula-py
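Since several of these pip names differ from the module names they install, a quick import check inside the container can help separate tesseract problems from unrelated missing packages. The following is only a minimal sketch: the helper name `missing_modules` is made up for illustration, and the pip-to-module mapping is an assumption based on how these packages are commonly imported (the `Image` and `tesseract` entries are left out because it is not clear which modules they are meant to provide).

```python
import importlib.util
from typing import Dict, List

# Assumed mapping from pip package name to importable module name.
PIP_TO_MODULE: Dict[str, str] = {
    "numpy": "numpy",
    "pandas": "pandas",
    "opencv-python": "cv2",
    "Pillow": "PIL",
    "pytesseract": "pytesseract",
    "PyMuPDF": "fitz",
    "python-levenshtein": "Levenshtein",
    "tabula-py": "tabula",
}


def missing_modules(pip_to_module: Dict[str, str]) -> List[str]:
    """Return the pip names whose module cannot be located in this environment."""
    missing = []
    for pip_name, module_name in pip_to_module.items():
        try:
            spec = importlib.util.find_spec(module_name)
        except (ImportError, ValueError):
            spec = None
        if spec is None:
            missing.append(pip_name)
    return missing


if __name__ == "__main__":
    # Prints the pip names that still need attention; empty list means all were found.
    print(missing_modules(PIP_TO_MODULE))
```

Running this script inside the built image (for example as the container command) lists the packages that did not install cleanly, without touching tesseract at all.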

Local file dir:

testdockerfile
├─ .vscode
│  └─ settings.json
├─ BankofAmericaOCR.py
├─ ChaseOCR.py
├─ Dockerfile
├─ MainBankClass.py
├─ __init__.py
├─ WellsFargoOCR.py
└─ requirements.txt

EDIT 2:

For future reference, in case anyone installs tesseract in Docker and still gets the TesseractNotFound error as I did: if you set pytesseract.pytesseract.tesseract_cmd = r'/path/to/your/tesseract' in order to run the code locally, comment that line out. That local path does not exist inside the container. After that, rebuild the image and run the new image in Docker. It should be fine.
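As an alternative to commenting the line out by hand, the path could be chosen at runtime so the same code runs both locally and in the container. This is a rough sketch, not the approach used in this thread: the helper `resolve_tesseract_cmd` is a hypothetical name, and the default local path is simply the one from the question.

```python
import os
import shutil
from typing import Optional

# Local install path taken from the question; adjust or drop it for your machine.
LOCAL_TESSERACT = "/usr/local/Cellar/tesseract/4.1.1/bin/tesseract"


def resolve_tesseract_cmd(local_path: str = LOCAL_TESSERACT) -> Optional[str]:
    """Return a usable tesseract executable path, or None if none is found."""
    # Prefer whatever is on PATH; in the Docker image this is the installed binary.
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Fall back to the hard-coded local path only if it exists and is executable.
    if os.path.isfile(local_path) and os.access(local_path, os.X_OK):
        return local_path
    return None


# Intended use (requires pytesseract to be installed):
#   import pytesseract
#   cmd = resolve_tesseract_cmd()
#   if cmd:
#       pytesseract.pytesseract.tesseract_cmd = cmd
```

With this in place the image still has to be rebuilt after any code change, but the hard-coded path no longer overrides the binary that the container provides.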

liamsuma

1 Answer

Edit 3:
Some of the Python packages in requirements.txt have other prerequisites. With this Dockerfile, the entire build process completed successfully.

The trickiest part was building OpenCV.
Credits to https://github.com/janza/docker-python3-opencv/blob/master/Dockerfile

.
├── Dockerfile
└── requirements.txt

Dockerfile:

FROM python:3.7

RUN apt-get update \
    && apt-get install -y \
        build-essential \
        cmake \
        git \
        wget \
        unzip \
        yasm \
        pkg-config \
        libswscale-dev \
        libtbb2 \
        libtbb-dev \
        libjpeg-dev \
        libpng-dev \
        libtiff-dev \
        libavformat-dev \
        libpq-dev \
    && rm -rf /var/lib/apt/lists/*

RUN pip install numpy

WORKDIR /
ENV OPENCV_VERSION="4.1.1"
RUN wget https://github.com/opencv/opencv/archive/${OPENCV_VERSION}.zip \
&& unzip ${OPENCV_VERSION}.zip \
&& mkdir /opencv-${OPENCV_VERSION}/cmake_binary \
&& cd /opencv-${OPENCV_VERSION}/cmake_binary \
&& cmake -DBUILD_TIFF=ON \
  -DBUILD_opencv_java=OFF \
  -DWITH_CUDA=OFF \
  -DWITH_OPENGL=ON \
  -DWITH_OPENCL=ON \
  -DWITH_IPP=ON \
  -DWITH_TBB=ON \
  -DWITH_EIGEN=ON \
  -DWITH_V4L=ON \
  -DBUILD_TESTS=OFF \
  -DBUILD_PERF_TESTS=OFF \
  -DCMAKE_BUILD_TYPE=RELEASE \
  -DCMAKE_INSTALL_PREFIX=$(python3.7 -c "import sys; print(sys.prefix)") \
  -DPYTHON_EXECUTABLE=$(which python3.7) \
  -DPYTHON_INCLUDE_DIR=$(python3.7 -c "from distutils.sysconfig import get_python_inc; print(get_python_inc())") \
  -DPYTHON_PACKAGES_PATH=$(python3.7 -c "from distutils.sysconfig import get_python_lib; print(get_python_lib())") \
  .. \
&& make install \
&& rm /${OPENCV_VERSION}.zip \
&& rm -r /opencv-${OPENCV_VERSION}
RUN ln -s \
  /usr/local/python/cv2/python-3.7/cv2.cpython-37m-x86_64-linux-gnu.so \
  /usr/local/lib/python3.7/site-packages/cv2.so

RUN apt-get --fix-missing update \
    && apt-get --fix-broken install \
    && apt-get install -y \
        poppler-utils \
        tesseract-ocr \
        libtesseract-dev \
        libleptonica-dev \
        libsm6 \
        libxext6 \
        python-opencv \
    && ldconfig

COPY ./requirements.txt ./ 
RUN pip3 install --upgrade pip
# install dependencies 
RUN pip3 install -r requirements.txt

Build:

docker image build -t my-awesome-py .

Run:

docker run --rm my-awesome-py tesseract
Usage:
  tesseract --help | --help-extra | --version
  tesseract --list-langs
  tesseract imagename outputbase [options...] [configfile...]

OCR options:
  -l LANG[+LANG]        Specify language(s) used for OCR.
NOTE: These options must occur before any configfile.

Single options:
  --help                Show this help message.
  --help-extra          Show extra help for advanced users.
  --version             Show version information.
  --list-langs          List available languages for tesseract engine.
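The same check can be done from Python, which is closer to what pytesseract does when it launches the binary. Below is a small sketch under the assumption that the executable is called `tesseract` and accepts `--version` (as the usage text above shows); the function name `tesseract_version` is made up for this example.

```python
import subprocess
from typing import Optional


def tesseract_version(cmd: str = "tesseract") -> Optional[str]:
    """Return the first line printed by `cmd --version`, or None if it cannot run."""
    try:
        result = subprocess.run(
            [cmd, "--version"],
            capture_output=True,
            text=True,
            check=True,
            timeout=30,
        )
    except (OSError, subprocess.CalledProcessError, subprocess.TimeoutExpired):
        # Binary missing, not executable, exited with an error, or hung.
        return None
    # Some builds print the version to stderr, so fall back to it.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    version = tesseract_version()
    print(version if version else "tesseract was not found on PATH")
```

If this prints a version line inside the container but pytesseract still raises TesseractNotFoundError, the likely culprit is a hard-coded tesseract_cmd path in the application code rather than the image.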
Neo Anderson
  • Thank you for your time and response. I am trying it now. – liamsuma Jul 31 '20 at 19:15
  • Well, it took too long to build the image and I wasn't sure if that's because of the requirements.txt file. I will share requirements.txt in an edit to be more specific. One thing that I am sure of is that only `tesseract` requires a local dir. The image is still building as of now and I will offer details when it's done. – liamsuma Jul 31 '20 at 20:17
  • I only included `pytesseract` and `tesseract` in requirements.txt. Here is the error message after building the image: **Successfully installed Pillow-7.2.0 pytesseract-0.3.4 tesseract-0.1.3 sync /var/lib/docker/image/overlay2/layerdb/tmp/write-set-050086098/diff: input/output error** – liamsuma Jul 31 '20 at 20:33
  • The image built successfully on my end, but without the requirements (~60 seconds). I ran a container and it seems that tesseract is installed correctly. Editing my answer. I'll try to find some time and reproduce with the requirements. – Neo Anderson Jul 31 '20 at 20:48
  • Please allow me to re-run it without the requirements and will update you on it shortly – liamsuma Jul 31 '20 at 20:55
  • Built successfully with pytesseract and tesseract in requirements. Editing the answer again to reflect the latest successful build. – Neo Anderson Jul 31 '20 at 20:56
  • Building now with full requirements.txt that you added in the edit of your question. Haven't seen it earlier. – Neo Anderson Jul 31 '20 at 21:03
  • I would have upvoted this a thousand times if I could. Neo is very knowledgeable and I really appreciate your help, mate. – liamsuma Jul 31 '20 at 22:06