I want to distribute text detection over a very large dataset on Google Dataflow. I'm using pytesseract, the Python wrapper for Tesseract, which installs without a problem. The problem occurs when installing the tesseract-ocr system package: it seems to install an older version of tesseract-ocr.

I've already tried pinning a version number on the package and installing from a tar.gz file. I also tried the ppa:alex-p/tesseract-ocr PPA.

ppa:alex-p:

CUSTOM_COMMANDS = [
    ['add-apt-repository', 'ppa:alex-p/tesseract-ocr'],
    ['apt-get', 'update'],
    ['apt-get', '--assume-yes', 'install', 'tesseract-ocr'],

    ['pip', 'install', 'pytesseract'],
    ['pip', 'install', 'opencv-python'],
    ['pip', 'install', 'tensorflow']
]

Version number:

CUSTOM_COMMANDS = [
    ['apt-get', 'update'],
    ['apt-get', '--assume-yes', 'install', 'tesseract-ocr=3.05.00'],

    ['pip', 'install', 'pytesseract'],
    ['pip', 'install', 'opencv-python'],
    ['pip', 'install', 'tensorflow']
]

Installing through a file:

dataflow_options = {
        'runner': 'DataflowRunner',
        'job_name':  job_name,
        'staging_location': STAGING_LOCATION,
        'temp_location': TEMP_LOCATION,
        'project': PROJECT_ID,
        'service_account_email': SERVICE_ACCOUNT,
        'region': 'europe-west1',
        'zone': 'europe-west1-d',
        'machine_type': 'n1-standard-8',
        'autoscaling_algorithm': 'THROUGHPUT_BASED',
        'save_main_session': True,
        'setup_file': './setup.py',
        'extra_package': './tesseract-4.0.0.tar.gz',
    }

The CUSTOM_COMMANDS are executed by the setup.py code from this gist: https://gist.github.com/inchoate/bd0ff7f609f57c85d9de8ff9d5586e30
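
The gist itself is not reproduced here. As a rough sketch only, a list like CUSTOM_COMMANDS is usually run one command at a time through a subprocess, failing fast when a command exits with a non-zero code. The function name `run_custom_commands` and the `DEMO_COMMANDS` list are hypothetical stand-ins, not taken from the gist; on a real worker the list would hold the apt-get and pip commands shown above.

```python
import subprocess
import sys


def run_custom_commands(commands):
    """Run each command in order and stop at the first failure.

    Hypothetical helper: the gist linked in the question may differ.
    """
    for command in commands:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # capture errors together with output
            universal_newlines=True,
        )
        print('Running %s' % command)
        print('Command output: %s' % result.stdout)
        if result.returncode != 0:
            raise RuntimeError(
                'Command %s failed with exit code %d'
                % (command, result.returncode))


# Stand-in commands so the sketch runs anywhere; these are assumptions,
# not the real installation steps.
DEMO_COMMANDS = [
    [sys.executable, '-c', 'print("step 1")'],
    [sys.executable, '-c', 'print("step 2")'],
]

run_custom_commands(DEMO_COMMANDS)
```

Logging the combined output of every command makes it easier to see in the worker logs which package version apt-get actually selected.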

I expect the latest version of tesseract-ocr to be installed on the Google Dataflow workers.

Jacob Verschaeve

1 Answer

Which OS version are you trying to install Tesseract on? I had the same problem when running Tesseract on an older version of Ubuntu. On Ubuntu 18.04 (Bionic), installing Tesseract with:

sudo apt install -y libtesseract-dev libleptonica-dev tesseract-ocr

should work. You might also want to install it from the Git sources.
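
To confirm which version actually ended up on a machine, the output of `tesseract --version` can be checked. The following is a minimal sketch, assuming the first line of that output looks like `tesseract 4.0.0-beta.1`; the helper names `parse_tesseract_version` and `installed_tesseract_version` are made up for illustration. Standard output and standard error are merged because, as far as I know, older releases print the version on standard error.

```python
import re
import subprocess


def parse_tesseract_version(output):
    """Extract (major, minor, patch) from `tesseract --version` output."""
    match = re.search(r'tesseract\s+v?(\d+)\.(\d+)(?:\.(\d+))?', output)
    if match is None:
        raise ValueError('Unrecognised version output: %r' % output)
    major, minor, patch = match.groups()
    # A missing patch component is treated as 0.
    return (int(major), int(minor), int(patch or 0))


def installed_tesseract_version():
    """Ask the tesseract binary on PATH for its version."""
    result = subprocess.run(
        ['tesseract', '--version'],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge both streams before parsing
        universal_newlines=True,
    )
    return parse_tesseract_version(result.stdout)


if __name__ == '__main__':
    try:
        print('Installed Tesseract version:', installed_tesseract_version())
    except FileNotFoundError:
        print('tesseract is not on PATH')
```

Running this check as the last installation step would show in the logs whether the expected 4.x release or an older 3.x package was installed.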

Tomasz Bartkowiak