I want to distribute text detection over a huge dataset on Google Dataflow. I'm using pytesseract, the Python wrapper for Tesseract, which installs without a problem. The problem occurs when installing the tesseract-ocr system package: it seems to install an older version of tesseract-ocr.
I've already tried adding a version number to the package and installing from a tar.gz file. I also tried the alex-p PPA (ppa:alex-p/tesseract-ocr).
Using the alex-p PPA:
CUSTOM_COMMANDS = [
    ['add-apt-repository', 'ppa:alex-p/tesseract-ocr'],
    ['apt-get', 'update'],
    ['apt-get', '--assume-yes', 'install', 'tesseract-ocr'],
    ['pip', 'install', 'pytesseract'],
    ['pip', 'install', 'opencv-python'],
    ['pip', 'install', 'tensorflow']
]
Pinning a version number:
CUSTOM_COMMANDS = [
    ['apt-get', 'update'],
    ['apt-get', '--assume-yes', 'install', 'tesseract-ocr=3.05.00'],
    ['pip', 'install', 'pytesseract'],
    ['pip', 'install', 'opencv-python'],
    ['pip', 'install', 'tensorflow']
]
Installing from a tar.gz file:
dataflow_options = {
    'runner': 'DataflowRunner',
    'job_name': job_name,
    'staging_location': STAGING_LOCATION,
    'temp_location': TEMP_LOCATION,
    'project': PROJECT_ID,
    'service_account_email': SERVICE_ACCOUNT,
    'region': 'europe-west1',
    'zone': 'europe-west1-d',
    'machine_type': 'n1-standard-8',
    'autoscaling_algorithm': 'THROUGHPUT_BASED',
    'save_main_session': True,
    'setup_file': './setup.py',
    'extra_package': './tesseract-4.0.0.tar.gz',
}
The CUSTOM_COMMANDS are executed with the code in this gist: https://gist.github.com/inchoate/bd0ff7f609f57c85d9de8ff9d5586e30
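Roughly, and only as a minimal sketch rather than the gist's exact code, running such command lists comes down to something like the following. The name run_custom_commands is hypothetical, and I'm assuming each entry is an argument list that can be passed straight to subprocess. Failing loudly on a non-zero exit code matters here, because a silently failed add-apt-repository or apt-get step would leave whatever older version the base image's repositories provide.

```python
import subprocess
import sys


def run_custom_commands(commands):
    """Run each command list in order and return the output of each one."""
    outputs = []
    for command in commands:
        # Merge stderr into stdout so apt/pip errors end up in one place.
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        if result.returncode != 0:
            # Stop at the first failing step instead of continuing silently.
            raise RuntimeError(
                "Command %s failed with exit code %d:\n%s"
                % (command, result.returncode, result.stdout)
            )
        outputs.append(result.stdout)
    return outputs


if __name__ == "__main__":
    # Harmless stand-in command; on a worker this would be CUSTOM_COMMANDS.
    print(run_custom_commands([[sys.executable, "-c", "print('ok')"]]))
```

Logging the returned outputs would show which tesseract-ocr version apt actually selected.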
I expect to end up with the latest version of the package installed on Google Dataflow.
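To confirm which version actually ends up installed, something like the sketch below could be run after the install commands. Both function names are hypothetical. I'm assuming the tesseract binary is on PATH and that the first line of `tesseract --version` looks like "tesseract 4.0.0"; some older builds write that line to stderr, so both streams are captured.

```python
import re
import subprocess


def parse_tesseract_version(version_output):
    """Return the version string from `tesseract --version` output, or None."""
    match = re.search(r"tesseract\s+v?(\d+(?:\.\d+)*\S*)", version_output)
    return match.group(1) if match else None


def installed_tesseract_version():
    """Ask the tesseract binary on PATH for its version.

    Raises FileNotFoundError if tesseract is not installed at all.
    """
    result = subprocess.run(
        ["tesseract", "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # some older builds print to stderr
        universal_newlines=True,
    )
    return parse_tesseract_version(result.stdout)


if __name__ == "__main__":
    # Parsing a sample banner, so this part works without tesseract installed.
    print(parse_tesseract_version("tesseract 3.04.01\n leptonica-1.74.1"))
```

Calling installed_tesseract_version() from the same setup step and logging the result would make the installed version visible in the job logs.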