I cannot find an open-source solution for running OCR on images in PySpark. I know solutions like pytesseract exist, but I am not sure they will play nicely with PySpark, since tesseract-ocr would need to be installed on the Linux machines in the cluster. Are there any open-source OCR solutions that work well with PySpark?
Viewed 849 times
- PySpark doesn't replace Linux; it can still run modules that are installed locally on each executor. – OneCricketeer Feb 22 '22 at 22:40
- Start here: https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html – OneCricketeer Feb 22 '22 at 22:42
1 Answer
I could not find a pure-Python library. pytesseract is a wrapper that calls the tesseract-ocr engine, a native Linux package, which I was able to install on a Spark cluster. You can install it on your own Spark cluster fairly easily, and it works well.
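Once tesseract-ocr is present on every worker, calling it from PySpark might look roughly like the sketch below. This is only an illustration, not the exact code I ran: it assumes `pytesseract` and `Pillow` are installed in the executors' Python environment, and the names `ocr_partition` and `_tesseract_ocr` and the image path are made up for the example.

```python
import io


def _tesseract_ocr(image_bytes):
    # Imported lazily so the import runs on the executor, where pytesseract,
    # Pillow and the tesseract binary are ASSUMED to be installed.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(io.BytesIO(image_bytes)))


def ocr_partition(rows, ocr=None):
    """Turn (path, image_bytes) pairs into (path, text, error) tuples.

    `ocr` is a hypothetical hook that lets you swap in another engine
    (or a fake one for testing); by default pytesseract is used.
    """
    ocr = ocr or _tesseract_ocr
    for path, content in rows:
        try:
            yield path, ocr(bytes(content)), None
        except Exception as exc:
            # Keep the job alive on unreadable images and record the reason.
            yield path, None, str(exc)


# Usage on a cluster (not executed here; the path is a placeholder):
#   rdd = spark.sparkContext.binaryFiles("/path/to/images")
#   results = rdd.mapPartitions(ocr_partition)
#   df = results.toDF(["path", "text", "error"])
```

`binaryFiles` yields one `(path, bytes)` pair per file, so each executor runs its locally installed tesseract on its own share of the images. Catching the exception per image is a design choice so that one corrupt file does not fail the whole job.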
Here's an answer on how to install tesseract-ocr on Databricks. I used a global init script to install it.

Salar Satti