
I cannot find an open source solution for OCRing images in PySpark. I know solutions like pytesseract exist, but I am not sure they will play nicely with PySpark, since tesseract-ocr would need to be installed on the Linux machines in the cluster. Are there any open source OCR solutions that work well with PySpark?

1 Answer


I could not find a pure-Python library. pytesseract is a wrapper that calls the tesseract executable, which comes from the Linux package tesseract-ocr, and I was able to install that on a Spark cluster. You can install it on your own Spark cluster fairly easily, and it works well.

Here's an answer on how to install it on Databricks. I used global init scripts to install it:

How to install Tesseract OCR on Databricks
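
Once tesseract-ocr and pytesseract are available on every worker, the OCR step itself might look roughly like the sketch below. It is only an illustration under some assumptions: it uses Spark's `binaryFile` data source (Spark 3.0 or later), Pillow is installed alongside pytesseract, and the names `ocr_image_bytes`, `ocr_directory`, and the `ocr_fn` hook are hypothetical helpers I made up, not part of pytesseract or PySpark.

```python
import io


def ocr_image_bytes(content, ocr_fn=None):
    """Return the text found in one image passed as raw bytes.

    ocr_fn is a hypothetical hook (not a pytesseract feature) that lets
    you swap in another OCR callable, e.g. to try this without Tesseract.
    """
    if content is None:
        return None
    if ocr_fn is not None:
        return ocr_fn(content)
    # Imported here so the imports are resolved on the executor, where
    # pytesseract, Pillow and the tesseract binary must all be installed.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(io.BytesIO(content)))


def ocr_directory(spark, path):
    """Load images under `path` as binary files and add a `text` column."""
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    ocr_udf = udf(ocr_image_bytes, StringType())
    # binaryFile yields the columns path, modificationTime, length, content.
    df = spark.read.format("binaryFile").load(path)
    return df.select("path", ocr_udf("content").alias("text"))
```

On a cluster you would call something like `ocr_directory(spark, "/mnt/images/*.png")`, where the path is a placeholder for wherever your images live. If the tesseract binary is missing on a worker, pytesseract raises an error inside the UDF, which is a quick way to find out whether the init script ran on every node.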