Can't get the John Snow Labs OCR notebook to run on Databricks

Question

I am trying to follow this notebook and get it to work in a Databricks notebook: https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/ocr-spell/OcrSpellChecking.ipynb. However, after installing all the packages, I still get stuck when I get to:

{ // for displaying
val regions = data.select("region").collect().map(_.get(0))
regions.foreach{chunk =>
    println("---------------")
    println(chunk)}
}

The error message is:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 51, 10.195.249.145, executor 4): java.lang.NoClassDefFoundError: Could not initialize class net.sourceforge.tess4j.TessAPI

Does anyone know why? Much appreciated!

I just attached the JARs as libraries to the cluster. I also tried importing all the packages and functions needed by OcrHelper, and they all worked. — Kay, Dec 20 '18 at 05:37

Maziyar · Answer 1 · 2019-09-08T09:24:02.290

To use Spark NLP OCR you need to install Tesseract 4.x+, as the documentation states. On a cluster, it must be installed on every node, not just the driver. However, if you are only dealing with PDFs and not scanned images, you can probably skip the Tesseract 4.x+ installation:

import com.johnsnowlabs.nlp.util.io.OcrHelper
val ocrHelper = new OcrHelper()

val df = ocrHelper.createDataset(spark, "/tmp/Test.pdf")
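
If you do need Tesseract, it can help to confirm that the binary is actually visible on the worker nodes before running the OCR stage. A `Could not initialize class net.sourceforge.tess4j.TessAPI` error on an executor is consistent with the native Tesseract library failing to load there. The snippet below is only a rough sketch, not part of Spark NLP: `TesseractCheck` and its methods are hypothetical helper names, `spark` is assumed to be the notebook's `SparkSession`, and it assumes the `tesseract` command is on the `PATH` wherever it is installed. Spark does not guarantee that a task lands on every executor, so treat a clean report as a hint rather than proof.

```scala
import scala.sys.process._
import scala.util.Try

// Hypothetical helper (not a Spark NLP API): probes for the tesseract CLI.
object TesseractCheck extends Serializable {

  // First non-empty line of the captured output, or a placeholder.
  def firstLine(s: String): String =
    s.split('\n').headOption.map(_.trim).filter(_.nonEmpty).getOrElse("no output")

  // Runs `tesseract --version` on the current machine. Some versions print
  // the version to stderr, so both streams are captured.
  def version(): String =
    Try {
      val out = new StringBuilder
      val logger = ProcessLogger(
        line => out.append(line).append('\n'),
        line => out.append(line).append('\n'))
      Seq("tesseract", "--version").!(logger)
      firstLine(out.toString)
    }.getOrElse("tesseract not found on PATH")
}

// Spread many small tasks over the cluster and collect one result per host.
val report = spark.sparkContext
  .parallelize(1 to 100, 100)
  .map(_ => (java.net.InetAddress.getLocalHost.getHostName, TesseractCheck.version()))
  .distinct()
  .collect()

report.foreach { case (host, v) => println(s"$host -> $v") }
```

Any host that reports `tesseract not found on PATH` (or a version below 4.x) is a likely source of the executor failure.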

Update: there is new documentation for Spark OCR, including special instructions for Databricks:

https://nlp.johnsnowlabs.com/docs/en/ocr
