
I am trying to follow this notebook and get it to work in a Databricks notebook: https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/ocr-spell/OcrSpellChecking.ipynb. However, after installing all the packages, I get stuck when I reach this cell:

{ // for displaying
val regions = data.select("region").collect().map(_.get(0))
regions.foreach{chunk =>
    println("---------------")
    println(chunk)}
}

The error message is:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 51, 10.195.249.145, executor 4): java.lang.NoClassDefFoundError: Could not initialize class net.sourceforge.tess4j.TessAPI

Does anyone know why? Much appreciated!

Zain Farooq
Kay

1 Answer


To use Spark NLP OCR, you need to install Tesseract 4.x+, as the documentation states. On a cluster, it must be installed on every node. However, if you are only dealing with PDFs and not scanned images, you can probably skip the Tesseract 4.x+ installation:
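If you are not sure whether Tesseract is actually present on every node, a rough check like the one below may help. It is only a sketch: it assumes the notebook's `spark` session, that the `tesseract` binary would be on the executors' PATH, and that running more tasks than executors lands at least one task on each node (Spark does not guarantee this). `numTasks` is a made-up value for illustration.

```scala
import java.net.InetAddress
import scala.sys.process._
import scala.util.Try

// Hypothetical value: use more tasks than executors so each node likely runs one.
val numTasks = 32

val report = spark.sparkContext
  .parallelize(1 to numTasks, numTasks)
  .map { _ =>
    val host = InetAddress.getLocalHost.getHostName
    val out = new StringBuilder
    // Capture stdout and stderr, since the version banner may go to either.
    val logger = ProcessLogger(
      line => out.append(line).append('\n'),
      line => out.append(line).append('\n'))
    // Launching a missing binary throws an IOException, which Try turns into a Failure.
    val exitCode = Try(Process(Seq("tesseract", "--version")).!(logger))
    val status =
      if (exitCode.toOption.contains(0))
        out.toString.split("\n").headOption.getOrElse("installed (no version output)")
      else
        "tesseract not found or failed to run"
    (host, status)
  }
  .collect()
  .distinct

// One line per distinct (host, status) pair seen by the tasks.
report.foreach { case (host, status) => println(s"$host -> $status") }
```

Any host that reports "not found" is a likely candidate for the `NoClassDefFoundError` above, since the tess4j class cannot initialize without the native Tesseract library. A host that never shows up in the output was simply not scheduled, so rerun or raise `numTasks` before drawing conclusions.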

import com.johnsnowlabs.nlp.util.io.OcrHelper
val ocrHelper = new OcrHelper()

val df = ocrHelper.createDataset(spark, "/tmp/Test.pdf")

Update: There is a new doc for Spark OCR and special instructions for Databricks:

https://nlp.johnsnowlabs.com/docs/en/ocr

Maziyar