
I'm currently trying to run a Spark Scala job on our HDInsight cluster with the external library spark-avro, without success. Could someone help me out with this? The goal is to find the necessary steps to read Avro files residing on Azure Blob Storage from an HDInsight cluster.

Current specs:

  • Spark 2.0 on Linux (HDI 3.5) cluster type
  • Scala 2.11.8
  • spark-assembly-2.0.0-hadoop2.7.0-SNAPSHOT.jar
  • spark-avro_2.11:3.2.0

Tutorial used: https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-intellij-tool-plugin

Spark Scala code:

Based on the example at: https://github.com/databricks/spark-avro

import com.databricks.spark.avro._
import org.apache.spark.sql.SparkSession

object AvroReader {

  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder().master("local").getOrCreate()

    val df = spark.read.avro("wasb://container@storageaccount.blob.core.windows.net/directory")
    df.head(5)
  }
}

Error received:

java.lang.NoClassDefFoundError: com/databricks/spark/avro/package$
    at MediahuisHDInsight.AvroReader$.main(AvroReader.scala:14)
    at MediahuisHDInsight.AvroReader.main(AvroReader.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:627)
Caused by: java.lang.ClassNotFoundException: com.databricks.spark.avro.package$
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 7 more
JMordijck
  • Please provide your build file. It looks like your jar expects a certain runtime dependency. – Vidya Apr 04 '17 at 16:22

1 Answer


By default, your default_artifact.jar contains only your own classes, not the classes from the libraries you reference, which is why com.databricks.spark.avro.package$ cannot be found at runtime. You can presumably use the "Referenced Jars" input field for this.
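
Another option, if you submit from the command line rather than through the plugin, is to let spark-submit resolve the library from Maven at submission time. A minimal sketch, reusing the jar and main-class names from your post:

spark-submit \
  --class MediahuisHDInsight.AvroReader \
  --packages com.databricks:spark-avro_2.11:3.2.0 \
  default_artifact.jar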

Another way is to add your libraries, unpacked, to your artifact. Go to File -> Project Structure and select your artifact under Artifacts. Under Available Elements, right-click the spark-avro library and select Extract Into Output Root. Click OK, then Build -> Build Artifacts and resubmit.
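
If you ever move the build to sbt instead of the IDE artifact, the equivalent is to ship a fat jar that bundles spark-avro. A minimal build.sbt sketch (hypothetical, not taken from your project; the "provided" scope assumes the cluster supplies Spark itself):

// build.sbt -- sketch for bundling spark-avro into a fat jar
name := "AvroReader"
scalaVersion := "2.11.8"

// Spark itself is already on the HDInsight cluster, so keep it out of the fat jar
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0" % "provided"

// spark-avro must be bundled, since the cluster does not provide it
libraryDependencies += "com.databricks" %% "spark-avro" % "3.2.0"

With the sbt-assembly plugin enabled in project/plugins.sbt (addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")), running sbt assembly produces a jar that contains com.databricks.spark.avro and can be submitted directly.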

  • Thank you for your help! The "Extract into output root" functionality fixed the issue. Much appreciated! – JMordijck Apr 05 '17 at 08:21