I'm currently trying to run a Spark Scala job with the external spark-avro library on our HDInsight cluster, without success. Could someone help me out with this? The goal is to find the necessary steps to read Avro files residing on Azure Blob Storage from HDInsight clusters.
Current specs:
- Spark 2.0 on Linux (HDI 3.5) cluster type
- Scala 2.11.8
- spark-assembly-2.0.0-hadoop2.7.0-SNAPSHOT.jar
- spark-avro_2.11:3.2.0
Tutorial used: https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-intellij-tool-plugin
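For context, the spark-avro dependency comes from the project build; assuming an sbt build (a sketch, the actual project settings may differ), the relevant part looks roughly like this:

scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  // Spark itself is provided by the HDInsight cluster at runtime
  "org.apache.spark" %% "spark-sql"  % "2.0.0" % "provided",
  // spark-avro also has to reach the driver/executor classpath,
  // e.g. via a fat jar or spark-submit --packages/--jars
  "com.databricks"   %% "spark-avro" % "3.2.0"
)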
Spark Scala code, based on the example at https://github.com/databricks/spark-avro:
import com.databricks.spark.avro._
import org.apache.spark.sql.SparkSession

object AvroReader {
  def main(arg: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").getOrCreate()
    val df = spark.read.avro("wasb://container@storageaccount.blob.core.windows.net/directory")
    df.head(5)
  }
}
Error received:
java.lang.NoClassDefFoundError: com/databricks/spark/avro/package$
at MediahuisHDInsight.AvroReader$.main(AvroReader.scala:14)
at MediahuisHDInsight.AvroReader.main(AvroReader.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:627)
Caused by: java.lang.ClassNotFoundException: com.databricks.spark.avro.package$
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 7 more
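In case it is relevant: the same read can also be written with the generic format string instead of the .avro() helper, roughly like the sketch below (untested here; it still needs the spark-avro classes on the classpath):

// alternative to the implicit .avro() extension method (sketch)
val df = spark.read
  .format("com.databricks.spark.avro")
  .load("wasb://container@storageaccount.blob.core.windows.net/directory")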