
I have a Spark job that I usually submit to a Hadoop cluster from a local machine. When I submit it with Spark 2.2.0 it works fine, but it fails to start when I submit it with version 2.4.0. Only the SPARK_HOME makes the difference.

drwxr-xr-x  18 me  576 Jan 23 14:15 spark-2.4.0-bin-hadoop2.6
drwxr-xr-x  17 me  544 Jan 23 14:15 spark-2.2.0-bin-hadoop2.6

I submit the job like this:

spark-submit \
--master yarn \
--num-executors 20 \
--deploy-mode cluster \
--executor-memory 8g \
--driver-memory 8g \
--class package.MyMain uberjar.jar \
--param1 ${BLA} \
--param2 ${BLALA}

Why does the new Spark version refuse to take my uberjar? I did not find any relevant changes in the Spark 2.4 documentation. By the way: the jar was built with Spark 2.1 as a dependency. Any ideas?

EDIT: I think my problem is NOT related to Spark failing to find things in my uberjar. Rather, I might have a problem with the new built-in Avro functionality. As before, I read Avro files by using the implicit function spark.read.avro from com.databricks.spark.avro._. Spark 2.4.0 has some new built-in Avro support (most of it found in org.apache.spark:spark-avro_2.11:2.4.0). The failure might have something to do with this.
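
For reference, the read roughly looks like this (a minimal sketch; the session setup and the path are placeholders, not my actual code):

import org.apache.spark.sql.SparkSession
import com.databricks.spark.avro._  // provides the implicit .avro(...) on DataFrameReader

val spark = SparkSession.builder().appName("avro-read-sketch").getOrCreate()

// Works when submitted with Spark 2.2.0; with Spark 2.4.0 the same call fails
// with the ClassNotFoundException shown below.
val vectors = spark.read.avro("hdfs:///path/to/training/vectors")  // placeholder path
vectors.printSchema()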

So I think the problem lies deeper. The actual error I get is:

java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.sql.avro.AvroFileFormat. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at com.databricks.spark.avro.package$AvroDataFrameReader$$anonfun$avro$2.apply(package.scala:34)
at com.databricks.spark.avro.package$AvroDataFrameReader$$anonfun$avro$2.apply(package.scala:34)
at myproject.io.TrainingFileIO.readVectorAvro(TrainingFileIO.scala:59)
at myproject.training.MainTraining$.train(MainTraining.scala:37)

  • Post your complete error stack trace. – subodh Jan 28 '19 at 14:00
  • I have some findings: it might have to do with a conflict between Spark 2.4.0's built-in Avro and the com.databricks.spark.avro package which I use, but I have not solved the problem yet. I'll have a look later. – Antalagor Jan 28 '19 at 16:36

1 Answer


It seems Spark 2.4.0 needs --packages org.apache.spark:spark-avro_2.11:2.4.0 in order to run the old com.databricks.spark.avro code. There is a description here: https://spark.apache.org/docs/latest/sql-data-sources-avro.html
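
With the submit command from above, that is (the --packages flag is the only change, everything else stays as it was):

spark-submit \
--master yarn \
--num-executors 20 \
--deploy-mode cluster \
--executor-memory 8g \
--driver-memory 8g \
--packages org.apache.spark:spark-avro_2.11:2.4.0 \
--class package.MyMain uberjar.jar \
--param1 ${BLA} \
--param2 ${BLALA}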

So my problem did not have anything to do with a missing class in my jar; rather, it was a conflict with the new built-in Avro support in the new Spark version.
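
Alternatively, one could drop the Databricks API and use the built-in source directly via its short name. A rough sketch, assuming the spark-avro package is on the classpath and using a placeholder path:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("builtin-avro-sketch").getOrCreate()

// Requires --packages org.apache.spark:spark-avro_2.11:2.4.0 (or the jar via --jars).
val df = spark.read.format("avro").load("hdfs:///path/to/training/vectors")  // placeholder path
df.printSchema()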

  • How do I add the package? Is it a jar file? – djohon Jun 17 '19 at 10:01
  • You add it as mentioned above, by specifying the Maven coordinates. I am not quite certain where you configure the repositories `spark-submit` looks at while resolving the dependencies; in my case it looks in my local Maven repo, at the central repository, and in a remote spark-packages repo. In any case, you can specify your desired repo with `--repositories`. If you want to submit an additional jar instead, you can do that with `--jars`. – Antalagor Jun 20 '19 at 15:24
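
For example, the `--jars` route mentioned in the last comment would look roughly like this (the local jar path is a placeholder; the rest mirrors the command from the question):

spark-submit \
--master yarn \
--deploy-mode cluster \
--jars /local/path/to/spark-avro_2.11-2.4.0.jar \
--class package.MyMain uberjar.jar \
--param1 ${BLA} \
--param2 ${BLALA}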