
I tried to run my Spark 2.3.0/Scala code on a Cloud Dataproc 1.4 cluster, which has Spark 2.4.8 installed. I got an error when reading Avro files. Here's my code:

sparkSession.read.format("com.databricks.spark.avro").load(input)

This code failed as expected. Then I added this dependency to my pom.xml file:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-avro_2.11</artifactId>
    <version>2.4.0</version>
</dependency>

This made my code run successfully. And this is the part that I don't understand: I'm still using the com.databricks.spark.avro format in my code. Why did adding the org.apache.spark spark-avro dependency solve my problem, given that I'm not actually referencing it in my code?

I was expecting that I would need to change my code to something like this:

sparkSession.read.format("avro").load(input)
Rogelio Monter
  • I guess what happened was that "com.databricks.spark.avro" is now just an alias of "avro" for backward compatibility. – Dagang Dec 17 '21 at 23:25
  • @Dagang I'm not sure I understand your comment – jonas lahwf Dec 19 '21 at 13:26
  • It is important to add the correct dependencies in your `pom.xml` file, because this file contains the project information and configuration details that Maven uses to build the project. The `org.apache.spark` spark-avro dependency is needed to load/save data in Avro format; you then specify the data source format as `avro` (or the fully qualified `org.apache.spark.sql.avro`). – Jose Gutierrez Paliza Dec 20 '21 at 23:46

1 Answer


This is a historical artifact: Avro support was initially added to Spark by Databricks in their proprietary Spark Runtime as the com.databricks.spark.avro format. When Avro support was later added to open-source Spark as the avro format, support for the com.databricks.spark.avro format was retained for backward compatibility, as long as the spark.sql.legacy.replaceDatabricksSparkAvro.enabled property is set to true (which it is by default in Spark 2.4):

If it is set to true, the data source provider com.databricks.spark.avro is mapped to the built-in but external Avro data source module for backward compatibility.
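
For illustration, here is a minimal sketch showing both format names resolving to the same built-in Avro source, assuming the spark-avro dependency from the question is on the classpath and `input` is a hypothetical path to Avro files; the config call just makes the default explicit:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("avro-format-alias")
  // true is already the default in Spark 2.4; set explicitly here for clarity
  .config("spark.sql.legacy.replaceDatabricksSparkAvro.enabled", "true")
  .getOrCreate()

val input = "gs://my-bucket/events/*.avro" // hypothetical path

// With the legacy flag enabled, both names load through the same built-in Avro data source
val viaLegacyName = spark.read.format("com.databricks.spark.avro").load(input)
val viaBuiltinName = spark.read.format("avro").load(input)

viaBuiltinName.printSchema()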

Igor Dvorzhak