I want to read Avro files located in Amazon S3 from a Zeppelin notebook. I understand Databricks has a wonderful package for this, spark-avro. What steps do I need to take to bootstrap this jar file onto my cluster and make it work?
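So far, my best guess is to pull the package in through Zeppelin's dependency interpreter before the Spark interpreter starts, along these lines (the artifact coordinates and version here are just my assumption; the right ones presumably depend on the Spark and Scala versions on the cluster):

%dep
// has to run before the Spark interpreter is first used;
// restart the interpreter and re-run this paragraph if needed
z.load("com.databricks:spark-avro_2.11:4.0.0")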
When I run this in my notebook:
val df = sqlContext.read.avro("s3n://path_to_avro_files_in_one_bucket/")
I get the following error:
<console>:34: error: value avro is not a member of org.apache.spark.sql.DataFrameReader
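From what I can tell, the .avro method comes from an implicit that spark-avro adds to DataFrameReader, so even with the jar on the classpath I suspect an import is needed, something like:

// brings read.avro(...) into scope via spark-avro's implicits
import com.databricks.spark.avro._
val df = sqlContext.read.avro("s3n://path_to_avro_files_in_one_bucket/")

or, without the implicit, the long-form variant:

val df = sqlContext.read.format("com.databricks.spark.avro").load("s3n://path_to_avro_files_in_one_bucket/")

Either way, the error above makes me think the jar itself never made it onto my classpath.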
I have had a look at this, but the solution posted there does not seem to work with the latest version of Amazon EMR.
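For what it's worth, I was hoping to avoid a custom bootstrap action entirely by declaring the package in Zeppelin's Spark interpreter settings (or in EMR's spark-defaults classification), along these lines (again, the version is my guess):

spark.jars.packages   com.databricks:spark-avro_2.11:4.0.0

but I have not been able to confirm that this is the supported route on EMR.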
If someone could give me pointers, that would really help.