Trouble reading avro files in Jupyter notebook using pyspark

Question

I am trying to read an Avro file in a Jupyter notebook using PySpark. When I read the file, I get an error.

I have downloaded spark-avro_2.11:4.0.0.jar, but I am not sure where in my code I should reference the Avro package. Any suggestions would be great.

This is an example of the code I am using to read the Avro file:

df_avro_example = sqlContext.read.format("com.databricks.spark.avro").load("example_file.avro")

This is the error I get:

AnalysisException: 'Failed to find data source: com.databricks.spark.avro. Please find an Avro package at http://spark.apache.org/third-party-projects.html;'

score 0 · Accepted Answer · answered Jun 17 '19 at 14:15

Download the jar to a known location and run the following snippet in your PySpark app before Spark is started:

import os
# Set this before any SparkContext/SparkSession is created; it is only read when Spark launches.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars /path/tojar/spark-avro_2.11:4.0.0.jar pyspark-shell'
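
For context, here is a minimal sketch of how the snippet above might fit together with the read call from the question. The helper names `set_submit_args` and `read_avro` and the app name are hypothetical, and the jar path is copied from the answer as a placeholder. A jar downloaded from Maven is normally named with a hyphen before the version (`spark-avro_2.11-4.0.0.jar`), so adjust the path to match the file actually on disk.

```python
import os

# Placeholder path taken from the answer; replace with the real jar location.
JAR_PATH = "/path/tojar/spark-avro_2.11:4.0.0.jar"


def set_submit_args(jar_path):
    """Tell PySpark to load an extra jar. Must run before Spark starts."""
    args = "--jars {} pyspark-shell".format(jar_path)
    os.environ["PYSPARK_SUBMIT_ARGS"] = args
    return args


def read_avro(path):
    """Create (or reuse) a session and load an Avro file (hypothetical helper)."""
    # Imported lazily so the environment variable is already set.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("avro-example").getOrCreate()
    return spark.read.format("com.databricks.spark.avro").load(path)


submit_args = set_submit_args(JAR_PATH)
# With pyspark installed and the jar in place, the file could then be read with:
# df_avro_example = read_avro("example_file.avro")
```

If a Spark session is already running in the notebook, `getOrCreate()` reuses it and the new `--jars` setting is not picked up, so restarting the kernel and running this cell first is likely needed.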

Ranga Vure

Thank you for your help with this, your advice worked! – Conz Jun 30 '19 at 20:34
I have run into a bit of trouble with my dates and was wondering how to correct it. I used the example below to pull data for the last day of April, the whole month of May and the first day of June. I now want to pull data for the last day of December, the whole month of January and the first day of February, but because December falls in 2018 I am not sure how to adjust my code. Any suggestions, @Ranga Vure? example_file.avro/20190{430,5,601}*\") – Conz Jun 30 '19 at 20:37
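
One possible approach to the date question in the comment above, offered as a sketch rather than a confirmed answer: because the dates no longer share the `20190` prefix, the full date prefixes can be listed inside the braces instead. The helper `date_glob` is hypothetical, and this assumes the file names begin with a `YYYYMMDD` date, as the original pattern suggests.

```python
def date_glob(*prefixes):
    """Build a brace glob that matches names starting with any given prefix."""
    return "{" + ",".join(prefixes) + "}*"


# Same range as the original pattern: 30 April, all of May, 1 June 2019.
spring = date_glob("20190430", "201905", "20190601")

# Range across the year boundary: 31 Dec 2018, all of Jan 2019, 1 Feb 2019.
winter = date_glob("20181231", "201901", "20190201")

print(winter)  # {20181231,201901,20190201}*
```

The resulting string would replace the `20190{430,5,601}*` part of the load path.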
