I am trying to read an Avro file in a Jupyter notebook using PySpark. When I read the file, I get an error.

I have downloaded spark-avro_2.11:4.0.0.jar, but I am not sure where in my code I should load the Avro package. Any suggestions would be great.

This is an example of the code I am using to read the Avro file:

df_avro_example = sqlContext.read.format("com.databricks.spark.avro").load("example_file.avro")

This is the error I get:

AnalysisException: 'Failed to find data source: com.databricks.spark.avro. Please find an Avro package at http://spark.apache.org/third-party-projects.html;'

piet.t
Conz

1 Answer

Download the jar to a local directory and add the following snippet to your PySpark app, before the Spark context is created:

import os

# Set this before the SparkContext/SparkSession is created, otherwise the jar is not picked up
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars /path/tojar/spark-avro_2.11:4.0.0.jar pyspark-shell'
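
For context, here is a minimal sketch of how the snippet above might fit together with the read call from the question. The helper names `build_submit_args` and `read_avro`, the app name, and the constant `JAR_PATH` are hypothetical and only for illustration; the jar path is the placeholder from the answer and should be replaced with the real location.

```python
import os

# Placeholder path taken from the answer; replace with wherever the jar was saved.
JAR_PATH = "/path/tojar/spark-avro_2.11:4.0.0.jar"


def build_submit_args(jar_paths):
    """Return a PYSPARK_SUBMIT_ARGS value that adds the given jars (comma-separated)."""
    return "--jars {} pyspark-shell".format(",".join(jar_paths))


# This must run before the first SparkSession/SparkContext is created.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args([JAR_PATH])


def read_avro(path):
    """Read an Avro file with the Databricks data source (hypothetical helper)."""
    # Imported lazily so the environment variable above is set first.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("avro-example").getOrCreate()
    return spark.read.format("com.databricks.spark.avro").load(path)
```

With this in place, `df_avro_example = read_avro("example_file.avro")` should load the file, assuming the notebook kernel was restarted so that no Spark context existed before the variable was set.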
Ranga Vure
  • Thank you for your help with this, your advice worked! – Conz Jun 30 '19 at 20:34
  • I have run into a bit of trouble with my dates and was wondering how to correct it. I used the pattern below to pull data for the last day of April, the whole month of May, and the first day of June. Now I want to pull data for the last day of December, the whole month of January, and the first day of February, but because December falls in 2018 I am not sure how to adjust my code. Any suggestions, @Ranga Vure? example_file.avro/20190{430,5,601}* – Conz Jun 30 '19 at 20:37
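
One possible approach to the date question, sketched under assumptions: the files sit under `example_file.avro/` with names starting with a `YYYYMMDD` prefix (as the pattern in the comment suggests), and the loader accepts Hadoop-style `{a,b,c}` glob alternation. The helper `date_glob` is hypothetical. Listing each full date prefix inside the braces avoids having to share a common year prefix, so a range that crosses from 2018 into 2019 can be expressed.

```python
def date_glob(base, prefixes):
    """Build a Hadoop-style glob matching any file whose name starts with one of the prefixes."""
    return "{}/{{{}}}*".format(base, ",".join(prefixes))


# Last day of December 2018, all of January 2019, first day of February 2019.
pattern = date_glob("example_file.avro", ["20181231", "201901", "20190201"])
print(pattern)  # example_file.avro/{20181231,201901,20190201}*
```

The resulting string would then be passed to `.load(...)` in place of the original pattern.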