I am trying to read an Avro file in a Jupyter notebook, but I get this error:

Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.avro.AvroFileFormat.DefaultSource

I can't figure out where to get this dependency from.

import findspark
findspark.init()

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *

spark = SparkSession.builder.appName("readavro").master("local").getOrCreate()

result = spark.read.format('com.databricks.spark.avro').load("file:///C:/Downloads/part-r-00000.avro")
user1298426

1 Answer

Make sure the org.apache.spark:spark-avro_2.12:2.4.5 jar is on your classpath. The spark-avro module is external and is not bundled with Spark by default, so DataFrameReader and DataFrameWriter have no .avro method; specify the format by name instead. Try:

result = spark.read.format('avro').load("file:///C:/Downloads/part-r-00000.avro")

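and include the Avro dependency when launching Spark (the Scala suffix and the version in the coordinate must match your Spark build):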

$ bin/spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.5
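
Since the question is about a Jupyter notebook and not spark-shell, the same dependency can probably be requested from Python through the spark.jars.packages setting. This is a minimal sketch, not a tested fix for this exact setup: the helper names avro_package and build_session are made up for illustration, and the versions "2.12" and "2.4.5" are assumptions carried over from the coordinate above, so adjust them to your installation.

```python
def avro_package(scala_version: str, spark_version: str) -> str:
    """Build the Maven coordinate for the external spark-avro module."""
    return f"org.apache.spark:spark-avro_{scala_version}:{spark_version}"


def build_session(package: str):
    """Create a local SparkSession that requests `package` at startup."""
    # Imported here so avro_package() can be used without pyspark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName("readavro")
        .master("local")
        # spark.jars.packages is read when the JVM starts, so it has to be
        # set before the first session is created in the notebook.
        .config("spark.jars.packages", package)
        .getOrCreate()
    )


# Usage in a fresh notebook kernel (versions are assumptions; match your install):
# spark = build_session(avro_package("2.12", "2.4.5"))
# result = spark.read.format("avro").load("file:///C:/Downloads/part-r-00000.avro")
```

If a SparkSession already exists in the kernel, getOrCreate() returns that one and the package setting will likely have no effect, so restart the kernel first. Calling findspark.init() beforehand, as in the question, still applies.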
QuickSilver