
I am really struggling with this one. I've spent a lot of time searching the Spark documentation and Stack Overflow posts for an answer, and I really need help.

I've installed Apache Spark on my Mac to build and debug PySpark code locally. In my PySpark code I need to read an Avro file into a DataFrame, which requires the spark-avro package. Currently I use spark-submit to fetch spark-avro from the Internet, and it works fine:

spark-submit --packages org.apache.spark:spark-avro_2.12:3.2.1 spark-load-avro.py

My code:

from pyspark.sql import SparkSession

# Create the Spark session.
spark = SparkSession.builder.getOrCreate()

# Create the Spark context.
sc = spark.sparkContext

# Load the Avro files into a DataFrame.
df = spark.read.format("avro").load("/my-avro-files/")

But I need the spark-avro package to be sourced directly from my Mac, not downloaded at submit time. I found a spark-avro jar on https://spark-packages.org/package/databricks/spark-avro (spark-avro_2.11-4.0.0.jar) and saved it in a local directory on my Mac.

How do I make spark-submit load spark-avro from that jar on my mac? I tried this, but it did not work:

spark-submit --conf "spark.driver.extraClassPath=/spark-jars/spark-avro_2.11-4.0.0.jar" spark-load-avro.py

Error:

pyspark.sql.utils.AnalysisException:  Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".
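
Another route I've seen mentioned is setting the spark.jars configuration on the session builder instead of passing a driver classpath, though I haven't confirmed whether it resolves this error (a sketch, with my local jar path, so the path itself is an assumption):

```python
from pyspark.sql import SparkSession

# Point the session at the local jar via spark.jars (local path is mine;
# whether this jar version is compatible with my Spark build is unverified).
spark = (
    SparkSession.builder
    .config("spark.jars", "/spark-jars/spark-avro_2.11-4.0.0.jar")
    .getOrCreate()
)

df = spark.read.format("avro").load("/my-avro-files/")
```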

Could anyone explain how to use the spark-avro package with spark-submit from a local jar on my Mac? What needs to be done besides downloading the jar?

I am looking for an example spark-submit command, its syntax, and any detailed prerequisite steps (if there are any).

I am assuming that downloading spark-avro_2.11-4.0.0.jar as described above is the right path for referencing the spark-avro package directly from my Mac instead of the Internet...
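
From the "Submitting Applications" section of the Spark docs, my understanding is that spark-submit's --jars option is the intended way to ship a local jar to both the driver and the executors. This is only a sketch of what I think the command should look like (the jar path is my local one, and whether this jar's Scala/Spark versions match my installed Spark is an assumption I haven't verified):

```shell
# Supply the local spark-avro jar to driver and executors via --jars
# (jar path and version compatibility are assumptions).
spark-submit --jars /spark-jars/spark-avro_2.11-4.0.0.jar spark-load-avro.py
```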

Please help :)

bda