Via Concord, we can automatically spawn Dataproc clusters with PySpark enabled.
In these PySpark notebooks, the Spark version is 2.4.8.
By default, however, Spark 2.4 does not ship the Avro data source, and without it we cannot read .avro files. We tried the following session configuration, but it did not work.
PySpark Session Configuration
from pyspark.sql import SparkSession
from pyspark import SparkConf
import os
# Set the python paths
os.environ['PYSPARK_PYTHON'] = './PYENV1/pyenv1/bin/python3'
os.environ['PYSPARK_DRIVER_PYTHON'] = './PYENV1/pyenv1/bin/python3'
# If your notebook's kernel is PySpark, stop the active spark app
spark.sparkContext.stop()
conf = SparkConf()
# Add the Avro package to the Spark config - this should let worker nodes resolve it
conf.setAll([
    ("spark.app.name", "Avro Testing"),
    ("spark.jars.packages", "org.apache.spark:spark-avro_2.12:2.4.8"),
    # ("spark.yarn.dist.archives", "/opt/conda/anaconda/envs/d0d01af1.zip#D0D01AF1"),
    # ("spark.executor.cores", "2"),
    # ("spark.driver.memory", "5g"),
    # ("spark.task.maxFailures", "100"),
    # ("spark.executor.instances", "100"),
    # ("spark.driver.maxResultSize", "20g"),
    # ("spark.sql.shuffle.partitions", "2048"),
    # ("spark.default.parallelism", "2048"),
    # ("spark.dynamicAllocation.enabled", "false"),
    # ("spark.files.overwrite", "true"),
    # ("spark.sql.broadcastTimeout", "36000"),
    # ("spark.sql.autoBroadcastJoinThreshold", "-1"),
    # ("spark.sql.hive.convertMetastoreOrc", "false"),
])
# Equivalent single setting (Maven coordinates belong in spark.jars.packages, not spark.jars):
# conf.set("spark.jars.packages", "org.apache.spark:spark-avro_2.12:2.4.8")
# Recreate the session with the new configuration
spark = SparkSession.builder.config(conf=conf).getOrCreate()
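As a quick sanity check, the setting can be read back from the recreated session. Note this only confirms the value is stored in the config, not that the jar was actually resolved and put on the classpath:
print(spark.conf.get("spark.jars.packages", "not set"))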
Code to Read Data
GCS_PATH = "gs://gcs_bucket/file.avro"
spark.read.format("avro").load(GCS_PATH)
Error
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
/usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
Py4JJavaError: An error occurred while calling o622.load.
: org.apache.spark.sql.AnalysisException: Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:665)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:213)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:197)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)
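The error message points at the deployment section of the Apache Avro Data Source Guide, which implies the package has to be supplied when the Spark JVM is launched; setting spark.jars.packages on a session that is stopped and recreated inside an already-running kernel may be too late for dependency resolution. Below is a minimal sketch of the launch-time approach via PYSPARK_SUBMIT_ARGS. It assumes a fresh kernel where no SparkSession (and hence no JVM) has been created yet; we have not verified it on Concord-spawned Dataproc notebooks:
import os
from pyspark.sql import SparkSession

# PYSPARK_SUBMIT_ARGS is read when the JVM is launched, so this must run
# before the first SparkSession is created in this Python process.
# The trailing "pyspark-shell" token is required by PySpark.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages org.apache.spark:spark-avro_2.12:2.4.8 pyspark-shell"
)

spark = SparkSession.builder.appName("Avro Testing").getOrCreate()
df = spark.read.format("avro").load("gs://gcs_bucket/file.avro")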