Via Concord, we can automatically spawn Dataproc clusters with PySpark enabled.
In these PySpark notebooks, the Spark version is 2.4.8.
By default, however, Spark 2.4 does not ship the Avro data source, and without it we cannot read .avro files. We tried the following session configuration, but it did not work.
PySpark Session Configuration
from pyspark.sql import SparkSession
from pyspark import SparkConf
import os
# Set the python paths
os.environ['PYSPARK_PYTHON'] = './PYENV1/pyenv1/bin/python3'
os.environ['PYSPARK_DRIVER_PYTHON'] = './PYENV1/pyenv1/bin/python3'
# If your notebook's kernel is PySpark, stop the active spark app
spark.sparkContext.stop()
conf = SparkConf()
# Add the Avro package to the Spark config - this should let worker nodes resolve it
conf.setAll([
    ("spark.app.name", "Avro Testing"),
    ("spark.jars.packages", "org.apache.spark:spark-avro_2.12:2.4.8"),
    # ("spark.yarn.dist.archives", "/opt/conda/anaconda/envs/d0d01af1.zip#D0D01AF1"),
    # ("spark.executor.cores", "2"),
    # ("spark.driver.memory", "5g"),
    # ("spark.task.maxFailures", "100"),
    # ("spark.executor.instances", "100"),
    # ("spark.driver.maxResultSize", "20g"),
    # ("spark.sql.shuffle.partitions", "2048"),
    # ("spark.default.parallelism", "2048"),
    # ("spark.dynamicAllocation.enabled", "false"),
    # ("spark.files.overwrite", "true"),
    # ("spark.sql.broadcastTimeout", "36000"),
    # ("spark.sql.autoBroadcastJoinThreshold", "-1"),
    # ("spark.sql.hive.convertMetastoreOrc", "false"),
])
# Equivalent single setting (Maven coordinates belong in spark.jars.packages, not spark.jars):
# conf.set("spark.jars.packages", "org.apache.spark:spark-avro_2.12:2.4.8")
# Recreate the session with the new configuration
spark = SparkSession.builder.config(conf=conf).getOrCreate()
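As a quick sanity check, the setting can be read back from the recreated session. Note this only confirms the value is stored in the config, not that the jar was actually resolved and put on the classpath:
print(spark.conf.get("spark.jars.packages", "not set"))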
Code to Read Data
GCS_PATH = "gs://gcs_bucket/file.avro"
spark.read.format("avro").load(GCS_PATH)
Error
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
/usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
Py4JJavaError: An error occurred while calling o622.load.
: org.apache.spark.sql.AnalysisException: Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:665)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:213)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:197)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)
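The error message points at the deployment section of the Apache Avro Data Source Guide, which implies the package has to be supplied when the Spark JVM is launched; setting spark.jars.packages on a session that is stopped and recreated inside an already-running kernel may be too late for dependency resolution. Below is a minimal sketch of the launch-time approach via PYSPARK_SUBMIT_ARGS. It assumes a fresh kernel where no SparkSession (and hence no JVM) has been created yet; we have not verified it on Concord-spawned Dataproc notebooks:
import os
from pyspark.sql import SparkSession

# PYSPARK_SUBMIT_ARGS is read when the JVM is launched, so this must run
# before the first SparkSession is created in this Python process.
# The trailing "pyspark-shell" token is required by PySpark.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages org.apache.spark:spark-avro_2.12:2.4.8 pyspark-shell"
)

spark = SparkSession.builder.appName("Avro Testing").getOrCreate()
df = spark.read.format("avro").load("gs://gcs_bucket/file.avro")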