I have an existing Dataproc cluster running Spark 3.3. According to the docs (https://docs.delta.io/latest/releases.html), Delta Lake 2.3 is compatible with Spark 3.3, so I followed the steps below to install Delta Lake.
- Configuration on Jupyter
Kernel: /opt/conda/miniconda3/bin/python
Python version: 3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:23:14) [GCC 10.4.0]
PySpark version: 3.4.1
spark version: 3.3.0
- On the master node, executed:
pip install delta-spark==2.3.0
- Downloaded the Delta Lake jar to /usr/lib/spark/jars/ using the command below:
sudo wget https://repo1.maven.org/maven2/io/delta/delta-core_2.12/2.3.0/delta-core_2.12-2.3.0.jar
- Added the entries below to /etc/spark/conf/spark-defaults.conf:
spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
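Two things in the setup above may be worth sanity-checking: the PySpark package (3.4.1) and the cluster's Spark (3.3.0) differ in minor version, and only the delta-core jar was downloaded. A minimal sketch of both checks follows. The helper names are hypothetical, and it assumes that Delta Lake 2.3.0 needs a matching delta-storage jar next to delta-core on the classpath, which is what the error message further down points to.

```python
def minor_version(version):
    """Return (major, minor) from a version string such as '3.4.1'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def versions_match(pyspark_version, spark_version):
    """True when the Python package and the cluster share major.minor."""
    return minor_version(pyspark_version) == minor_version(spark_version)


def missing_delta_jars(jar_names, delta_version="2.3.0", scala_version="2.12"):
    """List the Delta jars assumed to be required that are not in jar_names."""
    required = [
        f"delta-core_{scala_version}-{delta_version}.jar",
        f"delta-storage-{delta_version}.jar",  # assumption: needed since Delta 1.2
    ]
    present = set(jar_names)
    return [name for name in required if name not in present]


if __name__ == "__main__":
    # Values taken from the configuration listed in this post.
    print(versions_match("3.4.1", "3.3.0"))
    print(missing_delta_jars(["delta-core_2.12-2.3.0.jar"]))
```

On the cluster, `jar_names` could be filled with `os.listdir("/usr/lib/spark/jars")`, and the two version strings with `pyspark.__version__` and `spark.version`.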
Question: when I try to write a DataFrame in Delta format from the Jupyter notebook with
emp_details.write.format("delta").mode("overwrite").save(delta_path)
I run into the error below.
Error:
Py4JJavaError: An error occurred while calling o90.save.
: com.google.common.util.concurrent.ExecutionError: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.sql.delta.storage.DelegatingLogStore$
I also tried setting the environment variable below, but the write still fails.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages io.delta:delta-core_2.12:2.3.0 --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog pyspark-shell'
Error with the above PYSPARK_SUBMIT_ARGS on Jupyter:
Py4JJavaError: An error occurred while calling o84.save.
: com.google.common.util.concurrent.ExecutionError: java.lang.NoClassDefFoundError: io/delta/storage/LogStore
Please ensure that the delta-storage dependency is included.
If using Python, please ensure you call `configure_spark_with_delta_pip` or use
`--packages io.delta:delta-core_<scala-version>:<delta-lake-version>`.
See https://docs.delta.io/latest/quick-start.html#python.
More information about this dependency and how to include it can be found here:
https://docs.delta.io/latest/porting.html#delta-lake-1-1-or-below-to-delta-lake-1-2-or-above.
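The error message suggests `configure_spark_with_delta_pip`. A minimal sketch of that route is below, assuming delta-spark 2.3.0 is installed in the notebook's Python environment and that no SparkSession is already running in the kernel (`getOrCreate()` would otherwise return the existing session, and package settings would not take effect). `DELTA_CONF`, `build_delta_session`, and the app name are illustrative names, not part of any library.

```python
# Same two settings as in spark-defaults.conf above.
DELTA_CONF = {
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}


def build_delta_session(app_name="delta-test"):
    """Build a SparkSession with Delta enabled via the delta-spark pip package."""
    # Imported lazily so this module loads even where pyspark is not installed.
    from pyspark.sql import SparkSession
    from delta import configure_spark_with_delta_pip

    builder = SparkSession.builder.appName(app_name)
    for key, value in DELTA_CONF.items():
        builder = builder.config(key, value)

    # Adds the delta-core Maven coordinate matching the installed delta-spark
    # version to spark.jars.packages, so its dependencies are resolved as well.
    return configure_spark_with_delta_pip(builder).getOrCreate()
```

In the notebook this would be used as `spark = build_delta_session()` before creating `emp_details`. Similarly, `PYSPARK_SUBMIT_ARGS` is only read when the JVM is launched, so it has to be set before the first SparkSession or SparkContext is created.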
Update-1: I followed the setup instructions from https://delta.io/learn/getting-started/, but ran into the same error as above.
Update-2: I also added the delta-storage jar; now I run into a different error, shown below.