I am currently trying to write a Delta Lake parquet file to S3, which I replace locally with MinIO. I can read and write standard parquet files to S3 perfectly fine.
However, when I follow the delta lake example "Configure delta to s3", it seems I can't write the delta_log/ folder to my MinIO. So I tried to set fs.AbstractFileSystem.s3a.impl and fs.s3a.impl.
I am using pyspark[sql]==2.4.3 in my current venv.
src/.env:
# pyspark packages
DELTA = io.delta:delta-core_2.11:0.3.0
HADOOP_COMMON = org.apache.hadoop:hadoop-common:2.7.3
HADOOP_AWS = org.apache.hadoop:hadoop-aws:2.7.3
PYSPARK_SUBMIT_ARGS = ${HADOOP_AWS},${HADOOP_COMMON},${DELTA}
src/spark_session.py:
# configure s3 connection for read/write operation (native spark)
hadoop_conf = sc.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.endpoint", self.aws_endpoint_url)
hadoop_conf.set("fs.s3a.access.key", self.aws_access_key_id)
hadoop_conf.set("fs.s3a.secret.key", self.aws_secret_access_key)
# hadoop_conf.set("fs.AbstractFileSystem.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") # when using hadoop 2.8.5
# hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") # alternative to above hadoop 2.8.5
hadoop_conf.set("fs.s3a.path.style.access", "true")
hadoop_conf.set("spark.history.fs.logDirectory", 's3a://spark-logs-test/')
src/apps/raw_to_parquet.py:
# Trying to write pyspark dataframe to MinIO (S3)
raw_df.coalesce(1).write.format("delta").save(s3_url)
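To check whether a write actually succeeded, reading the table back through the delta format should work against the same path. A small sketch, assuming spark is the active SparkSession and s3_url is the same path as above:

# Read the Delta table back from the same S3/MinIO path to verify the write
check_df = spark.read.format("delta").load(s3_url)
check_df.show(5)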
bash:
# RUN CODE
spark-submit --packages $(PYSPARK_SUBMIT_ARGS) src/run_onlineretailer.py
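With the variables from src/.env expanded, that command resolves to roughly the following (written out only for clarity):

spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.3,org.apache.hadoop:hadoop-common:2.7.3,io.delta:delta-core_2.11:0.3.0 src/run_onlineretailer.py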
Error with hadoop-common:2.7.3, hadoop-aws:2.7.3:
: java.lang.RuntimeException: java.lang.NoSuchMethodException: org.apache.hadoop.fs.s3a.S3AFileSystem.<init>(java.net.URI, org.apache.hadoop.conf.Configuration)
With this error I then updated to hadoop-common:2.8.5, hadoop-aws:2.8.5 to fix the NoSuchMethodException, because delta needs S3AFileSystem. That produced a new error:
py4j.protocol.Py4JJavaError: An error occurred while calling o89.save.
: java.lang.NoSuchMethodError: org.apache.hadoop.security.ProviderUtils.excludeIncompatibleCredentialProviders(Lorg/apache/hadoop/conf/Configuration;Ljava/lang/Class;)Lorg/apache/hadoop/conf/Configuration
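Both errors look to me like a mismatch between the hadoop-aws/hadoop-common versions pulled in via --packages and the Hadoop classes already bundled with the Spark 2.4.3 distribution. A quick diagnostic sketch to see which Hadoop version Spark itself ships with, assuming a running SparkSession named sc:

# Print the Hadoop version bundled with the running Spark distribution
print(sc.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())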
So it seems to me that the parquet files can be written without a problem; however, delta creates the delta_log folder, which cannot be recognised (I think?). Current source code. I have read several similar questions, but nobody seems to have tried this with delta lake files.
UPDATE
It currently works with these settings:
# pyspark packages
DELTA_LOGSTORE = spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore
DELTA = io.delta:delta-core_2.11:0.3.0
HADOOP_COMMON = org.apache.hadoop:hadoop-common:2.7.7
HADOOP_AWS = org.apache.hadoop:hadoop-aws:2.7.7
PYSPARK_SUBMIT_ARGS = ${HADOOP_AWS},${HADOOP_COMMON},${DELTA}
PYSPARK_CONF_ARGS = ${DELTA_LOGSTORE}
# configure s3 connection for read/write operation (native spark)
hadoop_conf = sc.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.endpoint", self.aws_endpoint_url)
hadoop_conf.set("fs.s3a.access.key", self.aws_access_key_id)
hadoop_conf.set("fs.s3a.secret.key", self.aws_secret_access_key)
spark-submit --packages $(PYSPARK_SUBMIT_ARGS) --conf $(PYSPARK_CONF_ARGS) src/run_onlineretailer.py
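Expanded, the working invocation is equivalent to (again just spelled out for clarity):

spark-submit \
  --packages org.apache.hadoop:hadoop-aws:2.7.7,org.apache.hadoop:hadoop-common:2.7.7,io.delta:delta-core_2.11:0.3.0 \
  --conf spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore \
  src/run_onlineretailer.py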
The weird thing is that it only works like this. If I try to set it with sc.conf or hadoop_conf it does not work; see the commented-out lines in the code below:
def spark_init(self) -> SparkSession:
    sc: SparkSession = SparkSession \
        .builder \
        .appName(self.app_name) \
        .config("spark.sql.warehouse.dir", self.warehouse_location) \
        .getOrCreate()

    # set log level
    sc.sparkContext.setLogLevel("WARN")

    # Enable Arrow-based columnar data transfers
    sc.conf.set("spark.sql.execution.arrow.enabled", "true")
    # sc.conf.set("spark.delta.logStore.class", "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore")  # does not work

    # configure s3 connection for read/write operation (native spark)
    hadoop_conf = sc.sparkContext._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3a.endpoint", self.aws_endpoint_url)
    hadoop_conf.set("fs.s3a.access.key", self.aws_access_key_id)
    hadoop_conf.set("fs.s3a.secret.key", self.aws_secret_access_key)
    # hadoop_conf.set("spark.delta.logStore.class", "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore")  # does not work

    return sc
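For comparison, the only other place I would expect this setting to take effect is on the builder itself, before getOrCreate() is called, since the SparkContext configuration is fixed once the context exists. This is a sketch of that variant and an assumption on my part, not something I have verified:

# Sketch: pass the Delta log store class at builder time, before the
# SparkContext is created, instead of via sc.conf / hadoop_conf afterwards.
sc: SparkSession = SparkSession \
    .builder \
    .appName(self.app_name) \
    .config("spark.sql.warehouse.dir", self.warehouse_location) \
    .config("spark.delta.logStore.class",
            "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore") \
    .getOrCreate()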
If somebody can explain this, it would be great. Is it because of .getOrCreate()? It seems impossible to set the conf without this call, except on the command line when running the application.