
I am currently trying to write a Delta Lake parquet table to S3, which I replace with MinIO locally.

I can read/write standard parquet files to S3 perfectly fine.

However, when I follow the Delta Lake example ("Configure delta to s3"), it seems I can't write the _delta_log/ directory to my MinIO.

So I tried setting fs.AbstractFileSystem.s3a.impl and fs.s3a.impl.

I am using pyspark[sql]==2.4.3 in my current venv.

src/.env:

# pyspark packages
DELTA = io.delta:delta-core_2.11:0.3.0
HADOOP_COMMON = org.apache.hadoop:hadoop-common:2.7.3
HADOOP_AWS = org.apache.hadoop:hadoop-aws:2.7.3
PYSPARK_SUBMIT_ARGS = ${HADOOP_AWS},${HADOOP_COMMON},${DELTA}
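
For context, this variable is only the comma-separated coordinate list handed to --packages below. A minimal sketch (not part of the original project; the app name is a placeholder) of doing the same thing from Python via the standard spark.jars.packages property:

# Sketch: same Maven coordinates supplied from Python via spark.jars.packages
# (must be set before the session/JVM is created); app name is a placeholder.
from pyspark.sql import SparkSession

packages = ",".join([
    "org.apache.hadoop:hadoop-aws:2.7.3",
    "org.apache.hadoop:hadoop-common:2.7.3",
    "io.delta:delta-core_2.11:0.3.0",
])

spark = SparkSession.builder \
    .appName("delta-minio-packages-demo") \
    .config("spark.jars.packages", packages) \
    .getOrCreate()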

src/spark_session.py:

# configure s3 connection for read/write operation (native spark)
hadoop_conf = sc.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.endpoint", self.aws_endpoint_url)
hadoop_conf.set("fs.s3a.access.key", self.aws_access_key_id)
hadoop_conf.set("fs.s3a.secret.key", self.aws_secret_access_key)
# hadoop_conf.set("fs.AbstractFileSystem.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")  #  when using hadoop 2.8.5
# hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")  #  alternative to above hadoop 2.8.5
hadoop_conf.set("fs.s3a.path.style.access", "true")
hadoop_conf.set("spark.history.fs.logDirectory", 's3a://spark-logs-test/')

src/apps/raw_to_parquet.py

# Trying to write pyspark dataframe to MinIO (S3)

raw_df.coalesce(1).write.format("delta").save(s3_url)
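
If the write goes through, the table should also be readable from the same path; a short sanity check (assuming spark is the active SparkSession and s3_url is the same path as above):

# Sanity check: read the Delta table back from the same S3A/MinIO path
delta_df = spark.read.format("delta").load(s3_url)
delta_df.show(5)
print(delta_df.count())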


bash:

# RUN CODE
spark-submit --packages $(PYSPARK_SUBMIT_ARGS) src/run_onlineretailer.py

Error with hadoop-common: 2.7.3, hadoop-aws: 2.7.3: java.lang.RuntimeException: java.lang.NoSuchMethodException: org.apache.hadoop.fs.s3a.S3AFileSystem.<init>(java.net.URI, org.apache.hadoop.conf.Configuration)

With this error, I then updated to hadoop-common: 2.8.5 and hadoop-aws: 2.8.5 to fix the NoSuchMethodException, because Delta needs S3AFileSystem. That led to a new error:

py4j.protocol.Py4JJavaError: An error occurred while calling o89.save. : java.lang.NoSuchMethodError: org.apache.hadoop.security.ProviderUtils.excludeIncompatibleCredentialProviders(Lorg/apache/hadoop/conf/Configuration;Ljava/lang/Class;)Lorg/apache/hadoop/conf/Configuration

So to me it seems the parquet files themselves can be written without a problem; however, Delta creates the _delta_log/ folder, which apparently cannot be written (I think?).
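
One way to confirm whether any _delta_log/ objects were actually created in MinIO is to list that prefix directly; a sketch using boto3 (boto3, the endpoint, bucket name, and credentials here are assumptions, not part of the original setup):

import boto3

# List whatever Delta managed to write under the table's _delta_log/ prefix
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",      # assumed MinIO endpoint
    aws_access_key_id="minio-access-key",      # placeholder
    aws_secret_access_key="minio-secret-key",  # placeholder
)
resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="my-table/_delta_log/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])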

Current source code.

I have read several similar questions, but nobody seems to have tried this with Delta Lake files.

UPDATE

It currently works with these settings:

src/.env:

# pyspark packages
DELTA_LOGSTORE = spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore
DELTA = io.delta:delta-core_2.11:0.3.0
HADOOP_COMMON = org.apache.hadoop:hadoop-common:2.7.7
HADOOP_AWS = org.apache.hadoop:hadoop-aws:2.7.7
PYSPARK_SUBMIT_ARGS = ${HADOOP_AWS},${HADOOP_COMMON},${DELTA}
PYSPARK_CONF_ARGS = ${DELTA_LOGSTORE}

src/spark_session.py:

# configure s3 connection for read/write operation (native spark)
hadoop_conf = sc.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.endpoint", self.aws_endpoint_url)
hadoop_conf.set("fs.s3a.access.key", self.aws_access_key_id)
hadoop_conf.set("fs.s3a.secret.key", self.aws_secret_access_key)

bash:

spark-submit --packages $(PYSPARK_SUBMIT_ARGS) --conf $(PYSPARK_CONF_ARGS) src/run_onlineretailer.py

The weird thing is that it only works like this.

If I try to set it with sc.conf or hadoop_conf, it does not work; see the commented-out lines below:

def spark_init(self) -> SparkSession:

    sc: SparkSession = SparkSession \
        .builder \
        .appName(self.app_name) \
        .config("spark.sql.warehouse.dir", self.warehouse_location) \
        .getOrCreate()

    # set log level
    sc.sparkContext.setLogLevel("WARN")

    # Enable Arrow-based columnar data transfers
    sc.conf.set("spark.sql.execution.arrow.enabled", "true")

    # sc.conf.set("spark.delta.logStore.class", "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore") # does not work

    # configure s3 connection for read/write operation (native spark)
    hadoop_conf = sc.sparkContext._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3a.endpoint", self.aws_endpoint_url)
    hadoop_conf.set("fs.s3a.access.key", self.aws_access_key_id)
    hadoop_conf.set("fs.s3a.secret.key", self.aws_secret_access_key)
    #hadoop_conf.set("spark.delta.logStore.class", "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore") # does not work

    return sc

If somebody can explain this, it would be great. Is it because of .getOrCreate()? It seems impossible to set the conf without this call, except on the command line when running the application.
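
Presumably spark.delta.logStore.class (and spark.hadoop.* settings) are read from the SparkConf when the JVM-side context is created, so setting them after .getOrCreate() never reaches the Delta code. A sketch of supplying everything at build time instead (endpoint, keys, and app name are placeholders, not from the original project):

from pyspark.sql import SparkSession

# Sketch: pass the Delta log store and S3A settings while building the session,
# so they are in the SparkConf before the JVM context exists.
spark = SparkSession.builder \
    .appName("delta-minio-test") \
    .config("spark.delta.logStore.class", "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore") \
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000") \
    .config("spark.hadoop.fs.s3a.access.key", "minio-access-key") \
    .config("spark.hadoop.fs.s3a.secret.key", "minio-secret-key") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .getOrCreate()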

Thelin90
  • Can you try with a Spark package with Hadoop included? Also you need to put all conf calls, especially logStore, before SparkSession creation. See https://docs.delta.io/latest/delta-storage.html – fanfabbb Nov 14 '19 at 09:47

1 Answer


You are mixing hadoop-* JARs; just like the Spark ones, they only work if they are all from the same release.

stevel
  • Okay, but I don't see how I am mixing them, since I use the same version for both. Is it because when I import `pyspark` it comes with its own version of `hadoop`, and the `packages` I pass are not the same version? It works now if I pass the logStore as a `conf` parameter and set the `fs.*` endpoints with `sc.sparkContext._jsc.hadoopConfiguration()`. – Thelin90 Sep 10 '19 at 09:57
  • It would be quite helpful to point out where the mixed version is happening. – IzPrEE Jun 22 '20 at 22:05
  • I don't know where the mixed version is happening; I just recognise the symptoms of inconsistent JAR versions. This is a deployment/configuration problem, so, sadly, whoever sees the problem gets to fix it. Me: I'd run storediag or use some other way to locate the JARs hosting the conflicting classes: https://github.com/steveloughran/cloudstore – stevel Jun 24 '20 at 11:24
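
A lightweight way to check which Hadoop build the running session actually loaded (a sketch; spark is assumed to be an active SparkSession) is to ask Hadoop's own VersionInfo class through py4j:

# Print the Hadoop version of the classes Spark actually loaded; a mismatch
# with the --packages versions points to mixed JARs on the classpath.
print(spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())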