I am writing files to MinIO S3 using PySpark 3.1.2. I am using partitioning, so the data should be stored under batch_id directories, e.g.:

    s3a://0001/transactions/batch_id=1
    s3a://0001/transactions/batch_id=2

etc.

Everything works perfectly fine when writing to the local file system.

However, when I am writing to S3 with the partitioned committer (https://hadoop.apache.org/docs/r3.1.1/hadoop-aws/tools/hadoop-aws/committers.html) and the option "partitionOverwriteMode" = "static", e.g.:

    data_frame.write.mode("overwrite").partitionBy("batch_id").orc(output_path)

the whole path, including "transactions", is overwritten (instead of only the given partition being overwritten).
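
For reference, a minimal sketch of the two usual ways this overwrite mode is applied, assuming the spark_session, data_frame, and output_path from this question; the "dynamic" value appears only to contrast with the "static" value used here, not as a claim that it resolves the issue:

    # Session-wide setting (sketch): "dynamic" limits the overwrite to the
    # partitions present in the incoming data, "static" clears the whole path
    spark_session.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    # Or per write, via the equivalent DataFrameWriter option
    (
        data_frame.write
        .mode("overwrite")
        .option("partitionOverwriteMode", "dynamic")
        .partitionBy("batch_id")
        .orc(output_path)
    )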

Settings:

    # Configure the S3A filesystem and committer on the Hadoop configuration
    hadoop_conf = spark_session.sparkContext._jsc.hadoopConfiguration()

    # Use the S3A filesystem implementation for s3a:// paths
    hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    # Path-style access (required by MinIO)
    hadoop_conf.set("fs.s3a.path.style.access", "true")
    # Enable magic committer support in the S3A filesystem
    hadoop_conf.set("fs.s3a.committer.magic.enabled", "true")
    # Use the partitioned staging committer
    hadoop_conf.set("fs.s3a.committer.name", "partitioned")
    # On conflict, replace the contents of the destination partition
    hadoop_conf.set("fs.s3a.committer.staging.conflict-mode", "replace")
    # Abort multipart uploads left behind by failed jobs
    hadoop_conf.set("fs.s3a.committer.staging.abort.pending.uploads", "true")
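
The same Hadoop options could also be supplied when the session is built, using the spark.hadoop.* prefix. A sketch, assumed equivalent to the hadoopConfiguration() calls above (the app name and builder chain are illustrative only):

    from pyspark.sql import SparkSession

    spark_session = (
        SparkSession.builder
        .appName("transactions-writer")  # hypothetical app name
        .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
        .config("spark.hadoop.fs.s3a.path.style.access", "true")
        .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
        .config("spark.hadoop.fs.s3a.committer.name", "partitioned")
        .config("spark.hadoop.fs.s3a.committer.staging.conflict-mode", "replace")
        .config("spark.hadoop.fs.s3a.committer.staging.abort.pending.uploads", "true")
        .getOrCreate()
    )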

Update: when saving in append mode, the data in the given partition is not replaced; new files are appended to it instead. – Mateusz Nov 17 '21 at 16:12

1 Answer

So I added one more jar:

spark-hadoop-cloud_2.13-3.2.0.jar

and followed the Spark cloud integration guide (https://spark.apache.org/docs/latest/cloud-integration.html), which boiled down to adding:

"spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2"
"spark.sql.sources.commitProtocolClass", "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol"
"spark.sql.parquet.output.committer.class", "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter"

I also switched back to Parquet. Now I am able to overwrite a partition without overwriting the whole path.
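
For completeness, a sketch of the final write, using the same data_frame and output_path as in the question:

    # With the cloud committer bindings above in place, this is the write
    # that, per this answer, replaces only the targeted batch_id partition
    # rather than the whole "transactions" path.
    data_frame.write.mode("overwrite").partitionBy("batch_id").parquet(output_path)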
