I am writing files into MinIO (S3) using PySpark 3.1.2. I am using partitioning so that data is stored per batch_id, e.g.:
s3a://0001/transactions/batch_id=1
s3a://0001/transactions/batch_id=2
etc.
Everything works perfectly fine when writing to local file system.
However, when I am writing to S3 with the partitioned committer (https://hadoop.apache.org/docs/r3.1.1/hadoop-aws/tools/hadoop-aws/committers.html) and the option "partitionOverwriteMode" = "static", e.g.:
data_frame.write.mode("overwrite").partitionBy("batch_id").orc(output_path)
the whole path, including "transactions", is overwritten instead of only the given partition being replaced.
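
In case it matters, the overwrite mode is applied at the session level; a minimal sketch, assuming the config key spark.sql.sources.partitionOverwriteMode (as far as I know the per-write .option("partitionOverwriteMode", "static") is equivalent):

# Session-level partition overwrite mode; "static" is also Spark's default.
spark_session.conf.set("spark.sql.sources.partitionOverwriteMode", "static")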
Settings:
hadoop_conf = spark_session.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.path.style.access", "true")
hadoop_conf.set("fs.s3a.committer.magic.enabled", "true")
hadoop_conf.set("fs.s3a.committer.name", "partitioned")
hadoop_conf.set("fs.s3a.committer.staging.conflict-mode", "replace")
hadoop_conf.set("fs.s3a.committer.staging.abort.pending.uploads", "true")
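
For completeness, a minimal end-to-end sketch of how the pieces fit together; the endpoint, credentials, and sample data below are placeholders (my real values differ), and the committer settings are the ones listed above:

from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName("transactions-writer").getOrCreate()  # placeholder app name

# MinIO connection settings (placeholders), in addition to the committer settings above.
hadoop_conf = spark_session.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.endpoint", "http://minio:9000")   # placeholder MinIO endpoint
hadoop_conf.set("fs.s3a.access.key", "minio-access-key")  # placeholder credential
hadoop_conf.set("fs.s3a.secret.key", "minio-secret-key")  # placeholder credential

# Static partition overwrite mode, as described above.
spark_session.conf.set("spark.sql.sources.partitionOverwriteMode", "static")

# Placeholder data: two batches that should land in separate partition directories.
data_frame = spark_session.createDataFrame(
    [(1, "tx-a"), (2, "tx-b")], ["batch_id", "payload"]
)

output_path = "s3a://0001/transactions"
data_frame.write.mode("overwrite").partitionBy("batch_id").orc(output_path)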