
I am running a PySpark script where I'm saving off some data to an S3 bucket each time the script is run, and I have this code:

data.repartition(1).write.mode("overwrite").format("parquet").partitionBy('time_key').save( "s3://path/to/directory")

It is partitioned by time_key, but at each run the latest data dump overwrites the previous data instead of adding a new partition. The time_key is unique to each run.

Is this the correct code if I want to write the data to s3 and partition by time key at each run?


1 Answer


If you are on Spark version 2.3 or later, then this issue has been fixed via https://issues.apache.org/jira/browse/SPARK-20236.

You have to set spark.sql.sources.partitionOverwriteMode to "dynamic" so that overwrite only replaces the specific partitions present in the incoming data.
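A minimal sketch, assuming your script already has (or can build) a SparkSession, and reusing the path and column name from your snippet:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Overwrite only the partitions present in the incoming data (Spark 2.3+),
# leaving earlier time_key partitions in place.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

(data.repartition(1)
     .write
     .mode("overwrite")
     .format("parquet")
     .partitionBy("time_key")
     .save("s3://path/to/directory"))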

Also, since you say time_key is unique for each run, you could probably just use append mode instead.
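With append mode, each run simply adds a new partition alongside the existing ones (same path and column as above):

# Append a new time_key partition on each run; existing partitions are untouched.
(data.repartition(1)
     .write
     .mode("append")
     .format("parquet")
     .partitionBy("time_key")
     .save("s3://path/to/directory"))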
