
I am running a PySpark script where I'm saving off some data to an S3 bucket each time the script is run, and I have this code:

data.repartition(1).write.mode("overwrite").format("parquet").partitionBy('time_key').save( "s3://path/to/directory")

It is partitioned by time_key, but at each run the latest data dump overwrites the previous data instead of adding a new partition. The time_key is unique to each run.

Is this the correct code if I want to write the data to s3 and partition by time key at each run?


1 Answer


If you are on Spark version 2.3 or later, then this issue has been fixed via https://issues.apache.org/jira/browse/SPARK-20236.

You have to set spark.sql.sources.partitionOverwriteMode to "dynamic" so that overwrite only replaces the specific partitions present in the incoming data.
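A minimal sketch, assuming your script already has (or can build) a SparkSession, and reusing the path and column name from your snippet:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Overwrite only the partitions present in the incoming data (Spark 2.3+),
# leaving earlier time_key partitions in place.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

(data.repartition(1)
     .write
     .mode("overwrite")
     .format("parquet")
     .partitionBy("time_key")
     .save("s3://path/to/directory"))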

Also, since you say time_key is unique for each run, you could probably just use append mode instead.
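With append mode, each run simply adds a new partition alongside the existing ones (same path and column as above):

# Append a new time_key partition on each run; existing partitions are untouched.
(data.repartition(1)
     .write
     .mode("append")
     .format("parquet")
     .partitionBy("time_key")
     .save("s3://path/to/directory"))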
