I am working on a Spark job on EMR. My requirement is to read data from S3 every hour, apply the same flatten transformation, and save the latest state of the data per machine-id. The same machine-id can appear again in the next hour's data, so I need to maintain the latest state for that machine. As we know, S3 is not a database and does not support update operations, so I created S3 partitions based on machine-id.
df
.repartition($"machineid")
.write
.mode("overwrite")
.partitionBy("machineid")
.parquet("s3a://bucket...")
When the job reads data for the same machine in the next run, it overwrites only that specific machine-id partition, using the property spark.sql.sources.partitionOverwriteMode=dynamic. Everything works fine. The only issue I face is that each partition contains only a few KB-sized files, and the partition count has grown beyond 30K, which creates a performance issue in Athena.
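For reference, this is a minimal sketch of how I enable the dynamic overwrite (assuming it is set on the SparkSession; it can equally be passed via --conf on spark-submit):

// Overwrite only the partitions present in the incoming batch,
// instead of truncating the whole output path.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

// Equivalent spark-submit flag:
// --conf spark.sql.sources.partitionOverwriteMode=dynamic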
I read somewhere about Delta Lake. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. So I am planning to move the Parquet output to Delta format.
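To make the intent concrete, this is roughly the upsert I have in mind using the Delta Lake DeltaTable API. It is only a sketch under assumptions: the paths are placeholders, hourlyDf stands for my flattened hourly batch, and the first write is unpartitioned, which is one of the points I am unsure about.

import io.delta.tables.DeltaTable

// Placeholder paths, not my real bucket layout.
val deltaPath = "s3a://bucket/machine-state-delta"
// Flattened batch for the current hour (placeholder read).
val hourlyDf = spark.read.parquet("s3a://bucket/raw/current-hour")

if (DeltaTable.isDeltaTable(spark, deltaPath)) {
  // Upsert: keep only the latest state per machineid.
  DeltaTable.forPath(spark, deltaPath)
    .as("target")
    .merge(hourlyDf.as("source"), "target.machineid = source.machineid")
    .whenMatched().updateAll()
    .whenNotMatched().insertAll()
    .execute()
} else {
  // First run: create the table (unpartitioned in this sketch).
  hourlyDf.write.format("delta").save(deltaPath)
}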
My questions are:
What is the best way to upsert into S3 using Spark? Should I partition by that unique column, or is there another way to solve this problem?
Should I keep the same partitioning structure (machine-id), or save the data unpartitioned on S3 and let the Delta format's update/merge operation take care of the latest state of each machine?
If the machine-id partition structure is the right approach, won't it still cause performance issues when the partition count goes beyond 50K?