I am working on a Spark job on EMR. My requirement is to read data from S3 every hour, apply the same flatten transformation, and save the latest state of the data per machine-id. The same machine-id can appear again in the next hour's data, so I need to maintain the latest state for that machine. As we know, S3 is not a database and does not support update operations, so I created S3 partitions based on machine-id.
df
.repartition($"machineid")
.write
.mode("overwrite")
.partitionBy("machineid")
.parquet("s3a://bucket...")
When the job reads data for the same machine in the next run, it overwrites only that specific machine-id partition, using the property spark.sql.sources.partitionOverwriteMode=dynamic. Everything works fine. The only issue I face is that each partition contains only a few KB-sized files, and the partition count has grown beyond 30K, which creates a performance issue in Athena.
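For reference, this is a minimal sketch of how I enable the dynamic overwrite (assuming it is set on the SparkSession; it can equally be passed via --conf on spark-submit):

// Overwrite only the partitions present in the incoming batch,
// instead of truncating the whole output path.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

// Equivalent spark-submit flag:
// --conf spark.sql.sources.partitionOverwriteMode=dynamic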
I read somewhere about Delta Lake. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. So I am planning to move the Parquet output to Delta format.
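To make the intent concrete, this is roughly the upsert I have in mind using the Delta Lake DeltaTable API. It is only a sketch under assumptions: the paths are placeholders, hourlyDf stands for my flattened hourly batch, and the first write is unpartitioned, which is one of the points I am unsure about.

import io.delta.tables.DeltaTable

// Placeholder paths, not my real bucket layout.
val deltaPath = "s3a://bucket/machine-state-delta"
// Flattened batch for the current hour (placeholder read).
val hourlyDf = spark.read.parquet("s3a://bucket/raw/current-hour")

if (DeltaTable.isDeltaTable(spark, deltaPath)) {
  // Upsert: keep only the latest state per machineid.
  DeltaTable.forPath(spark, deltaPath)
    .as("target")
    .merge(hourlyDf.as("source"), "target.machineid = source.machineid")
    .whenMatched().updateAll()
    .whenNotMatched().insertAll()
    .execute()
} else {
  // First run: create the table (unpartitioned in this sketch).
  hourlyDf.write.format("delta").save(deltaPath)
}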
My questions are:
What is the best way to upsert into S3 using Spark? Should I partition by that unique column, or is there another way to solve this problem?
Should I keep the same partitioning structure (machine-id), or save the data unpartitioned on S3 and let the Delta format's update/merge operation take care of the latest state of each machine?
If the machine-id partition structure is the right approach, won't it still cause performance issues when the partition count goes beyond 50K?