
I am working on a Spark job on EMR. My requirement is to read data from S3 every hour, apply the same flatten transformation, and save the latest state of the data per machine-id. The same machine-id can appear again in the next hour, so I need to maintain the latest state of each machine. Since S3 is not a database and does not support update operations, I partitioned the output on S3 by machine-id.

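    // The $"..." column syntax needs: import spark.implicits._
    df
      .repartition($"machineid")    // group each machine's rows into one task
      .write
      .mode("overwrite")
      .partitionBy("machineid")     // one S3 prefix per machine-id
      .parquet("s3a://bucket...")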

When the job reads data for the same machine in the next run, it overwrites only that machine-id's partition, using the property `spark.sql.sources.partitionOverwriteMode=dynamic`. Everything works fine. The only issue is that each partition contains files of only a few KB, and the partition count goes beyond 30K, which causes performance issues in Athena.
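For reference, the property mentioned above can be set on the session before the write. This is a minimal sketch; the app name is hypothetical, and the session setup is an assumption since it is not shown in the question.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: the app name below is a placeholder.
val spark = SparkSession.builder()
  .appName("machine-state-job")
  .getOrCreate()

// With "dynamic", mode("overwrite") replaces only the partitions that
// appear in the DataFrame being written; other partitions are left alone.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
```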

I read about Delta Lake, which provides ACID transactions, scalable metadata handling, and unified streaming and batch data processing. So I am planning to move from Parquet output to Delta output.

My questions are:

  1. What is the best way to upsert in S3 using Spark? Should I partition by the unique column, or is there another way to solve this problem?

  2. Should I keep the same partitioning structure (machine-id), or save the data on S3 without partitions and let the Delta format's update operation take care of the latest state of each machine?

  3. If the machine-id partition structure is correct, will it not cause performance issues when the partition count goes beyond 50K?
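To make the second option concrete, here is a rough sketch of what an upsert keyed on machine-id might look like with Delta Lake's merge API. It assumes the Delta Lake library is on the classpath, the session is configured for Delta, and a Delta table already exists at `tablePath`. The function name `upsertLatest` and its parameters are hypothetical. The question does not show a timestamp column, so `dropDuplicates` here keeps an arbitrary row per machine-id within a batch rather than a guaranteed latest one.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: merges one hourly batch into an existing Delta table.
def upsertLatest(spark: SparkSession, updates: DataFrame, tablePath: String): Unit = {
  // A merge fails if several source rows match the same target row,
  // so keep a single row per machineid in the incoming batch.
  val deduped = updates.dropDuplicates("machineid")

  DeltaTable.forPath(spark, tablePath)
    .as("t")
    .merge(deduped.as("s"), "t.machineid = s.machineid")
    .whenMatched().updateAll()     // machine seen before: replace its row
    .whenNotMatched().insertAll()  // new machine: insert its row
    .execute()
}
```

With this approach the table does not need to be partitioned by machine-id for the upsert to work, since the merge condition identifies the rows to replace.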

Nikunj Kakadiya
lucy
  • Is machineid an index of your query? If yes, then you can add it to the partition index in Athena. See [this](https://aws.amazon.com/ko/blogs/big-data/improve-amazon-athena-query-performance-using-aws-glue-data-catalog-partition-indexes/) – Lamanus Nov 30 '21 at 03:32
  • No, I used machineid as the partition, so that I can overwrite that partition in case data for the same machineid comes in the next run. – lucy Nov 30 '21 at 04:52
  • Do you need the previous hours' data, i.e. historical data, or do you just want the latest hour's data and can delete the previous hours' data? – Nikunj Kakadiya Dec 01 '21 at 10:27
  • I do not care about old data. I need to maintain the latest state of the machine data, which has a unique key, machine-id. That means I need the latest data per machine-id. – lucy Dec 02 '21 at 05:25
