I am trying to learn Apache Iceberg to build a data lake. We have late-arriving data, and the data is partitioned on a date column. A Spark job will transform the incoming data into Iceberg format.

Consider a scenario where the ingestion pipeline fails midway. Since dynamic partition overwrite is enabled, a rerun of the task will overwrite only the partitions that were created in the previous run. This works if there is no late-arriving data.

But consider a situation where today's batch contains data for the previous day, and the partition for that day was already created when I ingested data in yesterday's run. If today's run (which contains the late-arriving data) fails and is rerun, then because dynamic partition overwrite is set, the partition that was created yesterday will also be rewritten. Is there a better way to handle dynamic partition overwrite so that reruns stay idempotent and late-arriving data is handled correctly?
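To make the scenario concrete, here is a minimal pure-Python sketch of how I understand dynamic partition overwrite to behave. It is a toy model, not Spark or Iceberg code: the `dynamic_overwrite` function, the row layout, and the dates are all made up for illustration.

```python
from collections import defaultdict


def dynamic_overwrite(table, batch):
    """Toy model of dynamic partition overwrite (not the Iceberg API).

    Every partition that appears in the batch is replaced wholesale;
    partitions that do not appear in the batch are left untouched.
    """
    incoming = defaultdict(list)
    for row in batch:
        incoming[row["date"]].append(row)
    result = dict(table)      # copy so the input table is not mutated
    result.update(incoming)   # replace only the partitions present in the batch
    return result


# Yesterday's run writes two rows into the 2024-01-01 partition.
table = dynamic_overwrite({}, [
    {"date": "2024-01-01", "id": 1},
    {"date": "2024-01-01", "id": 2},
])

# Today's batch holds today's data plus one late-arriving row for yesterday.
batch = [
    {"date": "2024-01-02", "id": 3},
    {"date": "2024-01-01", "id": 4},  # late-arriving row
]

after_first = dynamic_overwrite(table, batch)
after_rerun = dynamic_overwrite(after_first, batch)  # rerun after a failure

# Ids that survive in yesterday's partition after today's run.
yesterday_ids = [row["id"] for row in after_rerun["2024-01-01"]]
```

In this model the rerun is idempotent (`after_rerun` equals `after_first`), but yesterday's partition now holds only the late-arriving row, and the rows with ids 1 and 2 from yesterday's run are gone. That loss is the behavior I want to avoid.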

Rohit Anil

0 Answers