Dynamic Partition Overwrite in Apache Iceberg

Asked Jun 21 '23 at 16:18

Active Jun 21 '23 at 17:08

Viewed 99 times

I am learning Apache Iceberg in order to build a data lake. Our data arrives late, and the table is partitioned on a date column. A Spark job will transform the incoming data and write it to the Iceberg table.

Consider a scenario where the ingestion pipeline fails midway. Because dynamic partition overwrite is enabled, a rerun of the task overwrites only the partitions that were written in the failed run. This works as long as there is no late-arriving data.

Now consider the case where today's batch also contains late-arriving data for the previous day, whose partition was already created during yesterday's run. If today's run fails and is rerun with dynamic partition overwrite enabled, the partition that was created yesterday also gets rewritten.

Is there a better way to handle dynamic partition overwrite so that the job stays idempotent and also handles late-arriving data correctly?
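To make the scenario concrete, here is a minimal plain-Python sketch of the overwrite semantics described above. It does not use Spark or Iceberg at all: the `dynamic_overwrite` function, the in-memory `table` dictionary, and the sample rows are hypothetical stand-ins, and the model assumes that a dynamic overwrite replaces every partition that has at least one row in the incoming batch and leaves all other partitions untouched.

```python
from collections import defaultdict


def dynamic_overwrite(table, batch):
    """Hypothetical model of a dynamic partition overwrite.

    Every partition that has at least one row in `batch` is replaced
    wholesale; partitions absent from `batch` are left as they were.
    """
    rows_by_partition = defaultdict(list)
    for row in batch:
        rows_by_partition[row["date"]].append(row)

    result = dict(table)  # shallow copy so the input table is not mutated
    for partition, rows in rows_by_partition.items():
        result[partition] = rows
    return result


# Yesterday's run: two rows land in the 2023-06-20 partition.
day1_batch = [
    {"date": "2023-06-20", "id": 1},
    {"date": "2023-06-20", "id": 2},
]
table = dynamic_overwrite({}, day1_batch)

# Today's batch: one row for today plus one late-arriving row for yesterday.
day2_batch = [
    {"date": "2023-06-21", "id": 3},
    {"date": "2023-06-20", "id": 4},  # late-arriving row
]
after_first_attempt = dynamic_overwrite(table, day2_batch)

# A rerun of the same batch gives the same table, so the write is idempotent.
after_rerun = dynamic_overwrite(after_first_attempt, day2_batch)

# But yesterday's partition now holds only the late row; ids 1 and 2 are gone.
ids_in_yesterday = sorted(row["id"] for row in after_rerun["2023-06-20"])
```

Under this model the rerun is safe to repeat, but the 2023-06-20 partition ends up containing only the late-arriving row, which is the conflict between idempotency and late-arriving data that the question is about.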

edited Jun 21 '23 at 17:08

Rohit Anil

0 Answers