Version control of big data tables (Iceberg)

Question

I'm building Iceberg tables on top of a data lake. These tables are used by reporting tools. I'm trying to figure out the best way to version-control and deploy changes to these tables in a CI/CD process. For example, I would like to add a column to an Iceberg table. To do that, I have to write an ALTER TABLE statement, save it to the git repository, and deploy it via the CI/CD pipeline. The tables are accessible via the AWS Glue Catalog. I couldn't find much information about this on Google, so if anyone could share some knowledge, it would be much appreciated.

Cheers.


score 1 · Answer 1 · answered Nov 01 '22 at 07:04

Agree with @Fokko Driesprong. This is a supplement only. Sometimes table changes are treated as part of a task's version change. That is, the table change statements (ALTER TABLE) are bound to task upgrades. When tasks are deployed automatically, the deployment usually executes the table change statement first and then deploys the new task. If the change is disruptive, we need to stop the old task first and then deploy the new one. For each upgrade we also keep a rollback script, which of course includes the corresponding table change statement to undo it.
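To make the upgrade/rollback pairing concrete, here is a minimal Python sketch. Everything named in it is an assumption for illustration: the `Migration` class, the `upgrade` and `rollback` helpers, and the table `glue_catalog.reporting.orders` with its `discount` column are hypothetical. How the currently deployed version is tracked is left to the pipeline. The `execute` argument is any callable that runs a SQL string, for example `spark.sql` in a Spark session configured for Iceberg.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Migration:
    """One versioned table change with its matching rollback statement."""
    version: int
    upgrade: str
    rollback: str


# Hypothetical migration list, kept in git next to the task code.
MIGRATIONS: List[Migration] = [
    Migration(
        version=1,
        upgrade="ALTER TABLE glue_catalog.reporting.orders ADD COLUMN discount double",
        rollback="ALTER TABLE glue_catalog.reporting.orders DROP COLUMN discount",
    ),
]


def upgrade(execute: Callable[[str], object],
            migrations: List[Migration],
            current_version: int) -> int:
    """Run every upgrade statement newer than current_version, oldest first."""
    for m in sorted(migrations, key=lambda m: m.version):
        if m.version > current_version:
            execute(m.upgrade)
            current_version = m.version
    return current_version


def rollback(execute: Callable[[str], object],
             migrations: List[Migration],
             current_version: int,
             target_version: int) -> int:
    """Run rollback statements, newest first, until target_version is reached."""
    for m in sorted(migrations, key=lambda m: m.version, reverse=True):
        if target_version < m.version <= current_version:
            execute(m.rollback)
    return min(current_version, target_version)
```

In a CI/CD step, `upgrade(spark.sql, MIGRATIONS, current_version)` would run before the new task is deployed, and `rollback(...)` would run after the new task has been stopped.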

score 0 · Answer 2 · answered Oct 30 '22 at 18:42

Thanks for asking this question. I don't think there is a definitive way of doing this. In practice, I see most people bundling this as part of the job that writes to the Iceberg table. This way you can make sure that new columns are populated right away by the new version of the job. If you don't make any breaking changes (such as deleting a column), the downstream jobs won't break. Hope this helps!
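As a rough illustration of bundling the schema change with the writing job, the job can compare the columns it needs against the columns the table already has and add only the missing ones before it writes. This is a sketch under assumptions: the helper `missing_column_statements`, the table name, and the column names are hypothetical, and only additive (non-breaking) changes are handled.

```python
from typing import Dict, Iterable, List


def missing_column_statements(table: str,
                              existing: Iterable[str],
                              required: Dict[str, str]) -> List[str]:
    """Build ALTER TABLE statements for required columns the table lacks.

    `required` maps column name to SQL type. Names are compared
    case-insensitively.
    """
    present = {name.lower() for name in existing}
    return [
        f"ALTER TABLE {table} ADD COLUMN {name} {sql_type}"
        for name, sql_type in required.items()
        if name.lower() not in present
    ]


# Inside a Spark job this could be used roughly as follows
# (assumes an existing SparkSession `spark` and a DataFrame `df`):
#
#   table = "glue_catalog.reporting.orders"   # hypothetical table
#   for stmt in missing_column_statements(
#           table, spark.table(table).columns, {"discount": "double"}):
#       spark.sql(stmt)
#   df.writeTo(table).append()
```

Because the statements are only generated for columns that do not exist yet, rerunning the same job version does not attempt the same change twice.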
