
I have my data warehouse built on Amazon Redshift. The problem I am currently facing is that I have a huge fact table (about 500M rows) in my schema, holding data for about 10 clients. Processes periodically (mostly daily) generate data for this fact table and require a refresh, meaning: delete the old data and insert the newly generated data.
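For concreteness, the refresh is essentially the following (table, column, and staging names are just illustrative):

```sql
-- Illustrative refresh for one client; fact_sales, client_id, and the staging table are hypothetical names
BEGIN;

-- Remove the previously loaded rows for this client
DELETE FROM fact_sales
WHERE client_id = 42;

-- Insert the freshly generated data from a staging table
INSERT INTO fact_sales
SELECT * FROM stg_fact_sales_client_42;

COMMIT;
```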

The problem is that this bulk delete-insert operation leaves holes in my fact table, which then needs a VACUUM. That is time consuming and hence can't be done immediately. And the fact table (with huge holes due to the deleted data) dramatically slows down the snapshot process, which consumes data from the fact and dimension tables and refreshes the downstream presentation area. How can I optimize such a bulk data refresh in a DWH environment?

I believe this should be a well-known problem in DWH environments with some recommended ways to solve it. Can anyone please point out the recommended solution?

P.S.: One solution could be to create a table per client and put a view on top of them that does a UNION ALL of all the underlying tables. In that case, each per-client table is pretty small and can be vacuumed quickly after its delete-insert, but I am looking for solutions with better maintainability.
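A rough sketch of what I mean (all names are hypothetical):

```sql
-- Hypothetical per-client fact tables unified behind a view
CREATE TABLE fact_sales_client_1 (LIKE fact_sales);
CREATE TABLE fact_sales_client_2 (LIKE fact_sales);
-- ...one table per client

CREATE VIEW fact_sales_all AS
SELECT * FROM fact_sales_client_1
UNION ALL
SELECT * FROM fact_sales_client_2;
-- Each small table can be deleted from, reloaded, and vacuumed quickly on its own schedule
```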

Saurabh Agrawal
  • Too wide a question; it depends on the nature of the data and how much exactly you refresh (the entire table or a portion of it) – AlexYes Aug 24 '17 at 14:33
  • @AlexYes So consider that I have data for 10 clients in the table. There are jobs that generate data for these clients on different schedules (some run daily and some weekly). When they generate data, I do a "replace" on this table, which is delete-insert. Depending on the client, the data that I replace in one go can vary between 10% and 30% of the table size. Not sure what you mean by nature of data? – Saurabh Agrawal Aug 24 '17 at 17:30
  • By nature of data I mean the following: if your entities are immutable (for example, you count the number of pageviews per day), the aggregates produced for day N will be the same whether you calculate them on day N+1 or on day N+10, so you can stop refreshing the entire table and do append-only loads. If your entities are not completely immutable but stabilize after a certain period of time (for example, sales conversion by lead cohort, which mostly happens within 30 days of the lead), you can process data with a trailing window, i.e. reprocess only the last X days every time (a sketch of this follows below) – AlexYes Aug 24 '17 at 18:15
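A minimal sketch of that trailing-window idea, assuming an event_date column and a 30-day window (names and the window size are only assumptions):

```sql
-- Reprocess only the trailing window instead of the whole client's history
BEGIN;

-- Drop the rows inside the window that may still change
DELETE FROM fact_sales
WHERE client_id = 42
  AND event_date >= CURRENT_DATE - 30;

-- Reload just that window from the staging table
INSERT INTO fact_sales
SELECT *
FROM stg_fact_sales_client_42
WHERE event_date >= CURRENT_DATE - 30;

COMMIT;
```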

1 Answer


You might try to play with different types of VACUUM. There is VACUUM DELETE ONLY, which will reclaim the space but won't re-sort rows; I'm not sure if it's applicable to your use case.

More info here: http://docs.aws.amazon.com/redshift/latest/dg/t_Reclaiming_storage_space202.html
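For example (the table name is just a placeholder):

```sql
-- Reclaims space from deleted rows without re-sorting the table
VACUUM DELETE ONLY fact_sales;
```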

Alternatively, I have used the deep copy approach when I was struggling with vacuuming tables with too many columns. The downside is that you will need a lot of free space for the intermediate step.

More info here: http://docs.aws.amazon.com/redshift/latest/dg/performing-a-deep-copy.html
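A sketch of one deep-copy variant, roughly along the lines of what that page describes (table names are placeholders):

```sql
-- Deep copy: rebuild the table instead of vacuuming it, then swap names
-- (you need enough free disk space to hold both copies while this runs)
CREATE TABLE fact_sales_copy (LIKE fact_sales);

INSERT INTO fact_sales_copy
SELECT * FROM fact_sales;

DROP TABLE fact_sales;

ALTER TABLE fact_sales_copy RENAME TO fact_sales;
```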

Dolfa