Overhead of Redshift materialized view full REFRESH Vs Redshift VACUUM

Question

I’m working on a project where I want to have a fixed/consistent amount of data in terms of a specific time period (eg. 365 days). Which approach is more efficient and will cost lesser workload between the two?

Using a dedicated table and performing VACUUM to delete the oldest data
Using a materialized view and refresh it to remove the oldest data from the view (incremental refresh wont be possible in my case)

score 1 · Answer 1 · answered Sep 17 '22 at 03:19

1

There are a lot of variables that can affect which path to go down. Cluster size, table size, sortkey etc. can all come into play.

May I recommend a 3rd path?

You could populate "month" tables and combine them with a normal view. When a new month comes around a new table is created, the view is altered to include the new table and not the oldest, and drop the oldest table (or archived). The view can have a where clause that only looks back 365 days if that is important. There is no vacuum work required or refreshing of a materialized view. There is some coding in your ETL flow to recognize the need for a new table and a change to the UNIONing view.

answered Sep 17 '22 at 03:19

Bill Weiner

8,835
2
7
18

To be clear there would be 13, monthly tables to cover a 365 day period. Only the latest table, which has new data added to it, needs to be maintained. The others are static and cause no normal maintenance. Users access a view that makes their queries see a single reference that has all the needed data. If you need to maintain the older (<1yr) data then why do you delete rows. If the point is to limit the range that some users see then just a view with a where clause will do that. – Bill Weiner Sep 17 '22 at 15:15
Since the base table contains billions of rows, we’re looking at having several data sources with different time periods (eg 30d, 120d, 365d, 1095d). Then a user can choose to where he query from. Instead of having a single view/source, we introduce multiple ones to minimize the exec time of the query. – thox Sep 18 '22 at 02:22
If the table is sorted by time (first sortkey) then the blocks of the table have metadata that make it very efficient to select time ranges. Blocks of the table have metadata which will allow Redshift to eliminate data before scan. If you need to ensure that users only select against a range of the large table use views with WHERE clauses defined. – Bill Weiner Sep 18 '22 at 05:52

score 0 · Answer 2 · answered Jun 23 '23 at 08:04

0

Use your own table.

MVs under the hood are a disaster.

Investigation is here;

https://www.redshiftresearchproject.org/white_papers/downloads/materialized_views.html

answered Jun 23 '23 at 08:04

Max Ganz II

149
5

Overhead of Redshift materialized view full REFRESH Vs Redshift VACUUM

2 Answers2