I am quite new to Redshift but have considerable experience in the BI area, and I need help from an expert Redshift developer. Here's my situation: I have an external (S3) database added to Redshift. The underlying data changes very frequently, roughly every 15 minutes, and I will run many concurrent queries directly from Qlik Sense against this external DB. Since best practice says that Redshift + Spectrum works best when the smaller table resides in Redshift, I decided to move some calculated dimension tables into Redshift and leave the other, larger tables in S3. The challenge I have is whether materialized views or plain tables are better suited for this.
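For context, here is roughly what the setup looks like; the schema, catalog database, and IAM role below are simplified placeholders:

```sql
-- External (Spectrum) schema mapped to the S3 data via the Glue
-- catalog; database name and IAM role are placeholders.
CREATE EXTERNAL SCHEMA spectrum_db
FROM DATA CATALOG
DATABASE 'my_glue_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole';
```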
I have already tested both, with DISTSTYLE ALL and an appropriate SORTKEY, and the tests show that the MVs are faster; I just don't understand why. Given that the dimension tables are fairly small (< 3 million rows), I have the following questions:
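The two variants I compared look roughly like this (`dim_customer`, `customers_raw`, and the column names are placeholders):

```sql
-- Variant 1: local dimension table built from the external data.
CREATE TABLE dim_customer
DISTSTYLE ALL
SORTKEY (part_col, customer_key)
AS
SELECT customer_key, part_col, attr_1, attr_2
FROM spectrum_db.customers_raw;

-- Variant 2: materialized view over the same external data.
-- Note: MVs on external tables cannot use AUTO REFRESH, so this
-- one has to be refreshed manually or on a schedule.
CREATE MATERIALIZED VIEW mv_dim_customer
DISTSTYLE ALL
SORTKEY (part_col, customer_key)
AS
SELECT customer_key, part_col, attr_1, attr_2
FROM spectrum_db.customers_raw;
```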
- Should we use MVs and refresh them via a scheduled task, or use tables and refresh them with some form of ETL, perhaps via a stored procedure? (Sketches of both options follow this list.)
- If a table is used: I tried casting the varchar keys (heavily used in joins) to bigint to force AZ64 encoding, but queries perform worse than without the cast (where the encoding is LZO). Is this because the column is stored as varchar in the external DB? (An explicit-encoding version of this test is sketched below.)
- If an MV is used: I tried the same cast in the query behind the MV, but the encoding shows as NONE (I found this by inspecting the table Redshift creates behind the scenes; the catalog query I used is below). Moreover, even without the cast, most of the key columns used in joins have no encoding. Could this be the reason the MVs are faster than the tables? Shouldn't I expect the opposite, i.e. no encoding = worse performance?
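On the first bullet, the two refresh approaches I'm weighing look roughly like this (placeholder names again):

```sql
-- Option A: refresh the MV on a schedule, e.g. every 15 minutes
-- via the Redshift query scheduler or an external orchestrator.
REFRESH MATERIALIZED VIEW mv_dim_customer;

-- Option B: rebuild the local table from a stored procedure.
-- Note that TRUNCATE commits immediately inside a procedure.
CREATE OR REPLACE PROCEDURE refresh_dim_customer()
AS $$
BEGIN
    TRUNCATE dim_customer;
    INSERT INTO dim_customer
    SELECT customer_key, part_col, attr_1, attr_2
    FROM spectrum_db.customers_raw;
END;
$$ LANGUAGE plpgsql;

CALL refresh_dim_customer();
```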
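On the second bullet: as far as I know, CREATE TABLE AS doesn't accept ENCODE, so pinning AZ64 explicitly needs a CREATE TABLE plus INSERT, roughly like this (placeholder names and lengths):

```sql
-- Explicit column definitions so the key encoding is pinned to
-- AZ64 instead of being auto-assigned.
CREATE TABLE dim_customer_bigint (
    customer_key BIGINT ENCODE AZ64,
    part_col     VARCHAR(32) ENCODE LZO,
    attr_1       VARCHAR(256),
    attr_2       VARCHAR(256)
)
DISTSTYLE ALL
SORTKEY (part_col, customer_key);

-- The cast happens once at load time, not in every join.
INSERT INTO dim_customer_bigint
SELECT CAST(customer_key AS BIGINT), part_col, attr_1, attr_2
FROM spectrum_db.customers_raw;
```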
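On the third bullet, this is how I checked the encodings behind the MV (Redshift names the backing table `mv_tbl__<mv_name>__0`; its schema must be on the search_path for pg_table_def to return rows):

```sql
-- Inspect the column encodings of the MV's backing table.
SELECT "column", type, encoding, sortkey
FROM pg_table_def
WHERE tablename = 'mv_tbl__mv_dim_customer__0';
```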
Some more info: in S3 the data is stored as Parquet files with proper partitioning. In Redshift, the tables are sorted on the same columns as the S3 partitioning, plus a few more. All queries join on these columns in the same order and also filter on them in the WHERE clause, so the queries are well structured. A simplified example:
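```sql
-- Typical query shape from Qlik Sense: join the external fact
-- table to the local dimension on the partition/sort columns and
-- filter on them, so Spectrum can prune partitions.
-- fact_sales and amount are hypothetical names.
SELECT d.attr_1, SUM(f.amount) AS total_amount
FROM spectrum_db.fact_sales AS f
JOIN dim_customer AS d
  ON f.part_col = d.part_col
 AND f.customer_key = d.customer_key
WHERE f.part_col = '2024-01'
GROUP BY d.attr_1;
```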
Let me know if you need any other details.