redshift optimization: reduce i/o skew, or stick with dist key used in joins?

Question

Generally speaking, when there is a table whose dist key results in a high degree of I/O skew but is also used in join operations between that table and others (which are also distributed using the same key), is it better to choose a different dist key (or "even") to reduce the skew, even though this will result in more rows shifting between nodes during the join?

Or is it case-by-case and not possible to give a general answer?

What do you mean by "I/O skew"? Do you mean an uneven amount of data is stored on each slice because there are very few unique values used in that particular column? I don't think that `even` would improve things, because that probably increases the likelihood of JOINed data _not_ being on the same slice. — John Rotenstein, Aug 20 '19 at 03:54
Yes, that's what I meant. I think you're right. For the query joining on this column I'd need to redistribute both tables on a new column. Thing is, the skewed table is also used in other queries that join on a *different* column; using "even" would probably help those queries (since they're moving data around anyway) while hurting the original one. C'est la vie! — jph, Aug 21 '19 at 12:59

score 0 · Answer 1 · answered Aug 25 '19 at 14:33

Generally speaking you want to make sure that when using DISTKEY it is on a "unique"(not a date or some type of a dimension) field. This way you won't have stacking of the same type of hash on any particular node/slice(data skew).

It is definitely worth checking out how much of the data is skewed and what is the % by using the following query:

SELECT SCHEMA schemaname,
       "table" tablename,
       table_id tableid,
       size size_in_mb,
       CASE
         WHEN diststyle NOT IN ('EVEN','ALL') THEN 1
         ELSE 0
       END has_dist_key,
       CASE
         WHEN sortkey1 IS NOT NULL THEN 1
         ELSE 0
       END has_sort_key,
       CASE
         WHEN encoded = 'Y' THEN 1
         ELSE 0
       END has_col_encoding,
       CAST(max_blocks_per_slice - min_blocks_per_slice AS FLOAT) / GREATEST(NVL (min_blocks_per_slice,0)::int,1) ratio_skew_across_slices,
       CAST(100*dist_slice AS FLOAT) /(SELECT COUNT(DISTINCT slice) FROM stv_slices) pct_slices_populated
FROM svv_table_info ti
  JOIN (SELECT tbl,
               MIN(c) min_blocks_per_slice,
               MAX(c) max_blocks_per_slice,
               COUNT(DISTINCT slice) dist_slice
        FROM (SELECT b.tbl,
                     b.slice,
                     COUNT(*) AS c
              FROM STV_BLOCKLIST b
              GROUP BY b.tbl,
                       b.slice)
        WHERE tbl IN (SELECT table_id FROM svv_table_info)
        GROUP BY tbl) iq ON iq.tbl = ti.table_id;

Visit this analyze link for more details!

That being said, if you see a large skew % and you know that the DISTKEY is causing it then I would suggest removing that and using DISTSTYLE EVEN and potentially look into SORT KEYS.

redshift optimization: reduce i/o skew, or stick with dist key used in joins?

1 Answers1