I know that colocation is important for distributed joins in Netezza. At a high level, it has the following definition:
All data for joins are located in the same SPU
I also talked to some Netezza employees in the past and they mentioned that a join is considered colocated if all tables distribute and join on the same columns.
However, I still feel that definition is a bit lacking ... Based on my understanding of 1-phase and 2-phase GROUP BY's, I suspect colocation really operates on the following definition:
A join is considered colocated if the set of columns used in the join condition is a superset of the distribution keys of all participating tables.
Is that a correct definition? I tried searching for a precise definition of colocation in NZ but all I got is a bunch of articles that kinda assume you know the definition already.
Input on this would be appreciated. Thanks!
Edit: Based on ScottMcG's suggestion, I reformulated the definition of colocated join as:
1. It must be a HASH or MERGE SORT JOIN
2. Set of columns in join conditions must be superset of all distribution keys of all participating tables
3. ?
The ? mark for #3 is an ambiguity I need to iron out. Accordng to ScottMcG, the distribution keys of each table must also be joined with each other.
Suppose Tables A, B, C are distributed on text columns A.C1, A.C2, B.C3, B.C4, C.C5, and C.C6 and we have the following join.
SELECT * FROM A
INNER JOIN B "Join1"
ON A.C1=B.C3
INNER JOIN C "Join2"
ON A.C2=B.C4
AND A.C2=C.C6
AND [X]
Now, let us provide a few possible definitions of [X]. Then for which definitions of [X] will Join2 be colocated?
(1) [X] = A.C2 = 5
(2) [X] = A.C2 = B.C1 OR A.C2 = C.C5
(3) [X] = A.C1 IS NULL
(4) [X] = A.NonKeyColumn1 = B.NonKeyColumn2