1

I know that colocation is important for distributed joins in Netezza. At a high level, it has the following definition:

All data for joins are located in the same SPU

I also talked to some Netezza employees in the past and they mentioned that a join is considered colocated if all tables distribute and join on the same columns.

However, I still feel that definition is a bit lacking ... Based on my understanding of 1-phase and 2-phase GROUP BY's, I suspect colocation really operates on the following definition:

A join is considered colocated if the set of columns used in the join condition is a superset of the distribution keys of all participating tables.

Is that a correct definition? I tried searching for a precise definition of colocation in NZ but all I got is a bunch of articles that kinda assume you know the definition already.

Input on this would be appreciated. Thanks!

Edit: Based on ScottMcG's suggestion, I reformulated the definition of colocated join as:

1. It must be a HASH or MERGE SORT JOIN
2. Set of columns in join conditions must be superset of all distribution keys of all participating tables
3. ?

The ? mark for #3 is an ambiguity I need to iron out. Accordng to ScottMcG, the distribution keys of each table must also be joined with each other.

Suppose Tables A, B, C are distributed on text columns A.C1, A.C2, B.C3, B.C4, C.C5, and C.C6 and we have the following join.

SELECT * FROM A 
    INNER JOIN B "Join1"
        ON A.C1=B.C3
    INNER JOIN C "Join2"
        ON A.C2=B.C4
        AND A.C2=C.C6
        AND [X]

Now, let us provide a few possible definitions of [X]. Then for which definitions of [X] will Join2 be colocated?

(1) [X] = A.C2 = 5

(2) [X] = A.C2 = B.C1 OR A.C2 = C.C5

(3) [X] = A.C1 IS NULL

(4) [X] = A.NonKeyColumn1 = B.NonKeyColumn2
user1836155
  • 858
  • 14
  • 29

1 Answers1

1

For Netezza, a join is considered to be colocated when the tables involved in the join do not need to be redistributed or broadcast from the data slices on which they permanently reside in order to perform the join.

This can only happen if:

  • The set of a columns required by the join are a superset of the columns in the distribution key of each table
  • Each table participating in the join has the same set of columns as their distribution key.
  • The join is an equi-join.

These conditions are pretty close to what you propose in your definition, and are necessary to allow, but not sufficient to insure, a colocated join. It is possible that the optimizer might decide to pre-broadcast one of the tables if it were small enough even though they are distributed on the same columns, and that would then technically not be a colocated join.

One caveat I should add is that for a column to be considered the "same" as another column, the column values should hash to the same value. Generally speaking this means that the column data types would be the same. An exception is that the integer family of datatypes (byteint, smallint, int, bigint) will hash to the same value as long as they are in the supported range.

With regard to the effect of types of joins, equijoins would be of this form. Note that this could either be a hash join or a merge sort join (if the data types were perhaps floating point) under the covers. In either case, we don't need to redistribute the data. In these examples, both tables are distributed on COL1.

SELECT ...
FROM TableA A
   JOIN TableB B
   ON A.COL1 = B.COL1

If the join is an expression based join like either of the following, then you will end up with a redistribution or broadcast of the data. For the "less than" join, you have to be able to determine that 8 is less than 9, but since they will both be hashed to different data slices, they can only be compared if one is relocated to the other.

SELECT ...
FROM TableA A
   JOIN TableB B
   ON A.COL1 < B.COL1

SELECT ...
FROM TableA A
   JOIN TableB B
   ON A.COL1 -  B.COL1 = 0
ScottMcG
  • 3,867
  • 2
  • 12
  • 21
  • I see. So the definition I proposed is require #1. On #2, what if Table A is distributed on A.CustomerName, Table B is distributed on CardholderName, and we have Join on A.CustomerName = B.CardholderName? I assume that wouldn't be the co-localized then. Then one way to think of "same" column is that the conditions must be such that the LHS and RHS must have a PK-FK relationship or both being FK pointing to same PK? As for requirement #3, this is the first time I see it mentioned. I know it is required for HASH JOINs. So can I assume the pre-condition of colocated JOINs that it has to be HASH? – user1836155 Feb 10 '15 at 21:41
  • 1
    I edited my response to clarify on that just as you were making your comment, I think. For #2 it's only important that the contents in the columns hash to the same value. So for "Table A is distributed on A.CustomerName, Table B is distributed on CardholderName, and we have Join on A.CustomerName = B.CardholderName" as long as CustomerName and CardholderName are both varchar, then you could have a colocated join because "SMITH" in both would hash to the same value. For #3, let me add more to the answer to clarify. – ScottMcG Feb 10 '15 at 21:53
  • 1
    Interesting. So let me try to make a simple formalization on the requirements of colocation: (1) The JOIN has to be a HASH/MERGE SORT JOIN and (2) The set of a columns required by the join are a superset of the columns in the distribution key of each table. As for the JOIN types, is there a page that gives a good formal definitoin of their requirements and nature? The only thing I was able to find was that HASH JOIN requires equal signs and same types, EXIST JOIN is a special type of HASH JOIN and NESTED LOOP JOIN is baddddd. I can file a new question on this if necessary – user1836155 Feb 10 '15 at 22:55
  • 1
    I think in order to be formal you still need to specify that the columns used in the distribution key of TableA must be joined to columns used in the distribution key of TableB. For ex: TableA dist on Col1, TableB dist on Col2, select TableA A join TableB B on A.col1=B.col1 and A.col2=B.Col2 would satisfy your proposed #2, but would require a redistribution. It might be worth having a separate question on the join types (both the types implied by the SQL (equi-, expression-) and the types implemented by the code on the backed (hash, NL, exists, MS, etc.) – ScottMcG Feb 10 '15 at 23:46
  • Hmm, that's a new piece of information. So does this mean colocation is achieved if all distribution keys of TableA is joined to any distribution key of TableB? What about joining to constants? Let me update my post for an example – user1836155 Feb 11 '15 at 15:04
  • I'll try an update to reduce the ambiguity, and look at your updated example too. – ScottMcG Feb 11 '15 at 20:36