0

Suppose I have 12 segments in my system, then suppose I create a table where I distribute by a column based on month of the year. Now suppose I have a second table and I do likewise with that table, simply using a create statement with DISTRIBUTED BY (month) and then I join these two tables on month as well as other attributes.

In the highly specific above scenario, is the performance of the query joining on month improved by the selection of month as a distribution column? Will records from October from both tables appear in the same segment auto-magically, or is it possible to specify the segment that these records will be distributed to such that they are collocated on the same segment? If they were and could be collocated, would there be any performance advantage anyway?

MattB
  • 2,203
  • 5
  • 26
  • 48

1 Answers1

2

Yes, if the data is collocated, the query performance will be better because there will be less data movement. Collocated meaning the distribution key from the two or more tables are the same data type(s).

But picking your distribution key is important. If you pick month and have 12 months of data, that means your table will only use 12 segments. That isn't a good distribution key. So don't use month!

You should pick the logical primary key for the distribution key so you have the best distribution across the segments. A table with transactions will probably have a transaction_id column which identifies each transaction. That is the ideal distribution column.

An example where you want to use a different distribution key than the logical primary key may be two different log tables that both have a unique, but different, log_id. But these two tables may have other attributes that are the same like product_id, customer_id, and transaction_id and this is how you join the two tables. You may want distribute both of these tables by these three columns so that the data will be collocated and a query joining the two tables will perform better than if the data wasn't collocated.

Make sure you don't have data skew when you do this because that would defeat the intent of making things faster. Good luck!

Jon Roberts
  • 2,068
  • 1
  • 9
  • 11
  • Jon hit it on the head. This article is helpful background about MPP characteristics : https://dwarehouse.wordpress.com/2012/12/28/introduction-to-massively-parallel-processing-mpp-database – Kyle Dunn Jan 25 '18 at 21:44
  • Well, OP said he has 12 segments. So picking 12 months will work - in this case. But when the database is expanded, the additional segments will not be used. – A. Scherbaum Jan 26 '18 at 13:25