Hive - optimize multiple table joins

Question

I need to join multiple tables in a single query and then overwrite another table.

Focus/Driver table: FACT (huge, bucketed on ID)
Join Table 1: T1 (big, but smaller than FACT, bucketed on ID and joined with FACT on FACT.ID)
Join Table 2: T2 (big, but smaller than T1 and joined with FACT on FACT.ID2)
Join Table 3: T3 (reference table, small enough to fit into memory, joined to FACT)
Join Table 4: T4 (reference table, small enough to fit into memory, joined to FACT)
Join Table 5: T5 (reference table, small enough to fit into memory, joined to FACT)

Now, I wanted to know what is the sequence of tables to be joined to achieve the best performance.

My thoughts and questions:

I want to first join FACT with T1 since both are bucketed. But is it a good idea to join the 2 big tables first since this huge joined dataset will be joined with the smaller ones (which means more data to be moved between mappers and reducers) or should we join with the smaller tables first? But, if we join the FACT with the smaller tables, I will not be able to perform a bucket join with T1, right (since the joined dataset will not be bucketed).

Here is a good link https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization#LanguageManualJoinOptimization-StarJoinOptimization . Are you doing any filtering on values that come from T3, T4, T5? If yes, then definitely MAP-JOIN them to FACT to filter out as much as possible. The fact that tables are bucketed will not help joins, they also need to be sorted, see the link for more details. — alexeipab, Apr 02 '16 at 23:07
Thanks alexeipab. The tables are bucketed and sorted (i forgot to mention they are sorted). I am not doing any filter on tables T3, T4 and T5. But even if i was, if joining the big tables (FACT and T1) were of a higher priority, then I would join these first and then join the remaining tables. If I joined the smaller tables first (and filtered out the unnecessary rows), and then perform join with the bucketed table, it will not help performance because the data would no longer be bucketed. Thanks for your input. — user3031097, Apr 05 '16 at 23:42

Hive - optimize multiple table joins

0 Answers0