I need to join multiple tables in a single query and then overwrite another table.
- Focus/Driver table: FACT (huge, bucketed on ID)
- Join Table 1: T1 (big, but smaller than FACT, bucketed on ID and joined with FACT on FACT.ID)
- Join Table 2: T2 (big, but smaller than T1 and joined with FACT on FACT.ID2)
- Join Table 3: T3 (reference table, small enough to fit into memory, joined to FACT)
- Join Table 4: T4 (reference table, small enough to fit into memory, joined to FACT)
- Join Table 5: T5 (reference table, small enough to fit into memory, joined to FACT)
Now, I wanted to know what is the sequence of tables to be joined to achieve the best performance.
My thoughts and questions:
I want to first join FACT with T1 since both are bucketed. But is it a good idea to join the 2 big tables first since this huge joined dataset will be joined with the smaller ones (which means more data to be moved between mappers and reducers) or should we join with the smaller tables first? But, if we join the FACT with the smaller tables, I will not be able to perform a bucket join with T1, right (since the joined dataset will not be bucketed).