Five way joins in hive are of course possible and also (naturally) likely slow to very slow.
You should consider co-partitioning the tables on
- identical partition columns
- identical number of partitions
Other options include hints. For example consider if one of the tables were large and the others small. You may then be able to use streamtble hint
Assuming a is large:
SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val, d.val, e.val
FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1) join d on (d.key = c.key) join e on (e.key = d.key)
Adapted from : https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
:
All five tables are joined in a single map/reduce job and the values
for a particular value of the key for tables b, c,d, and e are
buffered in the memory in the reducers. Then for each row retrieved
from a, the join is computed with the buffered rows. If the
STREAMTABLE hint is omitted, Hive streams the rightmost table in the
join.
Another hint is the mapjoin that is useful to cache small tables in memory.
Assuming a is large and b,c,d,e are small enough to fit in memory of each mapper:
SELECT /*+ MAPJOIN(b,c,d,e) */ a.val, b.val, c.val, d.val, e.val
FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
join d on (d.key = c.key) join e on (e.key = d.key)