Multiple table join in Hive

Question

I have migrated the data from Teradata tables into Hive.

Now I have to build summary tables on top of the imported data. Each summary table needs to be built from five source tables.

If I go with joins, I'll need to join five tables. Is that possible in Hive, or should I break the query into five parts? What would be the advisable approach for this problem?

Please suggest.

score 14 · Answer 1 · answered Mar 13 '15 at 20:49

Five-way joins in Hive are certainly possible, but they are also likely to be slow, sometimes very slow.

You should consider co-partitioning the tables so that they share:

identical partition columns
identical number of partitions
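In Hive, co-partitioning on the join key is usually expressed with bucketing rather than with directory partitions. The following is a minimal sketch, assuming two of the five tables; the names a_bucketed and b_bucketed, the column types, and the bucket count of 32 are all hypothetical and would need to match your data:

```sql
-- Hypothetical bucketed copies of tables a and b (types and bucket count are assumptions).
-- Both tables are bucketed and sorted on their join column, with the same bucket count.
CREATE TABLE a_bucketed (key BIGINT, val STRING)
CLUSTERED BY (key) SORTED BY (key) INTO 32 BUCKETS;

CREATE TABLE b_bucketed (key1 BIGINT, val STRING)
CLUSTERED BY (key1) SORTED BY (key1) INTO 32 BUCKETS;

-- Before Hive 2.0 you would also need: SET hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE a_bucketed SELECT key, val FROM a;
INSERT OVERWRITE TABLE b_bucketed SELECT key1, val FROM b;

-- Ask Hive to exploit the matching buckets when joining.
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;

SELECT a.val, b.val
FROM a_bucketed a JOIN b_bucketed b ON (a.key = b.key1);
```

Whether the bucketed join actually pays off depends on table sizes and the Hive version, so it is worth comparing against the plain join on your own data.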

Other options include hints. For example, if one of the tables is large and the others are small, you may be able to use the STREAMTABLE hint.

Assuming a is large:

SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val, d.val, e.val
FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
JOIN d ON (d.key = c.key) JOIN e ON (e.key = d.key)

Adapted from https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins:

All five tables are joined in a single map/reduce job, and the values for a particular value of the key for tables b, c, d, and e are buffered in memory in the reducers. Then, for each row retrieved from a, the join is computed with the buffered rows. If the STREAMTABLE hint is omitted, Hive streams the rightmost table in the join.

Another hint is MAPJOIN, which is useful for caching small tables in memory.

Assuming a is large and b, c, d, and e are small enough to fit in the memory of each mapper:

SELECT /*+ MAPJOIN(b,c,d,e) */ a.val, b.val, c.val, d.val, e.val
FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
JOIN d ON (d.key = c.key) JOIN e ON (e.key = d.key)
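Note that in newer Hive releases (0.11 onward, as far as I know) the MAPJOIN hint may be ignored by default, and the optimizer instead converts joins to map joins on its own when the small tables fall under a size threshold. A minimal sketch of the relevant session settings follows; the threshold value is purely illustrative, not a recommendation:

```sql
-- Let the optimizer convert eligible joins to map joins automatically.
SET hive.auto.convert.join = true;

-- Allow several small tables to be combined into one map-side join
-- when their total size in bytes is below the threshold.
SET hive.auto.convert.join.noconditionaltask = true;
-- Illustrative value only (about 10 MB); tune it to the mapper memory available.
SET hive.auto.convert.join.noconditionaltask.size = 10000000;

-- Alternatively, to make Hive honor an explicit /*+ MAPJOIN(...) */ hint:
SET hive.ignore.mapjoin.hint = false;
```

With automatic conversion enabled, the query above can be written without any hint and Hive decides which tables to load into memory.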

Hey, thanks! I'm looking into how I can improve the performance of Hive join queries. — Chhaya Vishwakarma, Mar 17 '15 at 08:34
@chhayavishwakarma Yes, and this answer provides those methods. — WestCoastProjects, Jul 19 '17 at 13:55

score 0 · Answer 2 · answered Mar 13 '15 at 15:28

Yes, you can join multiple tables in a single query. This allows many opportunities for Hive to make optimizations that couldn't be done if you broke it into separate queries.
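One way to see what Hive does with the single-query form is to prefix it with EXPLAIN and look at the stages it plans. A small sketch, assuming five placeholder tables named a through e joined on hypothetical key columns:

```sql
-- Placeholder tables and columns; substitute your own five source tables.
EXPLAIN
SELECT a.val, b.val, c.val, d.val, e.val
FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
JOIN d ON (d.key = c.key) JOIN e ON (e.key = d.key);
```

The output lists the stages and their dependencies, which gives a rough idea of how many jobs the join turns into before you decide whether splitting the query is worth it.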

Jeremy Beard


Thanks, Jeremy Beard! I'm looking into how I can improve the performance of Hive join queries. What is the best practice for doing such joins in an optimized way? – Chhaya Vishwakarma Mar 17 '15 at 08:37
