
Hortonworks HDP 2.3.0 - Hive 0.14

Table T1 (partitioned on col1, no buckets, ORC): approx. 120 million rows, 6 GB data size
Table T2 (partitioned on col2, no buckets, ORC): approx. 200 million rows, 6 MB data size

T1 left outer join T2 on ( t1.col3 = t2.col3 )

The above query runs for a very long time in the last reducer phase, in both Tez and MR mode. I also tried auto convert true/false and an explicit MAPJOIN hint.

Still, the query remains stuck in the last reducer phase and never finishes.

FYI: if the data size of T2 is either 9 KB or 1 GB, the query finishes.
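For reference, the query shape described above is roughly the following (a sketch using the table and column names given in the question, not the exact production query):

    select t1.*, t2.*
    from T1 t1
    left outer join T2 t2
      on t1.col3 = t2.col3;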

Venkat M
  • Um. Some more verbs and explanatory text would be nice. The question is rather hard to understand as it stands. – Mark Dickinson May 23 '16 at 18:33
  • I am trying to perform a left outer join of a big table (6 GB) with a small table (6 MB). In both Tez and MR mode, this left outer join never finishes. With any kind of configuration (auto convert true / auto convert false) the execution is stuck in the last reducer stage of the MapReduce/Tez job. – Venkat M May 23 '16 at 19:53

2 Answers


The problem may be that there are too many bytes/rows per reducer. If execution is stuck in a single last reducer, it is most probably data skew. To check, look at the top 5 values of col3 in both tables; skew means a large share of the records (say 30%) have the same key value. If it is skew, try joining the skewed key separately and UNION ALL it with the join over all other keys. Something like this:

select * from
T1 left outer join T2 on ( t1.col3 = t2.col3 )
where t1.col3 = SKEW_VALUE  -- placeholder for the skewed key value
union all
select * from
T1 left outer join T2 on ( t1.col3 = t2.col3 )
where t1.col3 <> SKEW_VALUE -- note: <> also drops NULL keys; add OR t1.col3 IS NULL if needed
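The skew check suggested above can be run as a simple count query (shown here for T1; repeat it for T2):

    select col3, count(*) as cnt
    from T1
    group by col3
    order by cnt desc
    limit 5;

Any value whose count is a large fraction of the total row count is a skew candidate.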

If execution is stuck in the last reducer stage as a whole (not in a single reducer or just a few), then check the hive.exec.reducers.bytes.per.reducer setting; maybe it is too high.

set hive.exec.reducers.bytes.per.reducer=67108864;
leftjoin
  • Hi all, thanks for your responses. There were a huge number of duplicate values on the join key (col3); as a result, 37 TB of data was being written, which made the reducer run for a long time. After filtering the duplicates, the query ran better. – Venkat M May 25 '16 at 17:52
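If the duplicate rows in T2 are not needed in the output, one way to apply the fix described in that comment is to deduplicate the join key in a subquery before the join (a sketch; the selected columns are assumed and should be adjusted to the real schema):

    select t1.*, t2d.col3
    from T1 t1
    left outer join (select distinct col3 from T2) t2d
      on t1.col3 = t2d.col3;

This caps the join output at one row per T1 row for matching keys, instead of one row per duplicate in T2.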

Have you tried setting the size for the auto-convert join? Try giving a size greater than that of the small table, so it can fit into memory:

set hive.auto.convert.join.noconditionaltask.size = 10000000;
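A fuller set of map-join settings to try together (the values below are examples; the size is in bytes and should exceed the small table's size, here the 6 MB T2):

    set hive.auto.convert.join=true;
    set hive.auto.convert.join.noconditionaltask=true;
    set hive.auto.convert.join.noconditionaltask.size=10000000;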

sumitya