
Hortonworks HDP 2.3.0 - Hive 0.14

Table T1 (partitioned on col1, no buckets, ORC): approx. 120 million rows, 6 GB data size
Table T2 (partitioned on col2, no buckets, ORC): approx. 200 million rows, 6 MB data size

T1 left outer join T2 on ( t1.col3 = t2.col3 )

The above query runs for a very long time in the last reducer phase, in both Tez and MR mode. I also tried auto convert true/false and an explicit MAPJOIN hint.

Still, the query remains stuck in the last reducer phase and never finishes.

FYI: if the data size of T2 is either 9 KB or 1 GB, the query finishes.
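For reference, the query shape described above is roughly the following (a sketch using the table and column names given in the question, not the exact production query):

    select t1.*, t2.*
    from T1 t1
    left outer join T2 t2
      on t1.col3 = t2.col3;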

Venkat M
  • Um. Some more verbs and explanatory text would be nice. The question is rather hard to understand as it stands. – Mark Dickinson May 23 '16 at 18:33
  • I am trying to perform a left outer join of a big table (6 GB) with a small table (6 MB). In both Tez and MR mode, this left outer join never finishes. With any kind of configuration (auto convert true / auto convert false) the execution is stuck in the last reducer stage of the MapReduce/Tez job. – Venkat M May 23 '16 at 19:53

2 Answers


The problem may be that there are too many bytes/rows per reducer. If execution is stuck in a single last reducer, it is most probably data skew. To check, look at the top 5 values of col3 in both tables; skew means a large share of the records (say 30%) have the same key value. If it is skew, try joining the skewed key separately and UNION ALL it with the join over all other keys. Something like this:

select * from
T1 left outer join T2 on ( t1.col3 = t2.col3 )
where t1.col3 = SKEW_VALUE  -- placeholder for the skewed key value
union all
select * from
T1 left outer join T2 on ( t1.col3 = t2.col3 )
where t1.col3 <> SKEW_VALUE -- note: <> also drops NULL keys; add OR t1.col3 IS NULL if needed
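The skew check suggested above can be run as a simple count query (shown here for T1; repeat it for T2):

    select col3, count(*) as cnt
    from T1
    group by col3
    order by cnt desc
    limit 5;

Any value whose count is a large fraction of the total row count is a skew candidate.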

If execution is stuck in the last reducer stage as a whole (not in a single reducer or just a few), then check the hive.exec.reducers.bytes.per.reducer setting; maybe it is too high.

set hive.exec.reducers.bytes.per.reducer=67108864;
leftjoin
  • Hi all, thanks for your responses. There were a huge number of duplicate values on the join key (col3); as a result, 37 TB of data was being written, which made the reducer run for a long time. After filtering the duplicates, the query ran better. – Venkat M May 25 '16 at 17:52
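If the duplicate rows in T2 are not needed in the output, one way to apply the fix described in that comment is to deduplicate the join key in a subquery before the join (a sketch; the selected columns are assumed and should be adjusted to the real schema):

    select t1.*, t2d.col3
    from T1 t1
    left outer join (select distinct col3 from T2) t2d
      on t1.col3 = t2d.col3;

This caps the join output at one row per T1 row for matching keys, instead of one row per duplicate in T2.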

Have you tried setting the size for the auto-convert join? Try giving a size greater than that of the small table, so it can fit into memory:

set hive.auto.convert.join.noconditionaltask.size = 10000000;
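A fuller set of map-join settings to try together (the values below are examples; the size is in bytes and should exceed the small table's size, here the 6 MB T2):

    set hive.auto.convert.join=true;
    set hive.auto.convert.join.noconditionaltask=true;
    set hive.auto.convert.join.noconditionaltask.size=10000000;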

sumitya