1

I am running a hive query like tableA left join tableB on tableA.col1=tableB.col1 and tableA.col2=tableB.col2. tableA is having 1.8 billion data and tableB is having 31 million records. Now the last reducers in my join is not getting completed and it is running for long.

It may be because of skew data. I did tried MAPJOIN and the query failed because of huge data volume for tableA. Is there any other options these can be handled in a better way?

The task which I can see running for long is as below

reduce > copy task(attempt_1498868574233_185232_m_001336_0 succeeded at 8.94 MB/s) Aggregated copy rate(1121 of 2532 at 108.94 MB/s)

What exactly it is trying to do in that step?

Koushik Chandra
  • 1,565
  • 12
  • 37
  • 73

1 Answers1

1

set hive.optimize.skewjoin to true

hive.optimize.skewjoin
Default Value: false
Added In: Hive 0.6.0
Whether to enable skew join optimization.

You might also want to tune hive.skewjoin.key, hive.skewjoin.mapjoin.map.tasks and hive.skewjoin.mapjoin.min.split.

hive.skewjoin.key
Default Value: 100000
Added In: Hive 0.6.0
Determine if we get a skew key in join. If we see more than the specified number of rows with the same key in join operator, we think the key as a skew join key.

hive.skewjoin.mapjoin.map.tasks
Default Value: 10000
Added In: Hive 0.6.0
Determine the number of map task used in the follow up map join job for a skew join. It should be used together with hive.skewjoin.mapjoin.min.split to perform a fine grained control.

hive.skewjoin.mapjoin.min.split
Default Value: 33554432
Added In: Hive 0.6.0
Determine the number of map task at most used in the follow up map join job for a skew join by specifying the minimum split size. It should be used together with hive.skewjoin.mapjoin.map.tasks to perform a fine grained control.

https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties

David דודו Markovitz
  • 42,900
  • 6
  • 64
  • 88