I am running a Hive query of the form tableA LEFT JOIN tableB ON tableA.col1 = tableB.col1 AND tableA.col2 = tableB.col2. tableA has 1.8 billion records and tableB has 31 million records. The last reducers in the join never complete and keep running for a long time.
This may be because of skewed data. I tried MAPJOIN, but the query failed because of the huge data volume of tableA. Are there any other options for handling this in a better way?
The task that I can see running for a long time is shown below:
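For reference, the join has roughly the following shape. This is only a sketch: the aliases and the SELECT list are placeholders, since the actual selected columns are not shown here.

```sql
-- Sketch of the query shape only; the SELECT list and the aliases a/b
-- are placeholders, not the real query text.
SELECT a.*, b.*
FROM tableA a
LEFT JOIN tableB b
  ON a.col1 = b.col1
 AND a.col2 = b.col2;
```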
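One way to check the skew suspicion would be to count rows per join-key pair in tableA and look at the largest groups. The query below is a minimal sketch; the LIMIT of 20 is an arbitrary choice, and the same check could be repeated on tableB.

```sql
-- Hypothetical skew check: list the most frequent (col1, col2) pairs.
-- A few pairs with very large counts would suggest that a handful of
-- reducers receive most of the rows for the join.
SELECT col1, col2, COUNT(*) AS cnt
FROM tableA
GROUP BY col1, col2
ORDER BY cnt DESC
LIMIT 20;
```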
reduce > copy task(attempt_1498868574233_185232_m_001336_0 succeeded at 8.94 MB/s) Aggregated copy rate(1121 of 2532 at 108.94 MB/s)
What exactly is it trying to do in that step?