1

I am doing join operation in hive. But when reducer reaches 99% reducer gets stuck.

Then i identified that there is skew data in table. Ex. In table A there is 1 million data and table B has 10k only.In table A joining column has 80% values are same and rest is other. So hive reducer stuck at that value.

Here is my query :

INSERT INTO TABLE xyz SELECT m.name, m.country, m.user_type, m.category FROM A m JOIN category n ON (m.name = n.name) where country=2 GROUP BY m.name, m.country, m.user_type, m.category;

So please suggest possible solution. How can i process join operation on this kind of data.

Suresh Parmar
  • 815
  • 7
  • 15

3 Answers3

2

Starting in Hive 0.10.0, tables can be created as skewed or altered to be skewed (in which case partitions created after the ALTER statement will be skewed). In addition, skewed tables can use the list bucketing feature by specifying the STORED AS DIRECTORIES option. See the DDL documentation for details: Create Table, Skewed Tables, and Alter Table Skewed or Stored as Directories

For reference use this link.

Vikas Hardia
  • 2,635
  • 5
  • 34
  • 53
0

Found some solution for above problem.

Set below parameters to hive before execution hive join.

set hive.optimize.skewjoin=true;
set hive.skewjoin.key=100000;
set hive.skewjoin.mapjoin.map.tasks=10000;
set hive.skewjoin.mapjoin.min.split=33554432;
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
set hive.vectorized.execution.reduce.groupby.enabled = true;
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.partition.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.autogather=true;
set mapred.output.compress=true;
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set hive.auto.convert.join=false; 

Few parameters needs to change according to your data size and cluster size.

Suresh Parmar
  • 815
  • 7
  • 15
  • 1
    It would help if this answer had references as to why this worked. For example, four of the parameters apparently turn on cost-based optimization: https://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/ – dfrankow Nov 06 '17 at 15:36
0

you can try the MapJoin, Just like below:

set hive.auto.convert.join = true;
set hive.mapjoin.smalltable.filesize=25000000; -- This default value is 25MB, you can change it.
StrongYoung
  • 762
  • 1
  • 7
  • 17