How to process skewed data using Hive?

Question

I am doing join operation in hive. But when reducer reaches 99% reducer gets stuck.

Then i identified that there is skew data in table. Ex. In table A there is 1 million data and table B has 10k only.In table A joining column has 80% values are same and rest is other. So hive reducer stuck at that value.

Here is my query :

INSERT INTO TABLE xyz SELECT m.name, m.country, m.user_type, m.category FROM A m JOIN category n ON (m.name = n.name) where country=2 GROUP BY m.name, m.country, m.user_type, m.category;

So please suggest possible solution. How can i process join operation on this kind of data.

http://stackoverflow.com/questions/32370033/hive-join-optimization/32373086#32373086 — Kishore, Mar 22 '16 at 12:07

score 2 · Answer 1 · answered Mar 22 '16 at 07:17

2

Starting in Hive 0.10.0, tables can be created as skewed or altered to be skewed (in which case partitions created after the ALTER statement will be skewed). In addition, skewed tables can use the list bucketing feature by specifying the STORED AS DIRECTORIES option. See the DDL documentation for details: Create Table, Skewed Tables, and Alter Table Skewed or Stored as Directories

For reference use this link.

answered Mar 22 '16 at 07:17

Vikas Hardia

2,635
5
34
53

Thanks for replay but i can not follow this approach for daily processing. – Suresh Parmar Mar 22 '16 at 14:03

score 0 · Answer 2 · answered May 30 '17 at 09:05

Found some solution for above problem.

Set below parameters to hive before execution hive join.

set hive.optimize.skewjoin=true;
set hive.skewjoin.key=100000;
set hive.skewjoin.mapjoin.map.tasks=10000;
set hive.skewjoin.mapjoin.min.split=33554432;
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
set hive.vectorized.execution.reduce.groupby.enabled = true;
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.partition.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.autogather=true;
set mapred.output.compress=true;
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set hive.auto.convert.join=false;

Few parameters needs to change according to your data size and cluster size.

It would help if this answer had references as to why this worked. For example, four of the parameters apparently turn on cost-based optimization: https://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/ — dfrankow, Nov 06 '17 at 15:36

score 0 · Answer 3 · answered Sep 10 '19 at 13:13

0

you can try the MapJoin, Just like below:

set hive.auto.convert.join = true;
set hive.mapjoin.smalltable.filesize=25000000; -- This default value is 25MB, you can change it.

answered Sep 10 '19 at 13:13

StrongYoung

762
1
7
17

How to process skewed data using Hive?

3 Answers3