-2

I would like to join two tables that have one common column and the same number of buckets with same sorting.

Other than that setting do I need to setup any other conditions other than setting up the properties?

set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
J0e3gan
  • 8,740
  • 10
  • 53
  • 80

1 Answers1

0

If you have two datasets that are too large for a map side join,an efficient technique for joining them is to sort the two datasets into buckets.

The trick is to cluster and sort by the same join key.
CREATE TABLE order(int,price float,quantity int) CLUSTERED BY(cid) INTO 32 BUCKETS;

CREATE TABLE customer (id int, first string, last string) CLUSTERED BY(id) INTO 32 BUCKETS;

This provides two major optimization benefits :

Sorting by join key makes joins easy ,all possible matches value resides on the same area on disk 

Hash bucketing a join  key ensures all matching values reside on same node ,equi join can then run with no shuffle .
user3484461
  • 1,113
  • 11
  • 14