If your table is large (as determined by "set hive.mapjoin.smalltable.filesize;"), you cannot do a map-side join, unless your tables are bucketed and sorted and you have turned on "set hive.optimize.bucketmapjoin.sortedmerge = true". In that case you can still do a map-side join on large tables. (Of course, you still need "set hive.optimize.bucketmapjoin = true".)
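As a rough sketch, the settings mentioned above could be collected at the top of a Hive session like this. The last two lines are not from this post: they are settings that some setups also need for the sort-merge variant, so treat them as an assumption and check them against your Hive version.

```sql
-- Print the current small-table threshold (in bytes) used for plain map joins.
set hive.mapjoin.smalltable.filesize;

-- Enable bucket map join and its sorted-merge variant.
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;

-- Assumption, not from the original post: often listed alongside the above.
-- set hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
-- set hive.auto.convert.sortmerge.join = true;
```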
Make sure that your tables are truly bucketed and sorted on the same column. It is easy to get this wrong. To get a bucketed and sorted table, you need to:
- set hive.enforce.bucketing=true;
- set hive.enforce.sorting=true;
DDL script
CREATE table XXX
(
id int,
name string
)
CLUSTERED BY (id)
SORTED BY (id)
INTO XXX BUCKETS
;
INSERT OVERWRITE TABLE XXX
select * from XXX
CLUSTER BY id
;
Use "describe formatted XXX" and look for Num Buckets, Bucket Columns, and Sort Columns to make sure it's set up correctly.
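For illustration, here is roughly what that check looks like. The values in the comments are hypothetical, assuming a table clustered and sorted by id into 32 buckets. Your output will differ.

```sql
DESCRIBE FORMATTED XXX;

-- In the "Storage Information" part of the output, check these fields.
-- Hypothetical values for a table CLUSTERED BY (id) SORTED BY (id) INTO 32 BUCKETS:
--   Num Buckets:     32
--   Bucket Columns:  [id]
--   Sort Columns:    [Order(col:id, order:1)]
```

If Num Buckets shows -1 or Sort Columns is an empty list, the table metadata does not declare bucketing or sorting, and the sort-merge join should not be expected to kick in.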
The other requirements for the bucket join are:
- Both tables are bucketed on the same columns, and those columns are used in the ON clause.
- The number of buckets in one table is a multiple of the number of buckets in the other table.
If you meet all the requirements, the map join will be performed, and it will be lightning fast because it avoids the shuffle and reduce phase.
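One way to confirm that the map join is actually chosen is to look at the query plan. The sketch below assumes two hypothetical tables, big_a and big_b, each with columns (id int, name string), both CLUSTERED BY (id) SORTED BY (id) INTO 32 BUCKETS (equal counts trivially satisfy the multiple rule). The table names and bucket count are made up for illustration.

```sql
-- Hypothetical tables: big_a and big_b, both bucketed and sorted on id, 32 buckets each.
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;

-- EXPLAIN prints the plan without running the query.
EXPLAIN
SELECT a.id, a.name, b.name
FROM big_a a
JOIN big_b b
  ON a.id = b.id;
```

In the plan, look for a map-side join operator (for a sort-merge bucket join this typically appears as "Sorted Merge Bucket Map Join Operator") instead of a plain reduce-side "Join Operator". The exact wording may vary between Hive versions.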
By the way, SMB map join is not well supported in Hive 1.X for the ORC format: you will get a null pointer exception. The bug has been fixed in 2.X.