
Is a Sort Merge Bucket (SMB) join different from a Sort Merge Bucket Map (SMBM) join? If so, what hints should be added to enable an SMB join? How is an SMBM join superior to an SMB join?

Will "set hive.auto.convert.sortmerge.join=true" this hint alone be sufficient for SMB join? Else should the below hints be included as well.

    set hive.optimize.bucketmapjoin = true;
    set hive.optimize.bucketmapjoin.sortedmerge = true;

The reason I ask is that the hint says bucket *map* join, but no MAP join is performed here. I am under the assumption that both map and reduce tasks are involved in an SMB join, while only map tasks are involved in an SMBM join.

Please correct me if I am wrong.

Bagavathi

1 Answer


If your table is large (as determined by `hive.mapjoin.smalltable.filesize`), you cannot do a plain map-side join. But if your tables are bucketed and sorted, and you turn on `set hive.optimize.bucketmapjoin.sortedmerge = true`, then you can still do a map-side join on large tables. (Of course, you still need `set hive.optimize.bucketmapjoin = true`.)
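
For reference, here is the full combination of settings discussed in this question and answer, as a minimal sketch (all three are standard Hive settings; verify the defaults against your version):

    -- enable bucket map join and its sort-merge variant
    set hive.optimize.bucketmapjoin = true;
    set hive.optimize.bucketmapjoin.sortedmerge = true;
    -- let the optimizer convert eligible joins to SMB joins automatically
    set hive.auto.convert.sortmerge.join = true;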

Make sure that your tables are truly bucketed and sorted on the same column. It's so easy to make mistakes. To get a bucketed and sorted table, you need to

  1. set hive.enforce.bucketing=true;
  2. set hive.enforce.sorting=true;
  3. DDL script

    CREATE TABLE XXX (
      id int,
      name string
    )
    CLUSTERED BY (id) SORTED BY (id) INTO XXX BUCKETS;

    INSERT OVERWRITE TABLE XXX
    SELECT * FROM XXX
    CLUSTER BY id;

Use `describe formatted XXX` and look for `Num Buckets`, `Bucket Columns`, and `Sort Columns` to make sure it's correctly set up.
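
The relevant lines of that output look roughly like this (a sketch only; the exact layout varies by Hive version, and the values shown assume the `id` column from the DDL above):

    Num Buckets:            XXX
    Bucket Columns:         [id]
    Sort Columns:           [Order(col:id, order:1)]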

Other requirements for the bucket join are that the two tables should have:

  1. Data bucketed on the same columns, and those columns are used in the ON clause.
  2. The number of buckets for one table must be a multiple of the number of buckets for the other table (both requirements are illustrated in the sketch below).
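
As a sketch with hypothetical table and column names, here are two tables that satisfy both requirements, joined on the bucketing column:

    -- hypothetical tables; 8 buckets is a multiple of 4 buckets
    CREATE TABLE orders (
      id int,
      amount double
    )
    CLUSTERED BY (id) SORTED BY (id) INTO 8 BUCKETS;

    CREATE TABLE customers (
      id int,
      name string
    )
    CLUSTERED BY (id) SORTED BY (id) INTO 4 BUCKETS;

    -- the bucketing column id appears in the ON clause
    SELECT o.id, c.name, o.amount
    FROM orders o
    JOIN customers c ON o.id = c.id;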

If you meet all the requirements, then the MAP join will be performed, and it will be lightning fast.
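
One way to check (a sketch; the exact operator name can vary by Hive version) is to run `EXPLAIN` on the join and look for a sort-merge bucket map join operator in the plan:

    -- uses the hypothetical tables from the sketch above
    EXPLAIN
    SELECT o.id, c.name
    FROM orders o
    JOIN customers c ON o.id = c.id;
    -- look for "Sorted Merge Bucket Map Join Operator" in the plan output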

By the way, SMB Map Join is not well supported in Hive 1.x for the ORC format; you will get a null pointer exception. The bug has been fixed in 2.x.

GC001
  • What about partitioning? Is the bucketing applied to *each partition* separately? In that case we would have great scalability: the complexity is *reduced* by a factor of `P*B`, where P = #partitions and B = #buckets per partition. – WestCoastProjects Nov 09 '17 at 15:27
  • I didn't try but I think it should work for partitions as well, because the partition is just a path/directory with actual data files, which will be bucketed. – GC001 Nov 10 '17 at 04:57
  • After having read more - yes, it should be applied to each partition: e.g. `into 64 buckets` would be applied to *each* partition. – WestCoastProjects Nov 10 '17 at 05:32
  • Yes, correct! Each partition has a specified number of buckets. So suppose there are 10 different partition values for the partition key column and you define the number of buckets as 128: you will have a total of 1280 bucket files representing your whole dataset (see the sketch below)! – Avik Aggarwal Oct 24 '18 at 06:36
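
To make the comment thread concrete, here is a minimal sketch (hypothetical table and column names) of a table that is both partitioned and bucketed; each partition directory then holds its own set of bucket files:

    -- hypothetical: 10 distinct values of dt x 128 buckets
    -- = 1280 bucket files spread across the partition directories
    CREATE TABLE events (
      id int,
      payload string
    )
    PARTITIONED BY (dt string)
    CLUSTERED BY (id) SORTED BY (id) INTO 128 BUCKETS;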