0

I am new to hive and was reading about Bucketing and MapSide joins

"Map joins can take advantage of bucketed tables (Buckets), since a mapper working on a bucket of the left table only needs to load the corresponding buckets of the right table to perform the join. The syntax for the join is the same as for the in-memor...."

Suppose i create a table as

CREATE TABLE bucketed_users (id INT, name STRING) CLUSTERED BY (id) INTO 4 BUCKETS;

my questions are

1> whether all 4 Buckets will have the same size ? or will it depend on the frequency of id in data ? ie if an id repeats a lot the rellated bucket will have higher size than other buckets.

2> will there be a scenario where a data related to a id will be present in 2 different buckets ? ie one record for an id is present in bucket 1 and another record in bucket 4.

if yes then how will optimizer work with the bucketed data ?

if any one has tried this it will be great if they can share their experience.

kamal
  • 15
  • 6

1 Answers1

1

This question was asked a while ago, but since it was still one of the first hits maybe this will help other people.

In most cases, just creating a table will make a single flat file on your cluster. Any query you run in Hive is translated into a Map and Reduce job. By definition you need keys and values in order to enter the reduce stage, and each mapper must search through a chunk of the single flat file to discover the given keys and values.

When using a bucketed table, Hive comes up with a hash of the clustered by value (here you use id) and splits the single table into that many flat files. Because the table is split up by the hashes of the id's the size of each split is based on the values in your table. If you have no values that would get mapped to the 3rd bucket, that flat file would be empty. You will also never have any given id showing up in more than one bucket, UNLESS you forget to set hive.enforce.bucketing = true before an insert statement.

Bucketing works greats for sampling data and speeding up select statements that use the cluster by's column as the search criteria. There are also improvements noticed when you join on two tables on their clustered columns when then have the same (or multiples of) buckets.

Hopefully this clears up some confusion.

Ryan Bedard
  • 421
  • 3
  • 5