I am new to Hive and was reading about bucketing and map-side joins:
"Map joins can take advantage of bucketed tables (Buckets), since a mapper working on a bucket of the left table only needs to load the corresponding buckets of the right table to perform the join. The syntax for the join is the same as for the in-memor...."
Suppose I create a table as:
CREATE TABLE bucketed_users (id INT, name STRING) CLUSTERED BY (id) INTO 4 BUCKETS;
My questions are:
1> Will all 4 buckets have the same size, or will the size depend on the frequency of each id in the data? That is, if one id repeats a lot, will the related bucket be larger than the other buckets?
2> Is there a scenario where data for a single id ends up in 2 different buckets? That is, one record for an id is present in bucket 1 and another record for the same id is in bucket 4.
If yes, how will the optimizer work with the bucketed data?
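To make the question concrete, here is my current mental model as a small Python simulation. It is only a sketch under assumptions I have not verified: that Hive picks the bucket as (hash of the clustering column) modulo the number of buckets, and that an INT hashes to its own value. The exact hash may differ between Hive versions, and the name `bucket_for` is hypothetical, not a Hive function.

```python
from collections import Counter

NUM_BUCKETS = 4  # matches INTO 4 BUCKETS in the table definition above


def bucket_for(user_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Hypothetical helper, not a Hive API.

    Assumed rule: (hash & Integer.MAX_VALUE) % num_buckets,
    with an INT hashing to its own value.
    """
    return (user_id & 0x7FFFFFFF) % num_buckets


# Skewed sample data: id 5 repeats much more often than the other ids.
ids = [5] * 6 + [1, 2, 3, 4, 8]

# Count how many rows would land in each bucket under the assumed rule.
rows_per_bucket = Counter(bucket_for(i) for i in ids)
for bucket in range(NUM_BUCKETS):
    print(bucket, rows_per_bucket[bucket])
```

If this model is right, every row with the same id always maps to the same bucket, and the bucket holding the frequently repeated id grows larger than the rest. Is that how Hive actually behaves?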
If anyone has tried this, it would be great if they could share their experience.