How partitioning and clustered by works in Hive table?

Question

I'm trying to understand below query by using that how data is going to be placed.

CREATE TABLE mytable ( 
         name string,
         city string,
         employee_id int ) 
PARTITIONED BY (year STRING, month STRING, day STRING) 
CLUSTERED BY (employee_id) INTO 256 BUCKETS

The keyword PARTITIONED BY will distribute the data in below like dir structure.

/user/hive/warehouse/mytable/y=2015/m=12/d=02

But am not able to understand, how employee_id will be distributed among these directories ? 256 buckets (files) will be created, and all those files will be having all employee_id but which file will sit under which dir, how that will be decided ?

Can anyone help me to understand this ?

leftjoin · Accepted Answer · 2021-04-10T19:23:12.163

Directories are partitions or table location. Buckets are files inside these directories.

Complex partitions are hierarchical directories. In your case:

`/user/hive/warehouse/mytable/` - Table location, contains partition directories:
  `y=2015/` - year partition directory, contains months directories: 
    `m=12/` - month partition, contains days partitions directories:
      `d=02/` - day partition, contains 256 files(buckets)
        00000
        ...
        00255
     `d=03/` -Each day partition will contain 256 files (if you have enough data)
        00000
        ...
        00255

Each file will contain not all employee_id. Which record will sit in which file is decided using this formula:

 bucket_number=hash_function(employee_id) MOD 256

Where hash_function is integer, in case of Int employee_id it is equal to employee_id.

256 - is the number of buckets

MOD 256 will produce integer values in the range [0..255], corresponding to bucket numbers.

The same id's will always be in the same buckets. Each daily partition will contain it's own files(buckets), up to 256 buckets in each.

Say, employee_id = 1024 goes into bucket 0, if the same employee_id exists in many days, it will be in file 00000 in each day directory.

employee_id=1050 goes into file 000026 because 1050 MOD 256 = 26.

So, first the data is partitioned by partition key, inside partitions it is bucketed (distributed between files).

How partitioning and clustered by works in Hive table?

1 Answers1