0

In a Hive table I have millions of rows, I would like to do a partition on one column 'id' which will be unique. So it is not a good practice to create a partition on that unique column because it will create so many number of files, and directories, which could slow down process, So is there a way to specify: create a partition on this 'id' column for every 10k records or 30k records. So that performance can be improved? for example:

 create table test(name string, note string) partitioned by(id int) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE 
LOCATION 'hdfs://somelocation/'

Also if there is a date type column is there can we do a partition on that column as with 'year and month only'? like PARTITIONED BY (year bigint, month bigint) or year and month together?

Nomad
  • 751
  • 4
  • 13
  • 34
  • 1
    Creating partitions on ID is not good practice. Even if you have few billions of row, its ok to have it in simple orc format table. But yes if you are loading data on daily basis than `date_partition` partition can be useful. Still you want to create partition, please provide schema of the table to identify the best partition column. – Sandeep Singh Oct 20 '17 at 15:15
  • Yes, mostly it will be a daily load. But I do not have a any such column as created_date in tables, but i have a column(STime) with date type. So creating partition on this STime column would be a good option, or adding new column with 'row_createdate' and adding partition as you said would be a good option? – Nomad Oct 20 '17 at 15:30
  • you can create on `STime` but analyze the data before proceeding with it to check if it contains sufficient number or rows. If you are not sure about it, going with `row_createdate` or any appropriate name which suits in your case would be better approach. – Sandeep Singh Oct 20 '17 at 15:50
  • Is there a possible way to create partition on any other column such as 'note' ? so that it will be partitioned by first 10k records? – Nomad Oct 20 '17 at 16:07
  • @SandeepSingh: can we do a partition on that column as with 'year and month only'? like PARTITIONED BY (year bigint, month bigint) or year and month together? – Nomad Oct 20 '17 at 18:08
  • If you had just one single massive load, you could use a kind of crude "hash partitioning", where the partitioning key is derived from, e.g. _"concatenate A-B-C then apply a fast hash then format the first 2 bytes as a hex string"_ => fixed number of partitions (e.g. 256×256) but random placement of rows. Trouble is, each daily load would be scattered in all these parts, resulting in very small files. Which is a bad practise. – Samson Scharfrichter Oct 20 '17 at 19:50
  • what would happen if million partitions are created. Will it cause the Hive to completely start working ? – Ravi Dec 20 '19 at 20:58

0 Answers0