Are there any tradeoffs in partitioning using date as a yyyymmdd string versus having multiple partitions for year, month and day as integers?
I think the querying is easier to follow with the year/month/day in separate columns. Hive might get confused with some formulations for date ranges, for instance, and end up scanning all the data. – Gordon Linoff Apr 14 '16 at 14:48
1 Answer
For every partition created in Hive, a new directory is created to store that partition's data. These details are recorded in the Hive metastore as well as in Hadoop's fsimage. Partitioning by a single yyyymmdd key creates one directory per day, whereas partitioning by year, month and day creates three nested directory levels (year, then month, then day). That means more entries in the Hive metastore and more metadata to store in the fsimage. This is how Hive and Hadoop see the partitions from a metadata perspective.
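To make the directory overhead concrete, here is a rough Python sketch that counts the directories each layout would produce for one year of daily partitions. The column name `dt` and the helper names are hypothetical, and the paths only mimic Hive's usual `key=value` directory naming; it does not touch a real cluster.

```python
from datetime import date, timedelta


def single_key_dir(d: date) -> str:
    # One string partition column, e.g. dt=20160414 (`dt` is an assumed name).
    return f"dt={d:%Y%m%d}"


def multi_key_dir(d: date) -> str:
    # Three integer partition columns, so the values are not zero-padded.
    return f"year={d.year}/month={d.month}/day={d.day}"


def count_directories(paths) -> int:
    # Count every distinct directory, including the intermediate levels.
    dirs = set()
    for p in paths:
        parts = p.split("/")
        for i in range(1, len(parts) + 1):
            dirs.add("/".join(parts[:i]))
    return len(dirs)


# One partition per day for all of 2016 (a leap year, so 366 days).
days = [date(2016, 1, 1) + timedelta(days=i) for i in range(366)]
single = count_directories(single_key_dir(d) for d in days)
multi = count_directories(multi_key_dir(d) for d in days)
print(single, multi)
```

Under these assumptions the extra cost of the three-level layout is the intermediate year and month directories, which is small compared with the number of day-level directories.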
From a querying perspective, partitioning by yyyymmdd works well when you query on a per-day basis. Partitioning by year, month and day gives you the flexibility to query the data efficiently at the year level and the month level, in addition to the day level.
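The querying difference can be sketched by generating the filter each layout needs. This is only an illustration in Python that builds WHERE-clause strings; the column name `dt` and the function names are hypothetical. It relies on the fact that zero-padded yyyymmdd strings sort in chronological order.

```python
from datetime import date


def single_key_range(start: date, end: date) -> str:
    # Zero-padded yyyymmdd strings compare chronologically, so an
    # arbitrary date range is a single BETWEEN on the one partition column.
    return f"dt BETWEEN '{start:%Y%m%d}' AND '{end:%Y%m%d}'"


def multi_key_month(year: int, month: int) -> str:
    # Year-level and month-level filters are plain equalities on partition columns.
    return f"year = {year} AND month = {month}"


def multi_key_range(start: date, end: date) -> str:
    # A range that crosses a month boundary needs a combined expression.
    # Whether Hive prunes partitions for this form is an assumption to verify
    # against the query plan; otherwise it may scan all the data.
    lo = start.year * 10000 + start.month * 100 + start.day
    hi = end.year * 10000 + end.month * 100 + end.day
    return f"(year * 10000 + month * 100 + day) BETWEEN {lo} AND {hi}"


print(single_key_range(date(2016, 3, 28), date(2016, 4, 3)))
print(multi_key_month(2016, 4))
print(multi_key_range(date(2016, 3, 28), date(2016, 4, 3)))
```

So the split layout reads more naturally for whole-year or whole-month queries, while the single string key keeps arbitrary date ranges simple.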
