I have a huge data set of transactions. I need to choose a partitioning column: either transaction_date (a new value every day) or state (a limited number of values). Which is the ideal choice, and why?
2 Answers
The ideal choice is state as the partitioning column, because partitioning creates one folder per distinct value of the column. The number of folders therefore equals the number of states, so the NameNode has less metadata to store.
If transaction_date were used instead, a new folder would be created every day, and the growing number of folders would eventually degrade NameNode performance.
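To put rough numbers on the folder counts described above, here is a minimal sketch. The 50 state codes and the five-year date range are made-up inputs, and it assumes at least one transaction arrives every day, so every day gets its own folder.

```python
from datetime import date

def partition_dirs_by_state(states):
    """One partition folder per distinct state value."""
    return len(set(states))

def partition_dirs_by_date(start, end):
    """One partition folder per calendar day in [start, end].

    Assumes every day has at least one transaction.
    """
    return (end - start).days + 1

# Hypothetical inputs: 50 state codes and five years of daily data.
states = [f"S{i:02d}" for i in range(50)]
start, end = date(2019, 1, 1), date(2023, 12, 31)

state_dirs = partition_dirs_by_state(states)
date_dirs = partition_dirs_by_date(start, end)

print(f"folders when partitioned by state: {state_dirs}")
print(f"folders when partitioned by date:  {date_dirs}")
```

The state count stays fixed, while the date count grows by one folder per day for as long as the table is loaded.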

Disadvantage of choosing transaction_date as the partition column: (1) Too many small directories, which may cause overhead in HDFS.
Advantage of using state: (1) The number of directories is fixed.
It all depends on how the queries will be written. If your query filters on transaction_date and the table is not partitioned on that column, the whole table has to be scanned and overall execution will be slow.
Also, creating partitions does not guarantee faster execution. Results come back faster from partitions that hold little data than from partitions that hold a lot of data.
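The two points above can be sketched with a toy model of partition pruning. This is not how Hive is implemented; it is only a simplified illustration in which `build_partitions` and `rows_scanned` are hypothetical helpers and the ten rows are invented, with most of them deliberately placed in one state to show skew.

```python
from collections import defaultdict

def build_partitions(rows, key):
    """Group rows into one bucket per distinct value of the partition key."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[key]].append(row)
    return parts

def rows_scanned(parts, partition_key, filter_col, value):
    """Count rows read to answer `filter_col = value`.

    A filter on the partition column reads only the matching partition.
    A filter on any other column has to read every partition.
    """
    if filter_col == partition_key:
        return len(parts.get(value, []))
    return sum(len(p) for p in parts.values())

# Invented data: 8 rows for CA, 2 rows for NY (a skewed distribution).
rows = (
    [{"state": "CA", "transaction_date": "2024-01-01"}] * 6
    + [{"state": "CA", "transaction_date": "2024-01-02"}] * 2
    + [{"state": "NY", "transaction_date": "2024-01-01"}] * 1
    + [{"state": "NY", "transaction_date": "2024-01-02"}] * 1
)
by_state = build_partitions(rows, "state")

scan_ca = rows_scanned(by_state, "state", "state", "CA")
scan_ny = rows_scanned(by_state, "state", "state", "NY")
scan_date = rows_scanned(by_state, "state", "transaction_date", "2024-01-01")

print(scan_ca, scan_ny, scan_date)
```

With the table partitioned by state, the filter on state reads only one partition (and the small NY partition is cheaper than the large CA one), while the filter on transaction_date gets no benefit and reads all rows.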
