Does order of partitioning columns matter in Hive?

Question

Lets say I have a partitioned table with multiple columns as partition keys e.g.

partitioned by (department string,year int, month int,day int)

So does this specific order really matter? All the online resources refer to advantage of scanning only specific sub-directories for search. But ultimately everything is a file in big data, directories seem to be more like logical grouping. And when one specifies a filter on partitioned column, hive just needs to know which files are involved and where they are located, not sure how directory is going to be useful -- it's not like as if directories are loaded in memory -- files are loaded in memory -- and the directory path is more like a label for a given file. If that's the case, no matter which order we specify for partitioning , it shouldn't matter. This is especially more evident in HDInsight where the underlying file system (BLOBs) has no concept of directory.

_"directories seem to be more like logical grouping"_ >> right. Now, can you figure out which one is more efficient, reading 100 files or reading just 2? And which one has the higher latency, accessing a local file or an HDFS file or a file in Object Storage? — Samson Scharfrichter, Oct 07 '19 at 15:26
Similar musing: why do relational databases have indexes? "Ultimately", the data is in tables, so why bother with additional stuff? Why bother about performance improvements? — Samson Scharfrichter, Oct 07 '19 at 15:29

mazaneicha · Accepted Answer · 2019-10-09T01:01:07.597

Although you're right about directories being logical constructs, if you consider the amount of metadata your HiveServer2 has to get and sift through in order to execute an average query, the order does matter. If a query contains ...WHERE department='IT'..., and the partitions are laid out as you show, given 100 departments total, the partition pruning mechanism will be able to eliminate 99 subdirectories from the tree right away. But if the order of partition columns is reversed, the same query will need to retrieve metadata for (30 days x 12 month x N years) partitions from Hive MetaStore, just to figure out whether partition /department=IT actually exists in all of them. So the order of partitions can be decided by analyzing predominant query patterns.

Another common factor to consider is devops/maintenance related, especially if data is loaded into a table incrementally. If one needs to backoff/recover from unsuccessful load, will he need to drop a partition (day=08) in each department subtree individually, or can all department data be cleared at once by dropping partition (day=08)?

When you say that using the reverse partition structure results in more metadata that your Hiveserver has to sift through, what metadata are you specifically talking about? filenames? — vi_ral, Jun 30 '20 at 17:45

Does order of partitioning columns matter in Hive?

1 Answers1