Performance of pyspark + hive when a table has many partition columns

Question

I am trying to understand the performance impact on the partitioning scheme when Spark is used to query a hive table. As an example:

Table 1 has 3 partition columns, and data is stored in paths like

year=2021/month=01/day=01/...data...

Table 2 has 1 partition column

date=20210101/...data...

Anecdotally I have found that queries on the second type of table are faster, but I don't know why, and I don't why. I'd like to understand this so I know how to design the partitioning of larger tables that could have more partitions.

Queries being tested:

select * from table limit 1

I realize this won't benefit from any kind of query pruning.

The above is meant as an example query to demonstrate what I am trying to understand. But in case details are important

This is using s3 not HDFS
The data in the table is very small, and there are not a large number of partitons
The time for running the query on the first table is ~2 minutes, and ~10 seconds on the second
Data is stored as parquet

My observations were based on S3 and simple queries selecting a small number of rows from the full take (i.e select * limit 1). These are not realistic queries but just testing. — GeorgeWilson, Dec 19 '21 at 23:08

leftjoin · Answer 1 · 2021-12-19T18:04:36.810

Except all other factors which you did not mention: storage type, configuration, cluster capacity, the number of files in each case, your partitioning schema does not correspond to the use-case.

Partitioning schema should be chosen based on how the data will be selected or how the data will be written or both. In your case partitioning by year, month, day separately is over-partitioning. Partitions in Hive are hierarchical folders and all of them should be traversed (even if using metadata only) to determine the data path, in case of single date partition, only one directory level is being read. Two additional folders: year+month+day instead of date do not help with partition pruning because all columns are related and used together always in the where.

Also, partition pruning probably does not work at all with 3 partition columns and predicate like this: where date = concat(year, month, day) Use EXPLAIN and check it and compare with predicate like this where year='some year' and month='some month' and day='some day'

If you have one more column in the WHERE clause in the most of your queries, say category, which does not correlate with date and the data is big, then additional partition by it makes sense, you will benefit from partition pruning then.

Right understood. The example is somewhat facetious, but I was using it as an example because in this case the equivalence of the two options is clear and I wanted to understand the performance. — GeorgeWilson, Dec 19 '21 at 14:13
It makes sense that traversing a deeper hierarchy would be expensive, but it seems surprising it would generally be noticable. What exactly happens during this operation and is it possible to get some more logs or metrics so I can understand what is causing the speed difference — GeorgeWilson, Dec 19 '21 at 14:16
Note that these tables are very small, which is why the difference feels large — GeorgeWilson, Dec 19 '21 at 14:16

Performance of pyspark + hive when a table has many partition columns

1 Answers1