I am trying to understand the performance impact of the partitioning scheme when Spark is used to query a Hive table. As an example:
Table 1
has 3 partition columns, and data is stored in paths like
year=2021/month=01/day=01/...data...
Table 2
has 1 partition column, and data is stored in paths like
date=20210101/...data...
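For concreteness, here is a minimal sketch of the two layouts as I create them; the table names, the single data column, and the S3 bucket path are hypothetical stand-ins:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Table 1: three partition columns (hypothetical names/columns/bucket)
spark.sql("""
    CREATE TABLE table1 (value STRING)
    PARTITIONED BY (year STRING, month STRING, day STRING)
    STORED AS PARQUET
    LOCATION 's3://my-bucket/table1/'
""")

# Table 2: one partition column, same data
spark.sql("""
    CREATE TABLE table2 (value STRING)
    PARTITIONED BY (date STRING)
    STORED AS PARQUET
    LOCATION 's3://my-bucket/table2/'
""")
```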
Anecdotally I have found that queries on the second type of table are faster, but I don't know why. I'd like to understand this so that I know how to design the partitioning of larger tables that could have many more partitions.
Queries being tested:
select * from table limit 1
I realize this won't benefit from any kind of partition pruning.
The above is just an example query meant to demonstrate what I am trying to understand. But in case the details are important:
- This is using S3, not HDFS
- The data in the table is very small, and there are not a large number of partitions
- Running the query on the first table takes ~2 minutes, and ~10 seconds on the second (timed roughly as in the sketch after this list)
- Data is stored as Parquet
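For reference, this is roughly how the timings were taken; a minimal sketch, assuming the hypothetical table names from the DDL above:

```
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Run the same trivial query against each table; collect() forces execution.
for table in ("table1", "table2"):
    start = time.monotonic()
    spark.sql(f"SELECT * FROM {table} LIMIT 1").collect()
    print(f"{table}: {time.monotonic() - start:.1f}s")
```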