I'm working with a Hive table that is partitioned by year, month, and day. e.g.
year=2015 AND month=201512 AND day = 20151231
.
From my limited knowledge of the way Hive works, these are probably set up in a folder structure where the '2015' folder contains 12 month folders, and each month folder has 28-31 day folders inside. In that case, using
WHERE year = 2015 AND month = 201512 AND day = 20151231
would just climb down the directory structure to the 20151231 folder. I would think that using just WHERE day = 20151231
would trigger the same traversal, and therefore be essentially the same query, but we were given sample code which used the year AND month AND day
format (i.e. referencing all 3 partitions).
I ran some benchmarks using both options (last night and this morning, when server load is extremely light-to-non-existent), and the time taken is essentially the same. I suspect that the sample code is wrong, and I can just use the day partition, but I want to be sure.
Is there any performance advantage to using several partitions that are subsets of each other in a Hive query?
I know that Hive partitions are treated like columns, but would the same hold true for a non-partitioned column?