1

I'm working with a Hive table that is partitioned by year, month, and day. e.g.

year=2015 AND month=201512 AND day = 20151231.

From my limited knowledge of the way Hive works, these are probably set up in a folder structure where the '2015' folder contains 12 month folders, and each month folder has 28-31 day folders inside. In that case, using

WHERE year = 2015 AND month = 201512 AND day = 20151231

would just climb down the directory structure to the 20151231 folder. I would think that using just WHERE day = 20151231 would trigger the same traversal, and therefore be essentially the same query, but we were given sample code which used the year AND month AND day format (i.e. referencing all 3 partitions).

I ran some benchmarks using both options (last night and this morning, when server load is extremely light-to-non-existent), and the time taken is essentially the same. I suspect that the sample code is wrong, and I can just use the day partition, but I want to be sure.

Is there any performance advantage to using several partitions that are subsets of each other in a Hive query?

I know that Hive partitions are treated like columns, but would the same hold true for a non-partitioned column?

bsg
  • 825
  • 2
  • 14
  • 34

1 Answers1

0

When you are running a query like that on a partitioned table, hive will first query the metastore to find which directories have to be included in the map/reduce input and like you saw, it doesn't quite matter how they are arranged ( day=20151231 vs year=2015/month=12/day=31 ). If you're using mysql for the metastore, it means hive internally will run a sql query to its database to retrieve only the partitions to query. The difference in performance in this SQL query is also going to be negligible, especially compared to the duration of your map/reduce job. It's quite different when using non-partition columns, since those are not stored in the metastore, but a full scan of the data is needed.

Roberto Congiu
  • 5,123
  • 1
  • 27
  • 37