Partition pruning with impala and parquet

Question

We have a fact table we wish to partition by month. (This is because of our quantity of data, and wanting to hit partition file sizes that are at least 256mb as per parquet best practice). I guess if data increases we may want to go weekly.

The table will ALWAYS be queried for a specific day, and one day only. (It's a snapshot)

So; I tried a simple test. A basic table, with an integer date key, partitioned with integer date-month key.

I imagined that if i queried for 01/01/2011 then it would use the 01-2011 partition. Unfortunately it doesn't. The explain plan shows it scans both partitions.

I computed stats too - thinking the stats would know the min and max values of the date columns, and would therefore know not to hit one of the partitions, but this didn't change anything.

Is that expected? Maybe my example is too simplistic. Is the explain plan misleading? I can imagine many many use cases where you would filter by a single date field, but be partitioned by year and month, how is this supposed to work?

what is the format of the column where dates are stored? if they are strings you can just match it with `where column = '2011-01-01'` or if they are timestamp as I suppose, the function `WHERE YEAR(column) = 2012` and `AND DAY(column) = 1` and `AND MONTH(column) = 1`. But I am pretty sure there is an easier way. — SabDeM, Oct 21 '15 at 21:24
Actually we're flexible. what w're not flexible on is the querying engine - it must query by the full date, not by year(), month() and day() (which I imagine would work because we're exactly matching the partition key) — Codek, Oct 22 '15 at 06:35

Partition pruning with impala and parquet

0 Answers0