0

I have a table having year, month and day as partition. I am trying to find an optimized way to read the data for last n days using parameters. The only way I can do this at the moment is by specifying each of the combination of year, month, and day individually which is very problematic if we have to read a lot of data, say for 1 month. Below is a sample example.

select count(*) from table 
where (year = 2021 and month = 7 and day = 5)
or (year = 2021 and month = 7 and day = 4)
or (year = 2021 and month 7 and day 3)

I am interested in knowing the following.

  1. Can I use case when in where clause without impacting the performance? For example, will the below query read the same amount of data as the above query?
select count(*) from table 
where year = 2021 and month = 7 and (case when day between 4 and 7 then 1 else 0 end) = 1
  1. How does partition work behind the scenes? I believe that the query gets converted into a map reduce job before execution. Will the both codes mentioned above will be converted to same map reduce job?

  2. Can I use functions like case when freely with partitioned columns in where clause and will the hive query engine be able to interpret the function and scan the appropriate partitions?

  3. Is there any built in function in hive to know which partitions are getting hit by the query? If not, is there any workaround? Is there any way to know the same in presto?

mw981
  • 42
  • 4

1 Answers1

1

Partition pruning works fine with queries like this, the logic is like in your CASE expression:

where concat(year, '-', month, '-', day) >= '2021-07-04' 
      and 
      concat(year, '-', month, '-', day) <= '2021-07-07'

See this answer.

How to check how partition pruning works: Use EXPLAIN DEPENDENCY or EXPLAIN EXTENDED See this answer.

leftjoin
  • 36,950
  • 8
  • 57
  • 116
  • what is the difference between used partition and input partition? Does the query read some extra partitions then what are actually getting used? And if so, how does this extra read partition impact the performance? – mw981 Aug 08 '21 at 07:42
  • @mw981 Sorry did not get it. Partitions being read are in `"input_partitions":...` What do you mean "used"? Partion being read = used, isn't it? – leftjoin Aug 08 '21 at 10:08
  • it's regarding the second link you have [shared](https://stackoverflow.com/questions/67465968/will-hive-do-a-full-table-query-with-both-partition-conditions-and-not-partition/67466615#67466615) where you have mentioned that 'EXPLAIN EXTENDED also prints used partitions.' – mw981 Aug 08 '21 at 11:02
  • @mw981 does it show different set of partitions vs explain dependency? – leftjoin Aug 08 '21 at 12:28
  • for the code I ran it didn't. But I just wanted clarification on that statement. – mw981 Aug 09 '21 at 05:18
  • @mw981 Oh I see... I meant just that you can use that command too to get the same partition list. – leftjoin Aug 09 '21 at 06:26