1

I have a BIG table in Hive 0.13 - it has approx 250 GB of data per day. Per hour, it is, hence, approx, 10 GB of data. I have a BI Tool which would like to access this table's data on per day or per hour basis for which I need to test the queries which the BI tool would generate and run on Hive.

One of the queries, when BI is used for daily data for yesterday, looks like below:

select count(*) 
from my_table
where 
yyyy=year(date_sub(TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP())),1))
and mm=month(date_sub(TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP())),1))
and dd=day(date_sub(TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP())),1))
;

My Table in Hive in MY_TABLE while YYYY, MM and DD are the partitioned columns in MY_TABLE. It is already stored in ORC Format.

The above query runs for very good amount of time, post which when I see the EXPLAIN EXTENDED output, I clearly see that it is doing a FULL TABLE SCAN of MY_TABLE irrespective of filter conditions.

How can we avoid this issue ?

Kindly advise.

Note again : Hive version is 0.13. We're in middle of an upgrade.

Thanks,

Suddhasatwa

Note:

The solution provided here (Why partitions elimination does not happen for this query?) is not applicable in my case, since I am using Hive 0.13 while CURRENT_DATE function is available only post Hive version 1.+.

Community
  • 1
  • 1
  • Possible duplicate of [Why partitions elimination does not happen for this query?](http://stackoverflow.com/questions/42376268/why-partitions-elimination-does-not-happen-for-this-query) – David דודו Markovitz Apr 20 '17 at 07:24
  • Please note the suggested work-around at the answer's comments – David דודו Markovitz Apr 20 '17 at 07:30
  • Hi @DuduMarkovitz : The solution you are refering to uses CURRENT_DATE function in Hive which is not available in Hive 0.13. Does this mean, i have no other option but to upgrade hive ? – Suddhasatwa Bhaumik Apr 20 '17 at 07:37
  • Maybe you can supply the date as a parameter from the BI tool – David דודו Markovitz Apr 20 '17 at 08:20
  • I tried it already - but this functionality is limited to whatever the back-end DB can support. E.g., If my database is Oracle, I can simply say SYSDATE-7 as a filter in BI to get data only for last 7 days. For Hive as well I can do similar, but in that case as well, I cannot use CURRENT_DATE since My Hive is still 0.13 !! :( – Suddhasatwa Bhaumik Apr 20 '17 at 09:50
  • In Oracle you can extract year, month, day using extract function and use them as parameters. See this reference: http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions050.htm – leftjoin Apr 20 '17 at 09:56
  • Thanks! But the issue is not with Oracle. ;) – Suddhasatwa Bhaumik Apr 20 '17 at 10:03
  • using non-deterministic functions in `where` like yyyy=year(date_sub(TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP())),1)) prevent query to be optimized, you need to use pre-calculated parameters instead. – leftjoin Apr 20 '17 at 11:07
  • Hello @leftjoin : Kindly advise how I can use pre-calculated parameters. Some examples would be great! – Suddhasatwa Bhaumik Apr 26 '17 at 15:01
  • See example in this answer: http://stackoverflow.com/a/41056389/2700344 – leftjoin Apr 26 '17 at 15:43
  • Many Thanks! However, the solution you pointed out to is already being used in our ETL Solutions based on UNIX Shell Scripts or Oracle Data Integrator, but it does not "fit" to our BI systems, like Tableau or SAP Business Objects! – Suddhasatwa Bhaumik May 03 '17 at 10:03

0 Answers0