
I am an analyst trying to build a query that pulls the last 7 days of data from a table in Hadoop. The table is partitioned by date.

When I test my query with hard-coded dates, everything works as expected. However, when I write it to calculate the window from today's timestamp, it does a full table scan and I had to kill the job.

Sample query:

SELECT * FROM target_table
WHERE date >= DATE_SUB(FROM_UNIXTIME(UNIX_TIMESTAMP(),'yyyy-MM-dd'), 7);

I'd appreciate some advice on how I can revise my query to avoid the full table scan.

Thank you!

eyy

2 Answers


I'm not sure I have an elegant solution, but since I use Oozie for workflow coordination, I pass in start_date and end_date from Oozie. In the absence of Oozie, I might use bash to calculate the appropriate dates and pass them in as parameters.

Partition filters have always had this problem, so I just found myself a workaround.
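The bash approach might look like the sketch below: compute the date bounds outside Hive so that the partition filter the query sees is a literal string, which the planner can prune on. The hivevar names and the `.hql` file name are illustrative, and `date -d` assumes GNU date.

```shell
# Compute the 7-day window with GNU date; the resulting strings are
# passed into Hive as variables, so the WHERE clause compares the
# partition column against constants rather than UNIX_TIMESTAMP().
START_DATE=$(date -d '7 days ago' +%Y-%m-%d)
END_DATE=$(date +%Y-%m-%d)

echo "Querying from ${START_DATE} to ${END_DATE}"

# Illustrative invocation (query file name is an assumption):
# hive --hivevar start_date="${START_DATE}" \
#      --hivevar end_date="${END_DATE}" \
#      -f last_7_days.hql
#
# last_7_days.hql:
#   SELECT * FROM target_table
#   WHERE date >= '${hivevar:start_date}'
#     AND date <= '${hivevar:end_date}';
```

Because Hive substitutes `${hivevar:...}` textually before planning, the filter arrives as plain string literals and partition pruning behaves the same as with hard-coded dates.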

Pradeep Gollakota

I have a workaround that works for me when the number of days is larger, e.g. 30/60/90/120.

A query like:

(unix_timestamp(date,'yyyy-MM-dd')
     >= unix_timestamp(date_sub(FROM_UNIXTIME(UNIX_TIMESTAMP(),'yyyy-MM-dd'), ${sub_days}),'yyyy-MM-dd'))
AND (unix_timestamp(date,'yyyy-MM-dd')
     <= unix_timestamp(FROM_UNIXTIME(UNIX_TIMESTAMP(),'yyyy-MM-dd'),'yyyy-MM-dd'))

sub_days = a passed-in parameter; here it would be 7.
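Supplying `sub_days` from the command line might look like this minimal sketch (the query file name is illustrative; `--hivevar` values are substituted into `${sub_days}` in the query text before execution):

```shell
# Pass the number of days back as a Hive variable; Hive substitutes
# ${sub_days} in the query text before running it.
SUB_DAYS=7
echo "Pulling the last ${SUB_DAYS} days"

# Illustrative invocation:
# hive --hivevar sub_days="${SUB_DAYS}" -f rolling_window.hql
```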

Deb
  • Hi Debjit, I tried your approach but for some reason it still does a table scan for me. Let me experiment a bit more. I'd appreciate any other ideas. Thank you! – eyy Feb 14 '15 at 00:23
  • Apologies for the late reply, but I think you could try Hive's bucketing feature; it may help with your problem. I haven't worked much with bucketing, but in the samples I've checked, it performed better. [link](https://cwiki.apache.org/confluence/display/Hive/ListBucketing#ListBucketing-BasicPartitioning) – Deb Aug 06 '15 at 06:11