2

I have a table that is partitioned by date in the format yyyyMMdd. If I do a simple query like this:

SELECT COUNT(*) FROM MyTable WHERE Date >= '20140924'

then it will scan 3 days of data (today is the 26th). However I would like my query to always look at the last 3 days, so I write it like this

SELECT COUNT(*) FROM MyTable
WHERE date >= from_unixtime(unix_timestamp() - 259200, 'yyyyMMdd')

The problem is that now it scans every partition. Is there a way to get it to pre-calculate the part of the query after the greater than?

MikeKulls
  • 873
  • 1
  • 10
  • 22

2 Answers2

1

Unfortunately Hive doesn't support that. I have faced the similar problem in past and my hive table had partitions for last 2 years.

However one workaround you can do is that you can run this hive query inside a shell script where you will compute this date in another variable and use it as a variable in hive query. An example script would be:

#!/bin/bash

date=`date +"%Y%m%d" -d "-3 days"`
hive -e "select count(*) from MyTable where date >= '$date'"
Amar
  • 3,825
  • 1
  • 23
  • 26
  • I'm thinking while this isn't ideal it's probably the best solution. Generally when I've written sql in the past it's always been called from some other language where stuff like this is easy to do. No reason it will be any different with hive. – MikeKulls Sep 29 '14 at 01:17
  • also I tried various other things to get it done in Hive itself by using joins etc but nothing really worked out....so where all its possible to use shell I try to use it for running queries – Amar Sep 29 '14 at 04:57
0

One workaround is to create a table with a single row where you can select your calculated time from.

CREATE TABLE dual (dummy STRING);
INSERT INTO TABLE dual SELECT count(*) FROM dual;

SELECT COUNT(*) FROM MyTable t1
JOIN (
    SELECT from_unixtime(unix_timestamp() - 259200, 'yyyyMMdd') myDate
    FROM dual LIMIT 1
) t2 ON (t2.myDate = t1.Date)
FuzzyTree
  • 32,014
  • 3
  • 54
  • 85
  • I gave this a try but it still appeared to scan every partition. Maybe it's a version thing? I'm using hive 0.12 – MikeKulls Sep 29 '14 at 01:13
  • @MikeKulls interesting, can you check if it still scans every partition if you move the date to a subquery i.e. `SELECT COUNT(*) FROM MyTable WHERE date >= (select from_unixtime(unix_timestamp() - 259200, 'yyyyMMdd') from dual limit 1)` – FuzzyTree Sep 29 '14 at 02:41
  • I can't get that to work at all. I get `cannot recognize input near 'SELECT'`. I've tried various sub-queries, even as simple as `SELECT '20140926' AS DT FROM dual` – MikeKulls Sep 29 '14 at 03:16