1

I have a Hive table and I need to run a query similar to the one below for different values of the parameters date,identifier1,identifier2,lower and upper and union the results together.

Select 
col1,
col2,
new_time,
sum(col3),
case 
when "date" between date1 and date2 then 'No'
when "date" between date3 and date4 then 'Yes'
end as date_group,
case when "date" < e then 'test1' else 'test2' end as test_group,
'identifier1' as ID,
'identifier2' as ID2
FROM Table1
WHERE (new_time between time1 and time2)
      AND (tag between 'lower' and 'upper')
GROUP BY 
col1,
col2,
new_time,
case 
when "date" between date1 and date2 then 'No'
when "date" between date3 and date4 then 'Yes'
end,
case when "date" < e then 'test' else 'test2' end 

My initial idea was to create the parameter table below and loop through each row which holds combination of parameter values and union the results.

+------------+-------------+-------------+--------+-------+
|    date    | identifier1 | identifier2 | lower  | upper |
+------------+-------------+-------------+--------+-------+
| 2019-05-12 |           1 | A           |     10 |    20 |
| 2019-07-10 |           2 | B           |     30 |    40 |
| 2019-04-10 |           3 | C           |     60 |    70 |
| 2019-04-11 |           4 | D           |    423 |   500 |
| 2019-07-10 |           5 | E           |     85 |    88 |
+------------+-------------+-------------+--------+-------+

Two problems, I'm not sure how to go about this and I'm not sure if hiveql allows loops. I would prefer a hive solution but a SQL solution could work if I'm able to move my intermediate table to a relational database. A solution would be equivalent to the union query below which has the parameter values highlighted.

enter image description here

Any help with a solution is appreciated, thanks.

hghg hghg
  • 127
  • 9

1 Answers1

0

Create a parameter table and use joins. I'm not 100% sure which are the parameters and which are the columns, but something like this:

SELECT t1.col1, t1.col2, t1.new_time, sum(t1.col3),
       (case when "date" between a and b then 1
             when "date" between c and d then 2
        end) as date_group,
       (case when "date" < e then 'test1' else 'test2' end) as test_group,
      p.identifier as ID,
      p.identifier2 as ID2
FROM Table1 t1 CROSS JOIN
      params p
WHERE t1.new_time BETWEEN t1.time1 AND t1.time2 AND
      t1.tag BETWEEN p.lower AND p.upper
GROUP BY t1.col1, t1.col2, t1.new_time, 
      (case when p."date" between a and b then 1
            when p."date" between c and d then 2
       end),
      (case when p."date" < e then 'test' else 'test2' end),
      p.identifier, p.identifier2;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Thanks I will give this a try I've updated my question to hopefully give clarity to what the parameter values are and how they change. – hghg hghg Mar 29 '20 at 19:05
  • Thank you for the great solution, it solves my current problem perfectly I'm curious if there are any ways to help it scale to larger data sets. – hghg hghg Mar 30 '20 at 23:35
  • @hghghghg . . . This should work pretty well if you have just a few parameters. – Gordon Linoff Mar 30 '20 at 23:58
  • will the runtime of the query be related to the number of parameters and the number of rows in the parameter table? – hghg hghg Mar 31 '20 at 00:03
  • @hghghghg . . . Yes. The `GROUP BY` will have somewhat poor scaling. It is hard to see how you get around that though. – Gordon Linoff Mar 31 '20 at 00:17