0

I have following HQL

SELECT count(*) OVER (PARTITION BY identity.hwid, passwordused.domain ORDER BY event.time ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) pocet,
       min(event.time) OVER (PARTITION BY identity.hwid, passwordused.domain ORDER BY event.time ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) minTime,
       max(event.time) OVER (PARTITION BY identity.hwid, passwordused.domain ORDER BY event.time ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) maxTime 
FROM t21_pam6

How can I define the 3 same WINDOW clauses into one?

The documentation (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics ) shows this example

SELECT a, SUM(b) OVER w
FROM T;
WINDOW w AS (PARTITION BY c ORDER BY d ROWS UNBOUNDED PRECEDING)

But I don't think it's working. It's not possible to define WINDOW w as... is not a HQL command.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
Vity
  • 241
  • 3
  • 15
  • 1
    Just to answer your original question - it will be working if you remove the semicolon after `FROM T`. WINDOW clause is not a statement but a part of the query. Probably, there was a typo in the documentation at that moment. – Sergey Khudyakov Aug 06 '17 at 02:09

2 Answers2

1

This type of optimization is something that the compiler would need to do. I don't think there is a way to ensure this programmatically.

That said, the calculation for the minimum time is totally unnecessary. Because of the order by, it should be the time in the current row. Similarly, if you can handle null values, then the expression can be simplified to:

SELECT count(*) OVER (PARTITION BY identity.hwid, passwordused.domain ORDER BY event.time ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) pocet,
       event.time as minTime,
       lead(event.time, 2) OVER (PARTITION BY identity.hwid, passwordused.domain ORDER BY event.time) as maxTime 
FROM t21_pam6;

Note that the maxtime calculation is slightly different because it will return NULL for the last two values matching the conditions.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Thanks for the hint on optimization of minTime/maxTime. From my point of view the optimization engine is working correctly and the count of map/reduce steps is OK, but I would like to make also more readable for another programmer - to extract the repetitive part into single one. – Vity Jun 14 '16 at 14:45
0

As @sergey-khudyakov responded, there was a bug in documentation. This variant works fine:

  SELECT count(*) OVER w,
          min(event.time) OVER w,
          max(event.time) OVER w 
   FROM ar3.t21_pam6
   WINDOW w AS (PARTITION BY identity.hwid, passwordused.domain ORDER BY      event.time ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING)
Vity
  • 241
  • 3
  • 15