0

I need to write a query in Hive or define a function that needs to do the followings:

The dataset:

Student || Time    || ComuputerPool
-------------------------------------
  A     ||  9:15AM ||  Pool1.Machine2
-------------------------------------
  A     ||  9:45AM ||  Pool1.Machine7
-------------------------------------
  A     ||  10:15AM||  Pool1.Machine9
-------------------------------------
  A     ||  11:00AM||  Pool2.Machine2
-------------------------------------
  A     ||  12:05  ||  Pool2.Machine3
-------------------------------------
  A     ||  12:40  ||  Pool3.Machine5
------------------------------------- 
  A     ||  13:10  ||  Pool1.Machine3
-------------------------------------
  A     ||  13:50  ||  Pool1.Machine10
-------------------------------------
  B   ..........................

so now the query should find out how long a particular student has spent in a particular computer pool by calculating the difference of when he first used a machine in a pool and when he first start using a machine in another pool. So this example the time he spent would be the difference of : 11:00AM - 9:15AM = 1Hour45Mins

My question here is how am I going to mark the first use in one store time value and use it later when I find the next pool data.

Dilshad Abduwali
  • 1,388
  • 7
  • 26
  • 47

1 Answers1

1

Hive supports row_number(), so you can use a trick for grouping sequential values. The difference between two row numbers defines a group, which can then be used for aggregation. The resulting query looks like:

select student, grp, min(time), max(time)
from (select t.*,
             (row_number() over (partition by student order by time) -
              row_number() over (partition by student, computerpool order by time)
             ) as grp
      from dataset t
     ) t
group by student, grp, computerpool;

EDIT:

Your question is about pools not machines. So, you only care about the part before the period, presumably. (By the way, you should store this information in separate columns).

select student, substr(computerpool, 1, 6), min(time), max(time)
from (select t.*,
             (row_number() over (partition by student order by time) -
              row_number() over (partition by student, substr(computerpool, 1, 6) order by time)
             ) as grp
      from dataset t
     ) t
group by student, grp, substr(computerpool, 1, 6);
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • its grouping by pools even though they were in different time, because it should consider the pool as different as soon as he lease it and consider next visit as new group – Dilshad Abduwali May 30 '15 at 08:40
  • @dhssa . . . I think that is what this query does. – Gordon Linoff May 30 '15 at 14:37
  • I ran it and it grouped the pools into one group regardless its time frame and numbered them from 1,2,.. . Expected: 1.visit of Pool1 the number 1,2,.. and second visit Pool1 again from 1, 2, ... – Dilshad Abduwali May 30 '15 at 14:47
  • question is edited: extra two rows are added so the query should not group first Pool1 and second Pool1 or should not number them all together rather separately – Dilshad Abduwali May 30 '15 at 14:51