0

guys, I have a big query result which shows me the time (in the column local_time) that riders (in the column rider_id) logout of an app (the column event), so there are two distinct values for the column event, "authentication_complete" and "logout".

event_date  rider_id    event                    local_time
20200329    100695      authentication_complete  20:07:09
20200329    100884      authentication_complete  12:00:51
20200329    100967      logout                   10:53:17
20200329    100967      authentication_complete  10:55:24
20200329    100967      logout                   11:03:28
20200329    100967      authentication_complete  11:03:47
20200329    101252      authentication_complete  7:55:21
20200329    101940      authentication_complete  8:58:44
20200329    101940      authentication_complete  17:19:57
20200329    102015      authentication_complete  14:20:27
20200329    102015      logout                   22:47:50
20200329    102015      authentication_complete  22:48:34

what I want to achieve is for each rider who ever logged out, in one column I want to get the time they logged out, and in another column I want to get the time for the event "authentication_complete" that comes right after that logout event for that rider. In this way, I can see the time period that each rider was out of the app. the query result I want to get will look like below.

event_date  rider_id    time_of_logout  authentication_complete_right_after_the_logout
20200329    100967      10:53:17        10:55:24
20200329    100967      11:03:28        11:03:47
20200329    102015      22:47:50        22:48:34

This was a very unclean data set, and so far I was able to clean this much, but at this step, I am feeling very stuck. I was looking into functions like lag() but since the data is 180,000 rows, there can be multiple events named "logout" for a rider_id and there are multiple consecutive events named "authentication_complete" for the same rider_id, it is extra confusing. I would really appreciate any help. Thanks!

  • What if the next event is not authentication? – Gordon Linoff Mar 31 '20 at 21:54
  • the next event is almost always "authentication_complete". There were very few cases where the next event after a "logout" was again "logout", but that is party due to incomplete data and in that case, we can just look at the next earliest event that is "authentication_complete" or just ignore it if it is too troublesome – Savybossman Mar 31 '20 at 22:32

1 Answers1

0

I think you want lead():

select event_date, rider_id, date, local_time as logout_date,
       authentication_date
from (select t.*,
             lead(local_time) over (partition by event_date, rider_id order by local_time) as authentication_date
      from t
     ) t
where event = 'logout';

This assumes that the next event is indeed an authentication, as in your sample data. You don't specify what to do if this is not the case.

If you specifically want the next authentication date, then you can use a min():

select event_date, rider_id, date, local_time as logout_date,
       authentication_date
from (select t.*,
             min(case when event = 'authentication_complete' then local_time end) over (partition by event_date, rider_id order by local_time desc) as authentication_date
      from t
     ) t
where event = 'logout';
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786