1

I have a list of web browsing data that I'm attempting to convert into a sessions.

An example dataset from a user:

time_millis Type Result  
07/10/2015 08:31    1   0  
07/10/2015 08:41    1   0  
07/10/2015 08:48    2   0  
07/10/2015 08:50    2   0  
07/10/2015 09:11    2   1  
07/10/2015 09:14    3   0  
07/10/2015 09:15    1   0  
07/10/2015 09:17    1   0  
07/10/2015 10:31    1   0  
07/10/2015 10:33    1   0  
07/10/2015 10:36    1   0  
07/10/2015 18:57    1   1  

I am trying to achieve:

time_millis Type    Result  Session  
07/10/2015 08:31    1   0   1  
07/10/2015 08:41    1   0   1  
07/10/2015 08:48    2   0   2  
07/10/2015 08:50    2   0   2  
07/10/2015 09:11    2   1   3  
07/10/2015 09:14    3   0   4  
07/10/2015 09:15    1   0   4  
07/10/2015 09:17    1   0   4  
07/10/2015 10:31    1   0   5  
07/10/2015 10:33    1   0   5  
07/10/2015 10:36    1   0   5  
07/10/2015 18:57    1   1   6  

I'd like to partition the sessions by a 15 minute window per user and also create a new session if the 'Type' is changed.

I've tried using the following bit of code, though I don't think it performs as I'd like above:

CONDITIONAL_TRUE_EVENT("time_millis" > coalesce(lag("time_millis"), 1) 
+ 900000 OR type != LAG(type,1)) OVER (PARTITION BY user ORDER BY   
time_millis) AS session_id
Wai Ha Lee
  • 8,598
  • 83
  • 57
  • 92

1 Answers1

1

You could bucket the time into four 15 minute buckets and compare the current bucket to the last bucket in your event check.

Trunc the time to minutes, subtract the time mod 15 minutes to place it in a 15 minute bucket. Originally I thought just to do the div only on the minute but if you have an hour gap it would break, so you need to include the full timestamp.

TRUNC("time_millis",'MI') - (EXTRACT(MINUTE FROM "time_millis") % 15) * interval '1 minute'

In your formula...

CONDITIONAL_TRUE_EVENT("time_millis" > coalesce(lag("time_millis"), 1) + 900000 
    OR type != LAG(type,1) 
    OR TRUNC("time_millis",'MI') - (EXTRACT(MINUTE FROM "time_millis") % 15) * interval '1 minute' != TRUNC(LAG("time_millis",1),'MI') - (EXTRACT(MINUTE FROM LAG("time_millis",1)) % 15) * interval '1 minute') 
OVER (PARTITION BY user ORDER BY time_millis) AS session_id

Another idea would be to only change the event if the last event was more than 15 minutes from the current event. It really depends on what you want to do, but I thought this might be useful too. Take the diff, extract the epoch (number of seconds total) divide by 60 to get minutes.

EXTRACT(EPOCH FROM ("time_millis" - LAG("time_millis",1)))/60 > 15

In your formula...

CONDITIONAL_TRUE_EVENT("time_millis" > coalesce(lag("time_millis"), 1) + 900000 
    OR type != LAG(type,1) 
    OR EXTRACT(EPOCH FROM ("time_millis" - LAG("time_millis",1)))/60 > 15) 
OVER (PARTITION BY user ORDER BY time_millis) AS session_id
woot
  • 7,406
  • 2
  • 36
  • 55
  • This seems to create a syntax error at the OVER part of the query, any ideas: specifically: TRUNC(LAG("time_millis",1),'MI') - (EXTRACT(MINUTE FROM LAG("time_millis",1)) % 15) * interval '1 minute' OVER (PARTITION BY user ORDER BY time_millis) AS session_id –  Oct 21 '15 at 09:18
  • I was missing a parenthesis before `OVER`. I fixed it. – woot Oct 21 '15 at 14:33
  • @user3937831 I added another method for the 15 minutes if you want it to stream along. This method would keep a session grouped if you had only 1 event without a change in type every 15 minutes. – woot Oct 22 '15 at 14:54