0

I have some data about when, how long, and what channel people are listening to the radio. I need to make a variable called sessions that groups all entries which occur while the radio is on. Because the data may contain some errors I would like to say that if less than five minutes passes from the end of one channel period to the next then it is still the same session. Hopefully a brief example will clarify.

  obs  Entry_date   Entry_time  duration(in secs) channel
   1    01/01/12      23:25:21    6000               2
   2    01/03/12      01:05:64     300               5
   3    01/05/12      12:12:35     456               5
   4    01/05/12      16:45:21     657               8

I want to create the variable sessions so that

obs  Entry_date   Entry_time  duration(in secs) channel   session
   1    01/01/12      23:25:21    6000               2    1
   2    01/03/12      01:05:64     300               5    1
   3    01/05/12      12:12:35     456               5    2
   4    01/05/12      16:45:21     657               8    3

for defining 1 session i need to use entry_time (and date if it goes from 11pm into the next morning) so that if entry_time+duration + (5minutes) < entry_time(next channel) then the session changes. This has been killing me and simple arrays wont do the trick, or my attempt using arrays has not worked. Thanks in advance

PeterClemmensen
  • 4,018
  • 3
  • 14
  • 22
user2448666
  • 329
  • 1
  • 6
  • 14

2 Answers2

1

hopefully I got your requirements right! Since you need to base result on adjoining rows, there is a need to join a table to itself. The Session #s are not consecutive, but you should get the point.

 create table #temp
 (obs int not null,
entry_date datetime not null,
duration int not null,
channel int not null)


--obs  Entry_date   Entry_time  duration(in secs) channel
insert #temp
select   1, '01/01/12 23:25:21', 6000, 2
 union all select 2, '01/03/12 01:05:54', 300, 5
 union all select 3, '01/05/12 12:12:35', 456, 5
 union all select 4, '01/05/12 16:45:21', 657, 8

select a.obs,
       a.entry_date,
       a.duration,
endSession = dateadd(mi,5,dateadd(mi,a.duration,a.entry_date)),
a.channel,
b.entry_date,
minOverlapping = datediff(mi,b.entry_date,
                          dateadd(mi,5,dateadd(mi,a.duration,a.entry_date))),
anotherSession = case 
          when dateadd(mi,5,dateadd(mi,a.duration,a.entry_date))<b.entry_date
    then b.obs
    else a.obs end
from #temp a
  left join #temp b on a.obs = b.obs - 1

hope this helps a bit

Jay Corbett
  • 28,091
  • 21
  • 57
  • 74
user2065377
  • 448
  • 3
  • 12
  • this has to be performed on a massive data set, 1million + rows, so does the first select statement imply I will need to enter all the information there – user2448666 Sep 10 '13 at 18:52
  • no. I created a temporary table and inserted sample data just to test the query. Substitute #temp with whatever table name you have in mind. The mill or so rows shouldn't be a big problem assuming that the proper indexing is applied. just check the execution plan for that. – user2065377 Sep 10 '13 at 18:56
1

Aside from the comments I made in the OP, here's how I would do it using a SAS data step. I've changed the date and time values for row 2 to what I suspect they should be (in order to get the same result as in the OP). This avoids having to perform a self join, which is likely to be performance intensive on a large dataset.
I've used the DIF and LAG functions, so care needs to be taken if you're adding in extra code (particularly IF statements).

data have;
input entry_date :mmddyy10. entry_time :time. duration channel;
format entry_date date9. entry_time time.;
datalines;
01/01/2012 23:25:21 6000 2
01/02/2012 01:05:54 300 5
01/05/2012 12:12:35 456 5
01/05/2012 16:45:21 657 8
;
run;

data want;
set have;
by entry_date entry_time; /* put in to check data is sorted correctly */
retain session 1; /* initialise session with value 1 */
session+(dif(dhms(entry_date,0,0,entry_time))-lag(duration)>300); /* increment session by 1 if time difference > 5 minutes */
run;
Longfish
  • 7,582
  • 13
  • 19
  • There are some inconsistencies in the data which is why i want to give it the five minute window. I did not mention it earlier, because i thought it would be a simple by statement, but i want this done for each unique ID, and as is sessions never restarts at 1 for each id. SHould I edit and repost – user2448666 Sep 13 '13 at 18:39