0

I have hundreds of thousands of rows of data (like the example below) which consists of account number and the dates of calls that were placed. I need to find all accounts where there were 3 or more calls placed within a 7 day period. So, for the data below it would return the rows for accounts 1111111 and 3333333.

account_number  call_date
1111111          1/1/2010
1111111          1/3/2010
1111111          1/3/2010
1111111          1/5/2010
2222222          1/10/2010
2222222          1/20/2010
3333333          1/7/2010
3333333          1/7/2010
3333333          1/7/2010

There's probably a simple solution but for some reason I'm drawing a blank. Any help would be appreciated.

UncleCross
  • 33
  • 3

4 Answers4

1

Here is an alternative answer using a data step and the lag function and a by group.

proc sort data=have;
by acc_nbr call_date;
run;

data want(drop=count);
set have;
by acc_nbr call_date;
d2 = lag2(call_date);
format d2 date9.;
if first.acc_nbr then count = 1;
if count < 3 then d2 = .;
count + 1;
diff = call_date - d2;
if diff <= 7 and diff ne . then output;
run;

All this is doing is using the lag2 function, which returns the value of call_date from two records before the current record. Next, we say if it's the first account number then set count = 1. Next, we set the value d2 to missing if the count is less than 3. This ensures that we don't pick up a date value from a separate by group. Last, we just check to make sure the difference between call_date and d2 is less than or equal to 7 and output that record.

This will output duplicate account numbers if there are multiple cases where a customer made over 3 calls in a 7 day period more than once so if you want unique values you can use proc sql, select distinct or proc sort nodupkey.

J_Lard
  • 1,083
  • 6
  • 18
1

Interesting problem. Below is what I think I would do:

Input data. For each call_date, calculate the date 7 days before.

data have;
    input account_number : $char20.
          call_date : mmddyy10.;
    date_7_days_before = intnx('DAY',call_date,-7);

datalines;
1111111 1/1/2010
1111111 1/3/2010
1111111 1/3/2010
1111111 1/5/2010
2222222 1/10/2010
2222222 1/20/2010
3333333 1/7/2010
3333333 1/7/2010
3333333 1/7/2010
;
run;

Consolidate calls Intra-day. Sum number of calls by day per account.

proc sql;
    create table consolidate_calls as
    select account_number,
           call_date,
           date_7_days_before,
           count(*) as calls
   from have
   group by account_number,
            call_date,
            date_7_days_before;
quit;

Self join consolidated_calls. If the b.call_date is within the date range of a.date_7_days_before and a.call_date then group together.

proc sql;
    create table want as
    select a.account_number,
           a.call_date format=mmddyy10.,
           a.calls as calls_that_day,
           sum(b.calls) as calls_last_7_days /*Number of calls on, or within 7 days of, the call_date*/
    from consolidate_calls as a
    left join consolidate_calls as b
    on a.account_number = b.account_number
    and b.call_date <= a.call_date 
    and b.call_date >= a.date_7_days_before
    group by a.account_number,
             a.call_date,
             a.calls
    ;
quit;

The self-join methodology is a little hard to explain for me, but I believe this should work.

Hugs
  • 543
  • 2
  • 8
0

Try this, it looks like the solution generates results properly.

Preparing Data

data a;
    input account_number call_date :mmddyy.;
    format call_date ddmmyy10.;
datalines;
1 1/1/2010
1 1/3/2010
1 1/3/2010
1 1/5/2010
2 1/10/2010
2 1/20/2010
3 1/7/2010
3 1/7/2010
3 1/7/2010
;
run;

Solution

proc sql;
    select distinct a1.account_number
    from a a1, a a2, a a3
    where a1.account_number = a2.account_number 
        and a1.account_number = a3.account_number
        and a3.call_date > a2.call_date 
        and a2.call_date > a1.call_date 
        and a3.call_date < a1.call_date + 7;
quit;
Robert Soszyński
  • 1,178
  • 1
  • 7
  • 12
0

Similar to @J_Lard's approach, you can use the DIF function to compute the difference between a variable and its lag. This simplifies the code a bit. If account number is numeric, you can find the records of interest by testing dif2(id)=0 and dif2(calldate)<=7 That checks that the ID is the same as two records before and that calldate is within 7 days. If account number is not numeric, could use id=lag2(id) and dif2(calldate)<=7 . Of course this assumes data have already been sorted.

Below code adds a _DONE Boolean, used to track whether an ID has already been output.

data want (keep=id);
  set have;
  by id;
  retain _done; *boolean to flag when an id has already been output;
  if first.id then _done=0;

  if _done=0 and dif2(id)=0 and dif2(calldate) <= 7 then do;
    output;
    _done=1;
  end;
run;
Quentin
  • 5,960
  • 1
  • 13
  • 21