0

Probably a simple question. I have a simple dataset with scheduled payment dates in it.

DATA INFORM2;
 INFORMAT previous_pmt_date scheduled_pmt_date MMDDYY10.;
 INPUT previous_pmt_date scheduled_pmt_date;
 FORMAT previous_pmt_date scheduled_pmt_date MMDDYYS10.;
DATALINES;
11/16/2015 12/16/2015
12/17/2015 01/16/2016
01/17/2016 02/16/2016

; 

What I'm trying to do is to create a binary latest row indicator. For example, If I wanted to know the latest row as of 1/31/2016 I'd want row 2 to be flagged as the latest row. What I had been doing before is finding out where 1/31/2016 is between the previous_pmt_date and the scheduled_pmt_date, but that isn't correct for my purposes. I'd like to do this in an data step as opposed to SQL subqueries. Any ideas?

Want:

previous_pmt_date scheduled_pmt_date latest_row_ind
11/16/2015        12/16/2015         0
12/17/2015        01/16/2016         1
01/17/2016        02/16/2016         0
davids12
  • 323
  • 5
  • 18
  • Any particular reason for the data step preference? I ask because SQL would be easier/shorter.... – Robert Penridge Feb 19 '16 at 20:53
  • I already have a larger data step that I'd like to integrate this into, but if you have some SQL suggestions I'd be happy to use those instead. – davids12 Feb 19 '16 at 20:55
  • Well, I guess there are some datastep solutions that may fit the bill... can we assume that the data is sorted by scheduled_pmt_date in ascending order? And that there is never overlapping date ranges for the given account (you aren't showing account but I'm assuming there is one). – Robert Penridge Feb 19 '16 at 20:58
  • Yes, that's correct. Sorted by ascending account and scheduled_pmt_date. No overlapping ranges for a given account. – davids12 Feb 19 '16 at 21:02
  • Does it has to be done in a single datastep. Or can another datastep can be used? And do I understand correctly date intervals [previous_pmt_date; scheduled_pmt_date] does not intersect. For given example yes, but for all the data? – K.I. Feb 19 '16 at 21:12
  • It could be multiple steps, that's fine. Yes you do understand it correctly -- there is no intersection between previous_pmt_date and scheduled_pmt_date for any single account. – davids12 Feb 19 '16 at 21:19

1 Answers1

1

Here's a solution that does it all in the single existing datastep without any additional sorting. First I'm going to modify your data slightly to include account as the solution really should take that into account as well:

DATA INFORM2;
 INFORMAT previous_pmt_date scheduled_pmt_date MMDDYY10.;
 INPUT account previous_pmt_date scheduled_pmt_date;
 FORMAT previous_pmt_date scheduled_pmt_date MMDDYYS10.;
DATALINES;
1 11/16/2015 12/16/2015
1 12/17/2015 01/16/2016
1 01/17/2016 02/16/2016
2 11/16/2015 12/16/2015
2 12/17/2015 01/16/2016
2 01/17/2016 02/16/2016

; 
run;

Specify a cutoff date:

%let cutoff_date = %sysfunc(mdy(1,31,2016));

This solution uses the approach from this question to save the variables in the next row of data, into the current row. You can drop the vars at the end if desired (I've commented out for the purposes of testing).

data want;
  set inform2 end=eof;
  by account scheduled_pmt_date; 

  recno = _n_ + 1;

  if not eof then do;
    set inform2 (keep=account previous_pmt_date scheduled_pmt_date
                 rename=(account            = next_account 
                         previous_pmt_date  = next_previous_pmt_date
                         scheduled_pmt_date = next_scheduled_pmt_date)
                ) point=recno;
  end;
  else do;
    call missing(next_account, next_previous_pmt_date, next_scheduled_pmt_date);
  end;

  select;
    when ( next_account eq account and next_scheduled_pmt_date gt &cutoff_date ) flag='a';
    when ( next_account ne account ) flag='b';
    otherwise flag = 'z';
  end;

  *drop next:;

run;

This approach works by using the current observation in the dataset (obtained via _n_) and adding 1 to it to get the next observation. We then use a second set statement with the point= option to load in that next observation and rename the variables at the same time so that they don't overwrite the current variables.

We then use some logic to flag the necessary records. I'm not 100% of the logic you require for your purposes, so I've provided some sample logic and used different flags to show which logic is being triggered.

Some notes...

The by statement isn't strictly necessary but I'm including it to (a) ensure that the data is sorted correctly, and (b) help future readers understand the intent of the datastep as some of the logic requires this sort order.

The call missing statement is simply there to clean up the log. SAS doesn't like it when you have variables that don't get assigned values, and this will happen on the very last observation so this is why we include this. Comment it out to see what happens.

The end=eof syntax basically creates a temporary variable called eof that has a value of 1 when we get to the last observation on that set statement. We simply use this to determine if we're at the last row or not.

Finally but very importantly, be sure to make sure you are keeping only the variables required when you load in the second dataset otherwise you will overwrite existing vars in the original data.

Community
  • 1
  • 1
Robert Penridge
  • 8,424
  • 2
  • 34
  • 55
  • I'd also suggest playing with the test data and the case statement to make sure you have all of your edge cases covered. – Robert Penridge Feb 19 '16 at 21:35
  • I don't think your approach is fully functioning. For example, if we add another observation to each account your approach is flagging the first occurance of account as "z", then the next 2 observations as "a", and finally the last observation as "b". The correct flag I want in this case occurs in row 2 for each account. – davids12 Feb 19 '16 at 21:54
  • @davids12 Yes I figured there would be additional complexities or mistakes in the case statement but I'll leave the logic to figure these out up to you (I can only guess at how to handle them). However, the basic structure of the solution is in place - you just need to update `case` statement logic to meet your needs. – Robert Penridge Feb 19 '16 at 22:03
  • @davids12 On a side-note, if there's an obvious error in my case logic based on your original question requirements feel free to edit/fix it. – Robert Penridge Feb 19 '16 at 22:07