2

I'm working with SAS and I have a data frame like this:

Table1:

+------+------------+-----------+--------+
| name |    date    |    time   |  price |
+------+------------+-----------+--------+
| A    |  7-May-08  |  11:12:41 |      1 |
| A    |  11-Jul-08 |  11:23:41 |      2 |
| A    |  3-Jan-09  |  11:31:41 |      1 |
| A    |  4-Jan-09  |  11:32:41 |      2 |
| A    |  4-Jan-09  |  11:32:41 |      2 |
| A    |  8-Jul-09  |  11:32:41 |      1 |
| A    |  8-Jul-09  |  11:32:41 |      2 |
| A    |  24-Jul-09 |  11:32:41 |      3 |
| A    | 24-Jul-09  |  11:32:41 |      4 |
| A    |  8-Dec-09  |  12:32:41 |      1 |
| B    |  7-May-08  |  11:31:41 |      2 |
| B    |  10-May-08 |  11:32:41 |      3 |
| B    |  17-May-08 |  11:33:41 |      4 |
| B    |  24-May-08 |  11:34:41 |      1 |
| B    |  1-Jun-08  |  11:35:41 |      5 |
| B    |  18-Jun-08 |  11:36:41 |      1 |
| B    |  9-May-09  |  11:37:41 |      3 |
| C    |  7-Oct-09  |  11:21:41 |      3 |
| C    |  17-Oct-09 |  11:22:41 |      2 |
| C    |  25-Oct-09 |  11:32:41 |      1 |
| C    |  18-Nov-09 |  11:33:41 |      3 |
| C    |  4-Dec-09  |  11:12:41 |      4 |
| C    |  19-Dec-09 |  10:22:41 |      1 |
| C    |  9-May-10  |  11:42:41 |      3 |
| C    |  9-May-10  |  11:12:41 |      1 |
| C    |  10-May-10 |  12:52:41 |      2 |
+------+------------+-----------+--------+

I have another data frame like this:

table2:

+------+-----------+
| name |    date   |
+------+-----------+
| A    | 11-Jul-08 |
| A    | 3-Jan-09  |
| A    | 24-Jul-09 |
| B    | 7-May-08  |
| B    | 17-May-08 |
| B    | 18-Jun-08 |
| B    | 9-Jul-09  |
| C    | 17-Oct-09 |
| C    | 4-Dec-09  |
| C    | 19-Dec-09 |
+------+-----------+

Now I want to do two main operations:

1- If date and name in table2 is in table1 then delete that row in table1;

2- If the previous step happened then delete the next row of that name and date and if name and date of the next row is repeated for other next rows then delete all of them.

For example table1 at last should be like this:

+------+-----------+----------+-------+
| name |   date    |   time   | price |
+------+-----------+----------+-------+
| A    | 7-May-08  | 11:12:41 |     1 |
| A    | 8-Jul-09  | 11:32:41 |     1 |
| A    | 8-Jul-09  | 11:32:41 |     1 |
| B    | 1-Jun-08  | 11:35:41 |     5 |
| C    | 7-Oct-09  | 11:21:41 |     3 |
| C    | 18-Nov-09 | 11:33:41 |     3 |
| C    | 10-May-10 | 12:52:41 |     2 |
+------+-----------+----------+-------+

Here is a code which is not perfect for this operation because of two reasons:

1- Using the nodupkey option delete all duplicate observations from table1 which is not necessary. Because delete them is happened when conditions, which is described above, is satisfied.

2- And the "(inb=0 and lag(inb)=1 and not first.name)" statement delete just one next rows and the other next rows whit same name and date still be in table1.

proc sort data=table1 out=tablea1 nodupkey;
by name date;
run;
proc sort data=table2 out=tableb1 nodupkey;
by name date;
run;
data want;
 merge  tablea1 tableb1(in=inb) ;
 by name date;
 if inb or (inb=0 and lag(inb)=1 and not first.name) then delete;
run;

Thanks in advance.

ary
  • 151
  • 1
  • 2
  • 14
  • Can you explain the rule in another way? Why did you eliminate the last observation for name='A'? What do you mean by "next row"? Do you mean rows in TABLE1 that have dates that are equal to the day AFTER the date from table2? – Tom Nov 28 '17 at 14:19

3 Answers3

1

Armin:

In a complex merge and process operation you will need some additional variables for maintaining the state of your business rules. The case of deleting the next row of a match and duplicates thereof requires tracking of the next name and date.

For example:

data have;
  input @;
  if _infile_ ne: '+';
  attrib 
    name length=$10
    date length=4 informat=date9. format=date11. 
    time length=4 informat=time8. format=time8. 
    price length=8
  ;
  infile cards dlm='|' firstobs=4;
  input @1 name date time price;
datalines;
+------+------------+-----------+--------+
| name |    date    |    time   |  price |
+------+------------+-----------+--------+
| A    |  7-May-08  |  11:12:41 |      1 |
| A    |  11-Jul-08 |  11:23:41 |      2 |
| A    |  3-Jan-09  |  11:31:41 |      1 |
| A    |  4-Jan-09  |  11:32:41 |      2 |
| A    |  4-Jan-09  |  11:32:41 |      2 |
| A    |  8-Jul-09  |  11:32:41 |      1 |
| A    |  8-Jul-09  |  11:32:41 |      2 |
| A    |  24-Jul-09 |  11:32:41 |      3 |
| A    | 24-Jul-09  |  11:32:41 |      4 |
| A    |  8-Dec-09  |  12:32:41 |      1 |
| B    |  7-May-08  |  11:31:41 |      2 |
| B    |  10-May-08 |  11:32:41 |      3 |
| B    |  17-May-08 |  11:33:41 |      4 |
| B    |  24-May-08 |  11:34:41 |      1 |
| B    |  1-Jun-08  |  11:35:41 |      5 |
| B    |  18-Jun-08 |  11:36:41 |      1 |
| B    |  9-May-09  |  11:37:41 |      3 |
| C    |  7-Oct-09  |  11:21:41 |      3 |
| C    |  17-Oct-09 |  11:22:41 |      2 |
| C    |  25-Oct-09 |  11:32:41 |      1 |
| C    |  18-Nov-09 |  11:33:41 |      3 |
| C    |  4-Dec-09  |  11:12:41 |      4 |
| C    |  19-Dec-09 |  10:22:41 |      1 |
| C    |  9-May-10  |  11:42:41 |      3 |
| C    |  9-May-10  |  11:12:41 |      1 |
| C    |  10-May-10 |  12:52:41 |      2 |
+------+------------+-----------+--------+
;
data filter;
  input @;
  if _infile_ ne: '+';
  attrib 
    name length=$10
    date length=4 informat=date9. format=date11.
  ;
  infile cards dlm='|' firstobs=4;
  input @1 name date;
datalines;
+------+-----------+
| name |    date   |
+------+-----------+
| A    | 11-Jul-08 |
| A    | 3-Jan-09  |
| A    | 24-Jul-09 |
| B    | 7-May-08  |
| B    | 17-May-08 |
| B    | 18-Jun-08 |
| B    | 9-Jul-09  |
| C    | 17-Oct-09 |
| C    | 4-Dec-09  |
| C    | 19-Dec-09 |
+------+-----------+
;
run;

data want (keep=name date time price);
  merge have(in=_have) filter(in=_filter);
  by name date;

  length match_at_n 4 next_name $10 next_date 4;
  retain match_at_n next_name next_date;

  if first.name then /* prevent delete next from sloshing into next group */
    match_at_n = -1;

  if _have and _filter then do;
    match_at_n = _n_;
    delete;
  end;

  if _filter then 
    delete;

  * condition here is _have and _not filter;

  if _n_ = match_at_n + 1 then do;
    next_name = name;
    next_date = date;
    delete;
  end;

  if name = next_name and date = next_date then
    delete;
run;

Suppose the same outcome could be achieved with a single complex compound if statement that involved a variety of lags, flags and sums -- regardless, I would favor clarity over cleverness.

Richard
  • 25,390
  • 3
  • 25
  • 38
0

Based on Ksharp's code in SAS community:

data temp;
 set table2(in=inb) table1;
 by name date;
 group+first.date;
 _inb=inb;
run;
data key;
 set temp(where=(_inb=1));
 output;
 group=group+1;
 output;
 keep name group;
run;
proc sql;
create table want as
select name, date, time, price
 from temp 
  where catx(' ',name,group) not in
   (select catx(' ',name,group) from key);
quit;
Amin Karimi
  • 407
  • 2
  • 16
-1
data want;
    merge table1 table2(in=inb);
    by name date;
    retain _date num;
    if first.name then call missing(_date,num);
    if inb then do;
       num=_n_;
       delete;
    end;
    else if _n_-num=1 then do;
       _date=date;
       delete;
    end;
    else if _date=date then delete;
    drop _date num;
run;
Shenglin Chen
  • 4,504
  • 11
  • 11
  • Hi; while this code may answer the question, on this site we prefer answers that explain (in plain english) what the answer is, even more so than actual code. A good answer will likely have both, but it should at least have a textual explanation. – Joe Nov 28 '17 at 15:36