1

I've got two tables with similar structure:
- First table: id and col1,col2,col3 - all numerics.
- Second table: id and col4,col5,col6 - all numerics.

I want to remove from the first one all rows which are similar to any of the rows from the second tagble. I consider a row to be similiar to other row when any column from the group col1-col3 is equal to any of the columns from the group col4-col6. Now I'm doing it in 9 consecutive data steps (first checks whether col1=col4, second col1=col5 , ..., ninth col3=col6), which probably is not the optimal solution.

Any ideas how to improve this?

Tomek Tarczynski
  • 2,785
  • 8
  • 37
  • 43

2 Answers2

2

This is my solution:

data vec1;
  set ds2;
  array cvar{*} col4 col5 col6;
  do ijk=1 to dim(cvar);
    compvar=cvar(ijk);
    output;
  end;
run;

proc sql noprint;
  select distinct compvar into :cvars separated by ' '
  from vec1;
quit;
%let numcvar=&sqlobs;

data ds1(drop=i);
  set ds1;
  array myvar(i) col:;
  do over myvar;
    if myvar in (&cvars.) then delete;
  end;
run;

If you run into trouble with the length of the CVARS macro variable you could use this instead:

data vec1;
  set ds2;
  array cvar{*} col:;
  do ijk=1 to dim(cvar);
    compvar=cvar(ijk);
    output;
  end;
run;

proc sort data=vec1 out=vec2(keep=compvar) nodupkey;
  by compvar;
run;

proc transpose data=vec2 out=flat prefix=x;
run;

data ds1(keep=id col:);
  set ds1b;
  if _n_=1 then set flat;
  array myvar(i) col:;
  array xvar(j) x:;
  do over myvar;
    do over xvar;
      if myvar=xvar then delete;
    end;
  end;
run;

The PROC SORT could be eliminated but it makes it more efficient for big data sets.

Or you could generate a format on the fly:

data vec1;
  set ds2;
  array cvar{*} col4 col5 col6;
  do ijk=1 to dim(cvar);
    compvar=cvar(ijk);
    output;
  end;
run;

proc sort data=vec1 out=vec2 nodupkey;
  by compvar;
run;

data fmt1;
  set vec2;
  length start $20;
  fmtname="remobs";
  start=compress(put(compvar,best.));
  label="remove";
run;

proc format lib=work cntlin=fmt1;
run;

data ds1(drop=i);
  set ds1;
  array myvar(i) col:;
  do over myvar;
    if put(myvar,remobs.)="remove" then delete;
  end;
run;

I suspect this last method would be faster than the two preceding solutions.

UPDATE

Using hash objects

data vec1;
  set ds2;
  array cvar{*} col4 col5 col6;
  do ijk=1 to dim(cvar);
    compvar=cvar(ijk);
    output;
  end;
run;

proc sort data=vec1 out=vec2 nodupkey;
  by compvar;
run;

data ds1_new(keep=id col1 col2 col3);
  if _n_ = 0 then set work.vec2;
  declare hash myhash(DATASET:'work.vec2') ; 
  rc=myhash.defineKey('compvar'); 
  myhash.defineDone();
  set ds1;
  array rcarr{*} rc1-rc3;
  array lookup{*} col1 col2 col3;
  do i=1 to dim(lookup);
    rcarr(i)=myhash.find(key: lookup(i));
    if rcarr(i)=0 then delete;
  end;
run;
DavB
  • 1,676
  • 2
  • 14
  • 16
  • To be honest I was a bit confused when I read your last method for the first time. I've never seen such use of formats, but SAS is sometimes so non-intuitive that I wouldn't be suprised if that was really the fastest method. Anyway +1 and if there won't be any better solution in a few days then I will accept your answer. Thanks! – Tomek Tarczynski Mar 21 '12 at 13:32
  • Wouldn't hash tables be faster than your solution? – Tomek Tarczynski Mar 23 '12 at 10:11
  • @TomekTarczynski - I'm not sure. I've not actually used them before now...so thanks for the use case! I added a hash object solution (which seems to work for me). You might report back on your findings regarding the speed of different methods. – DavB Mar 23 '12 at 13:54
1

ok, 2nd attempt to answer this. I've created a cartesian join of the 2 datasets in order to match every row in table 1 with every row in table 2. You can then use the arrays to find out which rows have repeat values.

data ds1;
input id col1 col2 col3;
cards;
1   10  20  30
2   40  50  60
3   70  80  90
4   15  25  35
5   45  55  65
;
run;

data ds2;
input id col4 col5 col6;
cards;
10  100 200 300
12  60  50  600
13  700 800 70
16  15  20  300
;
run;

proc sql;
create view all_cols as select
ds1.id as id1, ds2.id as id2,* from ds1,ds2;
quit;

data match;
set all_cols (keep=id1 id2 col:);
array vars1{*} col1-col3;
array vars2{*} col4-col6;
do i=1 to dim(vars1);
do j=1 to dim(vars2);
    if vars1{i}=vars2{j} then do;
    output;
    return;
    end;
end;
end;
drop i j;
run;

proc sort data=match;
by id1;
run;

data ds1;
modify ds1 match (in=b keep=id1 rename=(id1=id));
by id;
if b then remove;
run;
Longfish
  • 7,582
  • 13
  • 19
  • Maybe I wasn't precise enought, but there might be the case that a row with id=1 from the first table is similar to a record with id=145 from the second table. The ID column in one table might have totally different values than the ID column in the second table. It is just some additional info to be able to identify which record is similar to which. – Tomek Tarczynski Mar 20 '12 at 11:43
  • Apologies Tomek, I didn't read your question properly. I've revised my answer which hopefully does what you want now. – Longfish Mar 20 '12 at 12:23
  • This works, but if I'm not mistaken it is extremely slow. If the first table consists of 10000 rows and so do second one then view all_cols will consist of 10^8 rows. – Tomek Tarczynski Mar 20 '12 at 13:49