0

I have a large dataset where I want to only keep instances where one variable (frompc) can be matched somewhere within the dataset to another variable (topc) and vice versa where topc can be matched within the dataset to frompc. It is very tedious to do this in excel so i am lookng for a way it can done in SAS, possibly using Proc SQL or merge.

|   frompc       |    topc      |
+----------------+--------------+
| F19913344567   | M19800674560 |
| F19232342208   | F19451367689 |
| M20011105436   | F19232342208 |
| F20220054321   | M19762152621 |

In this case i would like rows 2 and 3 kept but rows 1 and 4 removed.

I have tried different forms of SQL but keep getting results that include records that don't match or even excluding records that do match.

Richard
  • 25,390
  • 3
  • 25
  • 38

2 Answers2

0

Try this

data have;
input (frompc topc)(:$12.);
datalines;
F19913344567 M19800674560 
F19232342208 F19451367689 
M20011105436 F19232342208 
F20220054321 M19762152621 
;

data want;

   if _N_ = 1 then do;
      dcl hash h1(dataset : 'have');
      h1.definekey('frompc');
      h1.definedone();

      dcl hash h2(dataset : 'have');
      h2.definekey('topc');
      h2.definedone();
   end;

   set have;

   if h1.check(key : topc) = 0 | h2.check(key : frompc) = 0;

run;
PeterClemmensen
  • 4,018
  • 3
  • 14
  • 22
0

One SQL query would be thus:

data have;
  call streaminit(20230302);
  do _n_ = 1 to 1000;
    one = rand('integer',1e5);
    two = rand('integer',1e5);
    output;
  end;
run;

proc sql;
  create table want as
  select one, two
  from have 
  where 
     one in (select two from have)
  or two in (select one from have)
;

If you can understand why an or is coded instead of an and you will have a better understanding of your problem.

Richard
  • 25,390
  • 3
  • 25
  • 38