1

I have a dataset that contains 250 variables. I think some rows may be exact duplicates. If I only had 3 variables, I could run this code to check for dupes:

proc sql;
  create table checkDupe as
  select count(*) as N, *
  from bigTable
  group by 1, 2, 3
  having N > 1;
quit;

However, with 250 variables, I don't want to type out group by 1, 2, 3, ... , 250.

The following group by statements don't work:

group by *

group by _ALL_

group by 1:250

Is there a concise way to group by all variables?

hossibley
  • 253
  • 5
  • 11

3 Answers3

3

Try to produce macro variable for column names of dataset, then group in sql;

proc sql;
   select name into: name separate ',' from dictionary.columns where libname='SASHELP' and memname='CLASS';
quit;

proc sql;
   select count(*) as N,* from sashelp.class group by &name;
quit;

If just want to know whether there are complete duplicates in dataset, try this:

proc sort data=sashelp.class out=want nodup;
by _all_;
run;
Shenglin Chen
  • 4,504
  • 11
  • 11
1

Your sample code checking for duplicates returns a dataset that has all of the records where the key is NOT unique. So the below returns four records:

data bigtable;
   input a b c d;
   datalines;
1 1 1 1
1 1 1 1
1 2 1 1
1 2 2 1
1 2 2 2
;
run;

proc sql;
  create table checkDupe1 as
  select *,count(*) as N
  from bigTable
  group by 1, 2, 3
  having N > 1;
quit;

Assuming you don't need the N column, they easiest way in SAS to get a dataset of all non-unique rows is to use PROC SORT with the NOUNIQUEKEY option. So below will return the same four records:

proc sort data=bigtable out=checkDupe2 nouniquekey ;
  by a b c ;
run ;

And the BY statement can be BY _ALL_, which will return the two records that are duplicates on all variables:

proc sort data=bigtable out=checkDupe3 nouniquekey ;
  by _all_ ;
run ;
Quentin
  • 5,960
  • 1
  • 13
  • 21
0

If the desired output is each unique row from a SAS data set, no grouping is required. The easiest way to execute this in PROC SQL is with the distinct keyword.

For example:

data one;
   input a b c d;
   datalines;
   1 1 1 1
   1 1 1 1
   1 2 1 1
   1 2 3 4
   1 2 1 1
   ;
   run;
proc sql;
   select distinct * from one;
   run;

...produces the following output, where the two duplicate rows are eliminated.

enter image description here

Len Greski
  • 10,505
  • 2
  • 22
  • 33
  • I don't think that's the goal. This doesn't identify duplicates. – Quentin May 12 '18 at 22:54
  • @Quentin - it was difficult to tell from the OP what the desired output was. Either s/he wants to either eliminate the duplicates or know the duplicates. One answer already showed how to find the duplicates, so I posted an answer about eliminating them, starting my answer with "If the desired output is each unique row from a SAS data set..." – Len Greski May 12 '18 at 23:32
  • @LenGreski @Quentin Hey folks. Main goal was to identify which rows were exact duplicates. Having the count(*) variable was also useful for knowing exactly how many times a row was duplicated. Shenglin's suggestion of assigning variable names to a macro variable works, though I'm a little surprised there isn't a way to use `group by _all_` or something similar like one can do in PROC SORT. Thanks. – hossibley May 14 '18 at 20:18
  • @hossibley - thanks for the feedback. There's no `_all_` in PROC SQL because the language is based on Structured Query Language, not the base SAS language. PROC SQL was originally designed to pass SQL statements to a relational database via SAS/ACCESS, and the remote relational databases do not know how to process `_all_`. – Len Greski May 19 '18 at 12:26