1

I wish to drop the columns in a SAS dataset which has a sum less than a particular value. Consider the case below.

Column_A Pred_1 Pred_2 Pred_3 Pred_4 Pred_5
A             1      1      0      1      0
A             0      1      0      1      0
A             0      1      0      1      0
A             0      1      0      1      1
A             0      1      0      0      1

Let us assume that our threshold is 4, so I wish to drop predictors having sum of active observations less than 4, so the output would look like

Column_A  Pred_2  Pred_4 
A             1       1
A             1       1
A             1       1
A             1       1
A             1       0

Currently I am using a very inefficient method of using multiple transposes to drop the predictors. There are multiple datasets with records > 30,000 so transpose approach is taking time. Would appreciate if anyone has a more efficient solution!

Thanks!

2 Answers2

2

Seems like you could do:

  1. Run PROC MEANS or similar proc to get the sums
  2. Create a macro variable that contains all variables in the dataset with sum < threshhold
  3. Drop those variables

Then no TRANSPOSE or whatever, just regular plain old summarization and drops. Note you should use ODS OUTPUT not the OUT= in PROC MEANS, or else you will have to PROC TRANSPOSE the normal PROC MEANS OUT= dataset.

An example using a trivial dataset:

data have;
  array x[20];
  do _n_ = 1 to 20;
    do _i = 1 to dim(x);
      x[_i] = rand('Uniform') < 0.2;
    end;
    output;
  end;
run;

ods output summary=have_sums;  *how we get our output;
ods html select none;          *stop it from going to results window;
proc means data=have stackodsoutput sum;  *stackodsoutput is 9.3+ I believe;
  var x1-x20;
run;
ods html select all;           *reenable normal html output;

%let threshhold=4;             *your threshhold value;

proc sql;
  select variable
    into :droplist_threshhold separated by ' '
    from have_sums
    where sum lt &threshhold;  *checking if sum is over threshhold;
quit;

data want;
  set have;
  drop &droplist_threshhold.;  *and now, drop them!;
run;
Joe
  • 62,789
  • 6
  • 49
  • 67
  • Thanks for this comment. I ran this on EG and I am getting this error when I run the proc means. ERROR: The HTML destination is not active; no select/exclude lists are available. Thoughts? – Vaishak N Chandran Dec 06 '16 at 22:55
  • You can remove the `oss html – Joe Dec 06 '16 at 22:57
  • You can remove the `ODS HTML` lines. They are only for reducing the output to the screen. – Joe Dec 06 '16 at 22:57
1

Just use PROC SUMMARY to get the sums. You can then use a data step to generate the list of variable names to drop.

%let threshhold=4;
%let varlist= pred_1 - pred_5;

proc summary data=have ;
  var &varlist ;
  output out=sum sum= ;
run;

data _null_;
  set sum ;
  array x &varlist ;
  length droplist $500 ;
  do i=1 to dim(x);
    if x(i) < &threshhold then droplist=catx(' ',droplist,vname(x(i)));
  end;
  call symputx('droplist',droplist);
run;

You can then use the macro variable to generate a DROP statement or a DROP= dataset option.

drop &droplist;
Tom
  • 47,574
  • 2
  • 16
  • 29