I have a data set with 1100 samples and a target class isReturn. There are:

800 isReturn='True'

300 isReturn='False'

How can I use PROC SURVEYSELECT to oversample the 300 isReturn='False' rows so that I have 800 isReturn='False' rows and the data set is balanced?

Thanks in advance.

Kevin
  • You have 800/200 and want a result of 800/800? Basically every row isReturn=FALSE in there four times? Or are you trying to set things up so you can bootstrap/etc. and want to be able to do so and weight the 'false' rows up so each of true/false has equal probability? – Joe May 20 '14 at 19:32
  • @Joe, it's just an example, not about picking the 200 exactly 4 times. For down-sampling, I can just specify the SIZE (less than the sample I have, let's say 150) in the `PROC SURVEYSELECT` but I am just wondering if there is a way to do up-sampling, without adding any cost (weight) to different classes. – Kevin May 20 '14 at 20:32
  • I'm just trying to figure out why you wouldn't just use the data step. – Joe May 20 '14 at 20:50
  • @Joe, because I thought such expensive software would provide an oversampling approach like the ones most free open-source packages provide. Would you mind advising on the best way to achieve this with the data step? Thanks. – Kevin May 20 '14 at 21:20
  • I may be misunderstanding your terminology. I use 'sampling' in its various forms to mean pulling a smaller sample, i.e., taking a population of 10,000 and pulling 800. That is easy to do in SAS with surveyselect. I read your question as taking a census sample and actually adding more records to increase the size of the smaller class. Perhaps you need to explain in more detail, as your question isn't very thorough. – Joe May 20 '14 at 21:45
  • How do you want to increase those 300, in any event? What sampling method would you use? PPS with replacement? Do you want to guarantee all 300 are duplicated once, plus another 200 picked from those 300, or are you okay with (in theory) one record being there six times and some records not there? – Joe May 20 '14 at 21:49

1 Answer

I may not understand what you want, but if you just want to have 800 of the false folks, you could use proc surveyselect or the data step.

The data step would give you granular control. The step below outputs each of your 300 False records twice, then outputs a randomly chosen 200 of them a third time (each record gets 0 or 1 extra copies), for 800 False records in total.

data have;
length isReturn $5;
do _n_=1 to 800;
  isReturn='True';
  output;
  if _n_ le 300 then do;
    isReturn='False';
    output;
  end;
end;
run;

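data want;
set have;
retain k 200 n 300;  /* k = third copies still to assign, n = False rows not yet seen */
if isReturn='True' then output;
else do;
  output;  /* first copy */
  output;  /* second copy */
  /* selection sampling: picks exactly k of the n False rows */
  if ranuni(7) le k/n then do;
    output;  /* third copy */
    k+-1;
  end;
  n+-1;
end;
run;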

You could tweak that pretty easily to get any distribution you want. For example, you could take 500 out of '600' (the 300 doubled) by setting k and n to 500 and 600 and running the if block twice per record, decrementing n once each time.
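As a rough sketch of that tweak, the step below assumes each False record is kept once and then gets two independent chances at an extra copy, so 500 extra copies are spread over 600 candidate slots. The data set name want_alt and the loop variable i are made up for illustration; the seed 7 is carried over from the step above.

```sas
data want_alt;
  set have;
  retain k 500 n 600;  /* k = extra copies still needed, n = candidate slots left */
  output;              /* assumption: every original row is kept once */
  if isReturn='False' then do i=1 to 2;
    /* same selection-sampling test as before, run once per slot */
    if ranuni(7) le k/n then do;
      output;
      k+-1;
    end;
    n+-1;
  end;
  drop i k n;          /* helper variables are not needed in the result */
run;
```

With this version each False record ends up in the output one, two, or three times, and the False total is still 300 + 500 = 800.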

You could also use proc surveyselect to do this.

proc surveyselect data=have(where=(isReturn='False')) out=want_add method=urs n=500 outhits;
run;

That would give you an extra 500 records, chosen at random with replacement; just add those back to the original dataset. You don't have as granular control but it is very easy to code.
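A minimal sketch of the "add those back" step, assuming want_add was created by the call above. The output name want_urs is hypothetical. Any bookkeeping variables that surveyselect adds to want_add will simply be missing on the rows that come from have.

```sas
/* stack the original rows and the 500 resampled False rows */
data want_urs;
  set have want_add;
run;

/* check the class counts: 800 True and 300 + 500 = 800 False */
proc freq data=want_urs;
  tables isReturn;
run;
```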

Alternatively, you could do this in one step. However, because both classes are then sampled with replacement, there is no guarantee that any particular original record, True or False, appears in the result. So this likely doesn't do exactly what you ask for; it is presented for completeness.

data sizes;
input isReturn :$5. _NSIZE_;
datalines;
False 800
True 800
;;;;
run;
proc sort data=have;
by isReturn;
run;
proc surveyselect data=have out=want method=urs n=sizes outhits;
strata isReturn;
run;

All of this assumes you're trying to get 100% of the original dataset plus some. If instead you mean oversampling in the sense of picking False records with the same probability as True records, while ultimately drawing a sample smaller than the total (and picking each record only once, i.e., without replacement), then the strata statement is what you should be looking at.
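For that second case, a minimal sketch might look like the following. The per-class size of 150, the seed value, and the names have_sorted and sample_balanced are all assumptions chosen for illustration.

```sas
/* the strata statement expects the input sorted by the strata variable */
proc sort data=have out=have_sorted;
  by isReturn;
run;

/* simple random sampling without replacement, 150 rows from each class */
proc surveyselect data=have_sorted out=sample_balanced
                  method=srs n=150 seed=12345;
  strata isReturn;
run;
```

Because n=150 applies to each stratum, the result has 300 rows split evenly between True and False, and no record appears more than once.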

Joe