Hopefully a simple answer. I'm doing a simulation study, where I need to sample a random number of individuals, N, from a uniform distribution, U(25,200), at each of a thousand or so replications. Code for one replication is shown below:

%LET U = RAND("UNIFORM");
%LET N = ROUND(25 + (200 - 25)*&U.);

I created both of these macro variables outside of a DATA step because I need to call the N variable repeatedly in subsequent DATA steps and DO loops in both SAS and IML.

The problem is that every time I call N within a replication, it re-samples U, which necessarily modifies N. Thus, N is not held constant within a replication. This issue is shown in the code below, where I first create N as a variable (that is constant across individuals) and sample predictor values for X for each individual using a DO loop. Note that the value in N is not the same as the total number of individuals, which is also a problem.

DATA ID; 
    N = &N.;
    DO PersonID = 1 TO &N.;
        X = RAND("NORMAL",0,1); OUTPUT;
    END;
RUN;

I'm guessing that what I need to do is to somehow hold U constant throughout the entirety of one replication, and then allow it to be re-sampled for replication 2, and so on. By holding U constant, N will necessarily be held constant.

Is there a way to do this using macro variables?

Joe
Ryan W.

3 Answers

I'm not sure how to do it in the macro world, but this is how you could convert your code to a data step to accomplish the same thing.

The key is setting the random number stream initialization value, using CALL STREAMINIT.

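Data _null_;
call streaminit(35);   /* fix the seed so U and N are reproducible */
u=rand('uniform');
/* CALL SYMPUTX stores numeric values as trimmed text, without conversion notes */
call symputx('U', u);
call symputx('N',  ROUND(25 + (200 - 25)*U));
run;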


%put &n;
%put &u;
Reeza
  • While CALL STREAMINIT is a great idea, in his case it's not the problem: the problem is his code is `do PersonID = 1 to ROUND(25+(200-25)*RAND('UNIFORM'));` which re-creates the loop end each time with a new value. – Joe Feb 19 '15 at 18:31

&N does not store a value; it stores the text `ROUND(25 + (200 - 25)*RAND("UNIFORM"))`, which is pasted into your code and re-evaluated wherever &N appears. That is a misuse of macro variables: you could store a number in &N, but only by evaluating the expression at macro time with %sysfunc, and either way it's not really the right answer here.
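To make the distinction concrete, here is a minimal sketch. The first part repeats the question's two %LET statements and prints what &N actually holds. The second part is one possible way to store a number instead, using %SYSFUNC and %SYSEVALF; the names U2 and N2 are made up for this illustration, and no seed is set here, so the values will differ from run to run.

```sas
/* As in the question: &N holds text, not a number */
%LET U = RAND("UNIFORM");
%LET N = ROUND(25 + (200 - 25)*&U.);
%PUT N resolves to: &N.;   /* writes the expression text to the log */

/* Hypothetical alternative: evaluate once, at macro time */
%LET U2 = %SYSFUNC(RAND(UNIFORM));                            /* no quotes inside %SYSFUNC */
%LET N2 = %SYSFUNC(ROUND(%SYSEVALF(25 + (200 - 25)*&U2.)));   /* rounds to the nearest integer */
%PUT U2=&U2. N2=&N2.;      /* both now hold fixed numeric text */
```

Because &N2 is fixed text such as a single integer, every later reference within the replication sees the same value.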

First, if you're repeatedly sampling replicates, look at the paper "Don't be Loopy", which has some applications here. Also consider Rick Wicklin's paper, "Sampling with Replacement"; the book he references there, "Simulating Data in SAS", is quite good as well. Running your process on a one-sample-one-execution model is the slow, hard-to-work-with way. Generate all the replicates at once and process them all at once; IML and SAS are both happy to do that for you. Your uniform random sample size makes that a bit more difficult, but it's not insurmountable.

If you must do it the way you're doing it, and there's a reason to have the macro variable, I would have the data step create it. At the end of the data step, you can use CALL SYMPUTX to write out the values of N and U. For example:

%let iter=7; *we happen to be on the seventh iteration of your master macro;
DATA ID;
    CALL STREAMINIT(&iter.); 
    U = RAND("UNIFORM");
    N = ROUND(25 + (200 - 25)*U);
    DO PersonID = 1 TO N;
        X = RAND("NORMAL",0,1); 
        OUTPUT;
    END;
    CALL SYMPUTX('N',N);
    CALL SYMPUTX('U',U);
RUN;

But again, a one-data-step model is probably your most efficient model.

Joe
  • Thanks Joe! And, yes, my simulation process is clunky. For me, it's the difference between 5 minutes and 10 minutes. Gives me some necessary time to get out of my chair and walk around a bit. :) – Ryan W. Feb 19 '15 at 18:48

As Joe points out, the efficient way to perform this simulation is to generate all 1000 samples in a single data step, as follows:

data AllSamples;
call streaminit(123);
do SampleID = 1 to 1000;
   N = ROUND(25 + (200 - 25)*RAND("UNIFORM"));
   /* simulate sample of size N HERE */
   do PersonID = 1 to N;
      X = RAND("NORMAL",0,1);   
      OUTPUT;
   end;
end;
run;

This ensures independence of the random number streams, and it takes a fraction of a second to produce the 1000 samples. You can then use a BY statement to analyze the sampling distributions of the statistics on each sample. For example, the following call to PROC MEANS outputs the sample size, sample mean, and sample standard deviation for each of the 1000 samples:

proc means data=AllSamples noprint;
by SampleID;
var X;
output out=OutStats n=SampleN mean=SampleMean std=SampleStd;
run;

proc print data=OutStats(obs=5);
var SampleID SampleN SampleMean SampleStd;
run;

For more details about why the BY-group approach is more efficient (total time: less than 1 second!), see the article "Simulation in SAS: The slow way or the BY way."
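As a possible next step (a sketch, not part of the original analysis), the per-sample statistics in OutStats can themselves be summarized to describe the sampling distributions across the 1000 replicates. This assumes the OutStats data set and the variable names SampleN, SampleMean, and SampleStd created by the PROC MEANS call above.

```sas
/* Summarize the 1000 per-sample statistics across replicates */
proc means data=OutStats n mean std min max;
var SampleN SampleMean SampleStd;
run;
```

The minimum and maximum of SampleN give a quick check that the simulated sample sizes stay within the intended range of 25 to 200.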

Rick