I am trying to simulate data using an empirical distribution. For example, say there are five outcomes with probabilities as shown in the vector below:
PROBABILITY_VECTOR = {0.1 0.2 0.3 0.25 0.15};
PROBABILITY_VECTOR is calculated from empirical data. For the first category, the average probability across samples is 0.1, but there is considerable variance among the samples. The same is true of the last category: its average across all samples is 0.15, but the variance is considerable. The middle categories, with probabilities 0.3 and 0.25, are fairly tight.
I use PROC IML with these statements:
SAMPLE = J(1000, 1);  /* RANDGEN fills a pre-allocated matrix; 1000 draws is an arbitrary size */
CALL RANDSEED(12345);
CALL RANDGEN(SAMPLE, "TABLE", PROBABILITY_VECTOR);
When I do this, the overall proportions of the simulated outcomes are consistent with the probability vector, as you would expect. But I also want my simulated trials to show the wide sample-to-sample variance that I observe in some of the categories in my data. How do I do that? Any ideas?
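For reference, here is a minimal self-contained sketch of the baseline run described above, with a check that the simulated proportions track the probability vector. The number of draws N and the names x, levels, freq, and prop are illustrative assumptions, not taken from the original program.

    proc iml;
    call randseed(12345);
    p = {0.1 0.2 0.3 0.25 0.15};     /* category probabilities from the question */
    N = 10000;                       /* assumed number of draws */
    x = j(N, 1);                     /* allocate the result; RANDGEN fills it in place */
    call randgen(x, "Table", p);     /* each draw is a category index 1..5 */
    call tabulate(levels, freq, x);  /* count how often each category occurs */
    prop = freq / N;                 /* observed proportions, to compare with p */
    print levels, prop;
    quit;

Every draw here uses the same fixed p, so the only spread between repeated runs is ordinary sampling variation, which is the behavior the question describes.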