I am trying to simulate data using an empirical distribution. For example, say there are five outcomes with probabilities as shown in the vector below:

PROBABILITY_VECTOR = [0.1, 0.2, 0.3, 0.25, 0.15]

The PROBABILITY_VECTOR is calculated from empirical data, so each entry is an average across samples. For the first category, the average probability is 0.1, but there is considerable variance among the samples. The same holds for the last category, whose average across all samples is 0.15. The middle categories, with probabilities 0.3 and 0.25, are fairly tight.

I use PROC IML with these statements:

CALL RANDSEED(12345);
CALL RANDGEN(SAMPLE, "TABLE", PROBABILITY_VECTOR);

When I do this, the average of all the simulated outcomes is consistent with the probability vector, as you would expect. But if I want my simulated trials to also show the wide variance that I observe in some of the categories in my data, how do I do that? Any ideas?

Joe
MrAnalyst
  • Use the [tag:sas-iml] tag, not the [tag:iml] tag which probably shouldn't exist. Rick Wicklin sometimes will answer questions in that tag - though probably more frequently at communities.sas.com. – Joe Sep 03 '21 at 16:52
  • Sounds like you have 6 groups and a binary variable. The vector seems to be the mean of a binary variable on each group. If this is correct, then sample from a mixture model. For each category, you need the pi[i]=prob that an observation is in the group, which is N[i]/N. See https://blogs.sas.com/content/iml/2011/09/21/generate-a-random-sample-from-a-mixture-distribution.html for a Gaussian example, but I think your example will simulate from a Bernoulli distribution. – Rick Sep 04 '21 at 09:48
  • @Rick Appreciate the response! Will look into that post, and research a bit more on how to simulate mixture models! – MrAnalyst Sep 15 '21 at 14:43
  • Actually, I put some additional thoughts into this blog post: https://blogs.sas.com/content/iml/2021/09/09/simulate-proportions-groups.html I did not submit it as an answer because I am not clear about certain details in your question. If you think my recent blog post answers your question, I can post it as an answer. – Rick Sep 16 '21 at 15:04
  • @Rick it does answer my question, this is great! In fact, the plots show the wide variation in Group 1 and Group 6, which is very similar to what I observe in my data (it is observational, not a designed experiment) and am trying to simulate better. I really appreciate you coming up with the sample data and simulation. Please submit it as an answer, and I will vote! I also did not know about strip plots before, so it is a very helpful visualization of the density within each group. – MrAnalyst Sep 17 '21 at 17:58

1 Answer

It sounds like you have k groups of subjects, and the sizes of the groups are N_1, N_2, ..., N_k. For each group, you have measured the proportion of subjects that have some characteristic of interest. The proportions are p_1, p_2, ..., p_k.

To simulate data like these, first take a random draw from a multinomial distribution with N = N_1 + N_2 + ... + N_k subjects, where the probabilities of group membership are N_1/N, N_2/N, ..., N_k/N. This gives you a new sample that has N subjects spread across k groups, and each group has approximately the same number of subjects as in the data. The group sizes explain why some groups have "wide variance" whereas others are "tight": a proportion computed from a small group varies more from sample to sample than a proportion computed from a large group.

To simulate which subjects in each group have the characteristic, draw from the binomial(p_i, N_i) distribution. This randomly assigns the characteristic to some of the subjects in the i-th group.
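The two steps described above can be sketched in SAS/IML roughly as follows. This is a minimal sketch, not the program from the article: the group sizes in N_obs are hypothetical (the question does not give them), the proportions reuse the vector from the question, and the names N_obs, Ntotal, pi, S, X, and phat are made up for illustration. It also assumes a SAS/IML release in which RANDGEN accepts a vector for the binomial size parameter.

```sas
proc iml;
call randseed(12345);

/* HYPOTHETICAL group sizes; replace with the N_i from your data */
N_obs = {20 80 200 150 30};
/* Proportions with the characteristic in each group (vector from the question) */
p     = {0.1 0.2 0.3 0.25 0.15};

Ntotal     = sum(N_obs);           /* N = N_1 + ... + N_k        */
pi         = N_obs / Ntotal;       /* membership probs N_i / N   */
NumSamples = 1000;                 /* number of simulated samples */

/* Step 1: simulated group sizes; one row per sample, one column per group */
S = RandMultinomial(NumSamples, Ntotal, pi);

/* Step 2: number of subjects with the characteristic in each group */
X = j(NumSamples, ncol(p), .);
do i = 1 to ncol(p);
   x_i = j(NumSamples, 1, .);
   /* assumption: vector-valued size parameter S[,i] is supported */
   call randgen(x_i, "Binomial", p[i], S[,i]);
   X[,i] = x_i;
end;

/* Simulated proportions; compare their spread across groups */
phat  = X / S;                     /* elementwise division       */
meanP = mean(phat);                /* column means               */
stdP  = std(phat);                 /* column standard deviations */
print meanP, stdP;
quit;
```

With sizes like these, the column means of phat should sit close to p, while the standard deviations should be largest for the smallest groups (here the first and last).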

If you repeat this process many times, you will see that the smaller groups have more variation than the larger groups. I have written a detailed explanation, including a SAS/IML program and graphics that visualize the variation among the groups. See the article "Simulate proportions for groups" (https://blogs.sas.com/content/iml/2021/09/09/simulate-proportions-groups.html).
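The reason smaller groups vary more follows from the binomial variance. As a short worked derivation (treating the group size N_i as fixed, which ignores the extra variation from the multinomial step), with X_i denoting the number of subjects in group i that have the characteristic:

```latex
\hat{p}_i = \frac{X_i}{N_i}, \qquad X_i \sim \mathrm{binomial}(p_i, N_i)

\operatorname{Var}(\hat{p}_i) = \frac{p_i (1 - p_i)}{N_i},
\qquad
\operatorname{SD}(\hat{p}_i) = \sqrt{\frac{p_i (1 - p_i)}{N_i}}
```

For example, with p_i = 0.1 and a hypothetical group size N_i = 20, the standard deviation of the simulated proportion is about 0.067. With N_i = 200 it drops to about 0.021, a factor of sqrt(10) smaller.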

Rick