
This might sound bonkers, but I'm looking to see if there are any ideas on how to do this.

I have N categories (say 7) across which a set number of people (say 1000) have to be allocated. I know from historical data the minimum and maximum for each category. The historical data is limited (say 15 samples), so it looks like the lists below; if I had a larger sample I would try to fit a distribution for each category from all the samples, but there isn't enough data for that.

- Year 1:  [78 97 300 358 132 35 0]
- Year 2:  [24 74 346 300 148 84 22]
- ...
- Year 15: [25 85 382 302 146 52 8]

The min and max for each category over these 15 years of data are:
Min:   [25  74  252 278 112 27 0 ]
Max:   [132 141 382 360 177 84 22]

I am trying to scale this up using simulation: allocate the 1000 people across the seven categories, with each category staying within its min and max limits, and repeat. The only condition is that the allocation across the seven categories in each simulation has to sum to 1000.

Any ideas would be greatly appreciated!

MrAnalyst

1 Answer


The distribution you want is called the multinomial distribution. You can use the RandMultinomial function in SAS/IML to produce random samples from the multinomial distribution. To use the multinomial distribution, you need to know the probability that an individual falls in each category. If this probability has not changed over time, the best estimate of it is the average proportion in each category.
Thus, I would recommend using ALL the data to estimate the probability, not just max and min:

proc iml;
call randseed(123);     /* optional: fix the seed so the simulation is reproducible */
X = {...};              /* X is a 15 x 7 matrix of counts; each row is a year */
mean = mean(X);         /* column means: average count in each category */
p = mean / sum(mean);   /* estimated probability for each category */
/* simulate new counts by using the multinomial distribution */
numSamples = 10;
SampleSize = 1000;
Y = randmultinomial(numSamples, SampleSize, p);
print Y;
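RandMultinomial guarantees that every simulated row sums to the sample size, so the row-sum check below is only a sanity check; the per-category minimum and maximum of the simulated counts can then be compared by eye with the historical Min/Max from the question. A minimal sketch, continuing in the same PROC IML session after Y has been created:

/* continuing in the same PROC IML session */
rowSums = Y[, +];       /* each row of Y should equal SampleSize = 1000 */
simMin  = Y[><, ];      /* per-category minimum of the simulated counts */
simMax  = Y[<>, ];      /* per-category maximum of the simulated counts */
print rowSums, simMin, simMax;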

Now, if you insist on using the max/min, you could use the midrange to estimate the most likely value and use that to estimate the probability, as follows:

Min = {25  74  252 278 112 27 0};
Max = {132 141 382 360 177 84 22};
/* use the midrange to estimate probabilities */
midrange = (Min + Max)/2;
p = midrange / sum(midrange);
/* now use RandMultinomial, as before */
Y = randmultinomial(numSamples, SampleSize, p);

If you use the second method, there is no guarantee that the simulated values will stay within the Min/Max limits, although in practice many of the samples will satisfy that criterion.
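To see how often that happens, you could count the simulated rows that respect the historical limits in every category; a rough sketch, reusing Y, Min, and Max from the code above:

/* count the simulated rows that stay within the historical limits in all 7 categories */
nSim    = nrow(Y);
lowOK   = (Y >= repeat(Min, nSim, 1));        /* 1 where a count is at least the category min */
highOK  = (Y <= repeat(Max, nSim, 1));        /* 1 where a count is at most the category max  */
inRange = ((lowOK # highOK)[, +] = ncol(Y));  /* 1 for rows where every category is in range  */
print (sum(inRange))[label="Simulated rows within historical Min/Max"];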

Personally, I advocate the first method, which uses the average count. Or you can use a time-weighted count, if you think recent observations are more relevant than observations from 15 years ago.
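If you want to try the time-weighted route, one possibility is a weighted column mean; this is only a sketch, the linear weights below are an arbitrary choice, and X is assumed to be the same 15 x 7 matrix as in the first code block:

/* time-weighted estimate: recent years get more weight (linear weights are arbitrary) */
w = 1:nrow(X);                /* weights 1, 2, ..., 15; year 15 gets the largest weight */
wMean = (w * X) / sum(w);     /* weighted column means */
p = wMean / sum(wMean);       /* time-weighted probability estimates */
print p;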

Rick
  • Thank you, Rick. I currently use the multinomial (and the 'table' analogue using randgen) to generate samples using the average for all the data, as you suggested. The issue that I run into is the opposite of what you described - the simulated values are too close to the average, and don't span the range suggested by the min and max for each category. I am not sure, behind the scenes, how the multinomial or table methods work to generate the variation around the supplied probability in each category. I am thinking maybe I can add a random error component to the multinomial samples generated – MrAnalyst May 04 '22 at 14:59
  • It sounds like your data are mode variable than the multinomial model. Recall that for a univariate BINOMIAL distribution, the distrib is defined by the probability of an event. The variance is a DEPENDENT quantity Var(X)=np(1-p). The same is true for the MULTINOMIAL distrib: the distrib is defined by the probabilities and the variances & covariances are determined: Var(X_i)=n*p_i*(1-p_i) & cov(Xi,Xj)= -n*p_i*p_j. If the observed covariances are larger, you have overdispersion and the multinomial distrib is not a good model. – Rick May 05 '22 at 18:31
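One way to check for the overdispersion Rick describes is to compare the observed per-category variance with the variance the multinomial model implies. A hedged sketch, using only the three years shown in the question as stand-in data (in practice you would use the full 15 x 7 matrix):

proc iml;
/* stand-in data: only the three years listed in the question */
X = {78  97 300 358 132 35  0,
     24  74 346 300 148 84 22,
     25  85 382 302 146 52  8};
n = 1000;                            /* people allocated each year */
p = mean(X) / sum(mean(X));          /* estimated category probabilities */
modelVar = n # p # (1 - p);          /* Var(X_i) = n*p_i*(1-p_i) under the multinomial */
obsVar   = var(X);                   /* observed variance of the counts in each category */
print obsVar[label="Observed variance"], modelVar[label="Multinomial variance"];
quit;

If the observed variances are systematically larger than the multinomial variances, that is the overdispersion Rick mentions, and a model with extra variation (beyond the plain multinomial) would be a better fit.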