
I am generating about 100 million random numbers to pick from 300 things. I need to set it up so that I have 10 million independent instances (each with a different seed) that pick 10 times each. The goal is for the aggregate results to have very low discrepancy, i.e., each item gets picked about the same number of times.

The problem is that with a regular PRNG, some numbers get chosen more than others (I tried an LCG and Mersenne Twister); the difference between the most-picked and least-picked item can be several thousand to ten thousand. With linear congruential generators and Mersenne Twister, I also tried picking 100 million times with 1 instance, and that also didn't yield uniform results. I'm guessing this is because the period is very long, and perhaps 100 million isn't big enough. Theoretically, if I pick enough numbers, the results should approach uniformity (each count should settle near the expected value).
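For reference, that spread is roughly what independent sampling predicts: with 100 million draws over 300 items, each count has a standard deviation of about sqrt(1e8/300) ≈ 577, so a gap of a few thousand between the most- and least-picked item is statistically normal. Here is a minimal sketch of the counting experiment (Python/NumPy with an explicit Mersenne Twister, scaled down so it runs quickly; the library and sizes are just for illustration, not my actual code):

```python
import numpy as np

N_INSTANCES = 100_000        # scaled down from 10 million for a quick run
PICKS_PER_INSTANCE = 10
N_ITEMS = 300

counts = np.zeros(N_ITEMS, dtype=np.int64)
for seed in range(N_INSTANCES):
    rng = np.random.Generator(np.random.MT19937(seed))  # one independently seeded instance
    u = rng.random(PICKS_PER_INSTANCE)                   # uniform floats in [0, 1)
    picks = np.floor(u * N_ITEMS).astype(np.int64)       # map to [0, 299] via floor, not modulo
    np.add.at(counts, picks, 1)

print("expected per item:", N_INSTANCES * PICKS_PER_INSTANCE / N_ITEMS)
print("approx std dev per item:", (N_INSTANCES * PICKS_PER_INSTANCE / N_ITEMS) ** 0.5)
print("max - min:", counts.max() - counts.min())
```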

I switched to Sobol, a quasi-random generator, and got much better results with the 100-million-from-1-instance test (the difference between the most picked and least picked is about 5). But when I split it up into 10 million instances with 10 picks each, the uniformity was lost and I got results similar to the PRNG. Sobol seems very sensitive to sequence order: skipping ahead randomly diminishes uniformity.

Is there a class of random generators that can maintain quasi-random-like low discrepancy even when combined across 10 million independent instances? Or is that theoretically impossible? One solution I can think of now is to use 1 Sobol generator that is shared across the 10 million instances, so effectively it is the same as the 100-million-from-1-instance test.
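A sketch of that shared-generator idea, using SciPy's qmc.Sobol purely for illustration (the module and the way the stream is split into instances are assumptions, not tested code):

```python
import numpy as np
from scipy.stats import qmc

N_INSTANCES = 100_000        # scaled down from 10 million
PICKS_PER_INSTANCE = 10
N_ITEMS = 300

# One global Sobol sequence shared by all instances, so the aggregate keeps
# its low-discrepancy structure.  (SciPy warns that Sobol balance is best
# when the sample count is a power of 2.)
sobol = qmc.Sobol(d=1, scramble=False)
u = sobol.random(N_INSTANCES * PICKS_PER_INSTANCE).ravel()  # floats in [0, 1)
picks = np.floor(u * N_ITEMS).astype(np.int64)

# Hand each instance 10 consecutive values from the shared stream.
per_instance = picks.reshape(N_INSTANCES, PICKS_PER_INSTANCE)

counts = np.bincount(picks, minlength=N_ITEMS)
print("max - min:", counts.max() - counts.min())
```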

  • If you want perfect uniformity, why not use a variant of a [shuffle](http://en.wikipedia.org/wiki/Fisher-Yates_shuffle)? Logically, you would start with a vector filled with 100 million / 300 copies of each value, and then shuffle it (although obviously, 100 million doesn't divide exactly by 300...). – Oliver Charlesworth Jun 02 '12 at 16:03
  • well, the goal is to have 10 million instances with 10 values each (randomly chosen from 300 items) and still have uniformity at the aggregate level. Shuffling will maintain uniformity if done in 1 instance, but will still be subject to discrepancy issues if done in 10 million independent instances. – user1432577 Jun 04 '12 at 17:18
  • Do you really need that many instances of the RNG, or can you just use one RNG and distribute its output in round-robin fashion across your 10e6 "buckets"? - Besides: how do you map the output of your RNG to the required range of [0,299]? A simple modulus-300 operation will definitely cause some bias from any RNG that does not natively produce output words across a [0, n x 300] range (see the sketch after these comments). – JimmyB Jun 05 '12 at 10:33
  • Additionally, it is certainly a wrong assumption that, given an _arbitrary_ number `n` of samples, any (pseudo-)random process will produce an exactly uniform distribution. Having this property would make the output of the process somewhat predictable, which is in fact the opposite of what one wants. – JimmyB Jun 05 '12 at 10:43
  • The mapping is done through floor(random float * 300), so it won't have the modulo bias. Also, the goal is uniformity rather than pure randomness, which is why quasi-random was considered. One solution is to use a single generator and distribute it like you mentioned, but I posted to see if there was a different way. You are also right that an arbitrary number n of samples from a pseudorandom process will not be exactly uniform. I just found it strange that 100M samples still did not near uniformity, since at 100M I thought you'd start seeing the effects of the law of large numbers. – user1432577 Jun 05 '12 at 13:56
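A quick back-of-the-envelope check of the modulo-bias point raised above (plain Python; the 32-bit word size is an assumption about the generator):

```python
WORD_RANGE = 2**32   # output range of a typical 32-bit generator
N_ITEMS = 300

q, extra = divmod(WORD_RANGE, N_ITEMS)   # 2^32 is not a multiple of 300
print(f"{extra} items map from {q + 1} source words, {N_ITEMS - extra} items from {q}")
print(f"relative over-representation ~ {N_ITEMS / WORD_RANGE:.1e}")  # about 7e-8
```

So the modulo bias is real but on the order of 1e-7 per draw, far too small to explain a max-minus-min gap of thousands; the floor(u * 300) mapping used in the question sidesteps the word-modulus issue anyway.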

1 Answer


Both shuffling and proper use of Sobol should give you the uniformity you want. The shuffling needs to be done at the aggregate level: start with a global 100M-sample pool having the desired aggregate frequencies, shuffle it to introduce randomness, and finally split it into the 10-value instances. Shuffling within each instance wouldn't help globally, as you noted.
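A sketch of that aggregate shuffle (Python/NumPy just to make the idea concrete; the pool size is scaled down, and handing out the leftover from the total not dividing by 300 as single extra copies is my own choice):

```python
import numpy as np

N_INSTANCES = 1_000_000            # scaled down from 10 million
PICKS_PER_INSTANCE = 10
N_ITEMS = 300
N_TOTAL = N_INSTANCES * PICKS_PER_INSTANCE

rng = np.random.default_rng(12345)

# Build a global pool with (nearly) equal counts per item ...
copies, leftover = divmod(N_TOTAL, N_ITEMS)
pool = np.repeat(np.arange(N_ITEMS), copies)
# ... giving one extra copy to `leftover` randomly chosen items, since the
# total doesn't divide exactly by 300.
pool = np.concatenate([pool, rng.choice(N_ITEMS, size=leftover, replace=False)])

# Shuffle the whole pool (Fisher-Yates under the hood), then split it into
# the per-instance picks.
rng.shuffle(pool)
instances = pool.reshape(N_INSTANCES, PICKS_PER_INSTANCE)

counts = np.bincount(pool, minlength=N_ITEMS)
print("max - min:", counts.max() - counts.min())   # at most 1 by construction
```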

But that's an additional level of uniformity, which you might not really need: plain randomness might be enough. First of all, I would check the check itself, because it sounds strange that with that many samples you're getting statistically significant deviations (look up the "chi-square test" to quantify such significance, or equivalently how many samples are "enough"). So as a first sanity check: if you're picking independent values, simplify the problem to 10M instances each picking 10 times out of 2 categories. Do you get approximately a binomial distribution? For picking without replacement it's a different distribution (hypergeometric, IIRC, but check). Then generalize to more categories (multinomial distribution), and only after that is it safe to proceed with your full problem.
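A sketch of that chi-square sanity check (SciPy's chisquare is used purely for illustration; the sample sizes are placeholders):

```python
import numpy as np
from scipy.stats import chisquare

N_INSTANCES = 100_000
PICKS_PER_INSTANCE = 10
N_ITEMS = 300

rng = np.random.default_rng(0)
picks = rng.integers(0, N_ITEMS, size=(N_INSTANCES, PICKS_PER_INSTANCE))
counts = np.bincount(picks.ravel(), minlength=N_ITEMS)

# Null hypothesis: every item is equally likely.  A large p-value means the
# observed max-minus-min spread is consistent with honest uniform sampling,
# not evidence of a broken generator.
stat, p = chisquare(counts)
print("max - min:", counts.max() - counts.min())
print(f"chi-square = {stat:.1f}, p-value = {p:.3f}")
```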

Quartz