
I'm simulating the model where there are N marbles, out of which K marbles are good. We pick n marbles out of N marbles and are asked for the probability that exactly k out of the n picked ones are good.

I did this two ways. In both, I generated an array containing K 'true' values and N-K 'false' values. In the first method I shuffled this array, took the first n values, and counted how many of them are 'true'. In the second method I picked an index at random and removed that element from the array, repeating this n times (and counting the 'true' elements I got).

The resulting distribution should be HyperGeometric(N, K, n). The first method gave me wrong results, whereas the second gave the correct result. Why isn't it OK to pick the first n elements of the shuffled array, or what else am I doing wrong? Here's my JavaScript code:

function pickGoodsTest(N, K, n) {
    var origArr = generateArr(N, i=> i<K);
    shuffle(origArr);
    var goods = 0;
    for (let i=0; i<n; i++) if(origArr[i]) goods++;
    return goods;
}

function pickGoodsTest2(N, K, n) {
    var origArr = generateArr(N, i=> i<K);
    var goods = 0;
    for (let i=0; i<n; i++) {
        let rndInd = randInt(0, origArr.length-1);
        let wasGood = origArr.splice(rndInd, 1)[0];
        if (wasGood) goods++;
    }
    return goods;
}

//helper functions:

function generateArr(len, indFunc) {
    var ret = [];
    for (let i=0; i<len; i++) {
        ret.push(indFunc(i));
    }
    return ret;
}

function randInt(a, b){return a+Math.floor( Math.random()*(b-a+1) );}

function shuffle(arr) {
    let arrLen = arr.length;
    for (let i=0; i<arrLen; i++) {
        let temp = arr[i];
        let rndInd = randInt(0, arrLen-1);
        arr[i] = arr[rndInd];
        arr[rndInd] = temp;
    }
}

These are plots of the outcomes with values N=10, K=6, n=5 (simulated 500000 times):

[image: plots of the simulated outcomes]

The yellow dot is the value of the hypergeometric pmf.
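For reference, the expected values can be computed directly from the pmf, P(X = k) = C(K, k) * C(N-K, n-k) / C(N, n). The following is a minimal sketch; `binomial` and `hypergeomPmf` are hypothetical helper names, not part of the code above.

```javascript
// Binomial coefficient C(n, k), computed multiplicatively to avoid large factorials.
function binomial(n, k) {
    if (k < 0 || k > n) return 0;
    k = Math.min(k, n - k);
    let result = 1;
    for (let i = 1; i <= k; i++) {
        result = result * (n - k + i) / i;
    }
    return Math.round(result);
}

// Probability of exactly k good marbles among n drawn without replacement
// from N marbles of which K are good.
function hypergeomPmf(N, K, n, k) {
    return binomial(K, k) * binomial(N - K, n - k) / binomial(N, n);
}

// Values for the plotted case N=10, K=6, n=5, for k = 0..5.
const pmf = [];
for (let k = 0; k <= 5; k++) pmf.push(hypergeomPmf(10, 6, 5, k));
console.log(pmf);
```

For example, k=3 gives 20 * 6 / 252 = 120/252, roughly 0.476, and k=0 gives 0 because only 4 marbles are bad.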

ploosu2

2 Answers


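The way you shuffle the array is biased: at every step it swaps element i with an index drawn from the whole array, so not all permutations are equally likely. I would suggest using a Fisher-Yates shuffle instead, which draws the swap index from 0..i only: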

function shuffle(arr) {
    let arrLen = arr.length;
    for (let i=0; i<arrLen; i++) {
        let temp = arr[i];
        // Swap partner comes from 0..i, not from the whole array
        let rndInd = randInt(0, i);
        arr[i] = arr[rndInd];
        arr[rndInd] = temp;
    }
}
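
One way to check the difference is to enumerate every sequence of random indices a shuffle can take on a 3-element array and count the permutations that result. This is only a sketch: `countOutcomes` and its `maxIndex` parameter are hypothetical names introduced for the illustration.

```javascript
// Enumerate every path of the loop "for i: swap(arr[i], arr[j])" where j
// ranges over 0..maxIndex(i, n), and count how often each permutation results.
function countOutcomes(n, maxIndex) {
    const counts = {};
    function step(arr, i) {
        if (i === n) {
            const key = arr.join("");
            counts[key] = (counts[key] || 0) + 1;
            return;
        }
        for (let j = 0; j <= maxIndex(i, n); j++) {
            const next = arr.slice();
            [next[i], next[j]] = [next[j], next[i]];
            step(next, i + 1);
        }
    }
    step(Array.from({ length: n }, (_, i) => i), 0);
    return counts;
}

// Swap partner from 0..i (the loop shown above): 1 * 2 * 3 = 6 paths.
const fisherYates = countOutcomes(3, (i, n) => i);
// Swap partner from 0..n-1 (the question's shuffle): 3 * 3 * 3 = 27 paths.
const naive = countOutcomes(3, (i, n) => n - 1);

console.log(fisherYates);
console.log(naive);
```

With the index drawn from 0..i, each of the 6 permutations is reached by exactly one path. With the index drawn from the whole array, the 27 paths cannot be split evenly over 6 permutations, so some orderings come up more often than others.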
user1470500
  • Thanks! I have always been using the former way of shuffling without thinking if it is biased. Fisher-Yates shuffle produces the correct outcome (as expected, since it is unbiased as Wikipedia says). – ploosu2 Sep 28 '17 at 10:19

The code below shows that your shuffle is biased. It runs the shuffle on an array of size 3 for every possible sequence of random indices and counts how often each number ends up in each position.

import java.util.Arrays;

public class TestShuffle {
    public static void main(String[] args) {
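        // stat[p][v] counts how often value v ends up at position p
        int[][] stat = new int[3][3];

        // i, j, k are the three "random" indices; try all 27 combinations
        for (int i = 0; i < 3; i++) {
            for (int j = 0; j < 3; j++) {
                for (int k = 0; k < 3; k++) {
                    int[] y = {0, 1, 2};
                    swap(y, 0, i);
                    swap(y, 1, j);
                    swap(y, 2, k);

                    stat[0][y[0]]++;
                    stat[1][y[1]]++;
                    stat[2][y[2]]++;
                }
            }
        }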

        System.out.println(Arrays.deepToString(stat));
    }

    private static void swap(int[] y, int i, int k) {
        int tmp = y[i];
        y[i] = y[k];
        y[k] = tmp;
    }
}

The output is:

[[9, 10, 8], [9, 8, 10], [9, 9, 9]]

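Each inner array is a position and each entry in it is a value, so the number "1" ends up in position 0 in 10 of the 27 equally likely cases. Its probability is 10/27, which is greater than the 1/3 an unbiased shuffle would give.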
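
A counting argument leads to the same conclusion without running any code. The shuffle in the question makes n independent choices out of n indices, so it has n^n equally likely paths, and these would have to be divided evenly among the n! permutations. For n = 3, written out as a sketch:

```latex
\[
3^3 = 27 \ \text{equally likely swap sequences}, \qquad
3! = 6 \ \text{permutations}, \qquad
\frac{27}{6} = 4.5 \notin \mathbb{Z}.
\]
% In general, for n >= 3: (n-1) divides n!, but gcd(n-1, n) = 1,
% so (n-1) does not divide n^n, and therefore
\[
n! \nmid n^n \quad \text{for all } n \ge 3 .
\]
```

Since the paths cannot be shared equally, at least one permutation must be more likely than another, whatever the array contains.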

Gedrox