2

I came across the following question:

Using rand() function, generate a number with expected value k. Options are:

1)

   int GetRandom(int k)
   {
       v=0;
       while(rand()<1.0f/(float)k)
           v++; 
       return v;
    }

2)

   int GetRandom(int k)
   {
       v=0;
       while(rand()<(1-1.0f/(float)k))
           v++; 
       return v;
    }

3)

   int GetRandom(int k)
   {
       v=0;
       while(rand() > (1-1.0f/(float)(k+1)))
           v++; 
       return v;
    }

1) seemed like the correct answer. Examining the outcome for specific values of k seems to indicate this is the not the case. (I set k=3. The frequency distribution of values for 100000 trials can be seen in the image below )

How would one do this ?

The question is somewhat similar to this one.

Community
  • 1
  • 1
curryage
  • 481
  • 1
  • 6
  • 19

1 Answers1

3

You want (2). This samples a Geometric Distribution (link) with mean k.

A geometric distribution represents an experiment of this kind:

  • A certain event happens repeatedly, with an outcome that is either 0 or 1
  • The outcome of an event is 1 with probability p and 0 with probability 1-p
  • What is the index of the first event with an outcome of 1?

So if X ~ G(p), where X is a random variable and p is the probability above, then X represents "What is the index of the first event with an outcome of 1?" The expectation is E[X] = 1/p.

Given this information it should now be clear that the following represents a sampling of the random variable X with p = 1/k (and is equivalent to (2)).

int Sample(int k)
{
    int v = 1;
    while (true)
    {
        //outcome is true with probability p = 1/k
        bool outcome = rand() < 1 / (double)k;
        if (outcome)
            return v;
        else
            v++;
    }
}

Be aware that looking at the peak (mode) and expectation of the distribution are not the same thing. The peak of the geometric distribution is always going to be at 1!

Timothy Shields
  • 75,459
  • 18
  • 120
  • 173
  • I thought the same, but the distribution of outputs as shown in frequency histogram accompanying the question shows otherwise. I am not sure what to make of it. – curryage Jun 20 '13 at 20:56
  • If you have MATLAB(or similar), you can recreate the above plot The code I have used can be seen here : http://pastebin.com/741Y4N72 – curryage Jun 20 '13 at 21:13
  • @curryage I'm saying (2) is the correct answer. Your plot is for (1), which is incorrect. – Timothy Shields Jun 20 '13 at 21:17
  • The plot changes slightly with (2), but I don't see 3 as the having the tallest peak as would be expected for input k=3. – curryage Jun 20 '13 at 21:27
  • 1
    @curryage Expectation (average) and mode (peak) are not going to be the same. When you use `k=3`, sample a whole bunch of values, then average those values. It will be near 3. - The mode (peak) of a geometric distribution is always going to be 0, regardless of what `p` you choose! – Timothy Shields Jun 20 '13 at 21:34
  • For `k=3`, the average I obtained was `1.999`. For other values of 'k`, I get an smaller-by-1 result. Changing the `(double)k` to `(double)(k+1)` fixed it. Why would that be the case ? – curryage Jun 20 '13 at 21:40
  • 1
    @curryage That's the wrong way to fix it. You should instead be starting with `int v = 1` (instead of `int v = 0`). There's nothing wrong with the reasoning in the answer - it's just a matter of whether you want things 0-indexed or 1-indexed. – Timothy Shields Jun 20 '13 at 21:43
  • Essentially you're generating discrete random variables in this problem, and several distributions could be used. If you want k=1 to be a possibility, the geometric distribution only works when p(1)=1, i.e., you only produce 1's. If you want k=1 as a possibility and to still have distributional behavior, you could switch to a [Poisson generator](http://en.wikipedia.org/wiki/Poisson_distribution#Generating_Poisson-distributed_random_variables) rather than the geometric. – pjs Jun 20 '13 at 22:09
  • @pjs I think the point was that the OP had to choose between the 3 sampling functions provided (probably homework). There are many kind of distributions over natural numbers - geometric and Poisson are just two. However, the sampling functions in the question were attempts at a geometric distribution specifically. – Timothy Shields Jun 20 '13 at 22:15
  • @Timothy Shields The question was not homework. Thanks for the explanation anyway. – curryage Jun 20 '13 at 22:20
  • @TimothyShields No dispute from me. I just noticed that even though no restrictions were stated for k, you couldn't get variation if k was zero or one. I wanted to throw it out there that alternative implementations could yield k=1 and still give random outcomes. – pjs Jun 20 '13 at 22:22
  • @Timothy Shields : The `Sample()` function works as intended and I "sort of" get how it produces the expected value. I am trying to relate it to the usual way of sampling from discrete distribution where you generate from U[0,1] and determine what to output based on the CDF range it falls in. I guess I still do not see how `Sample()` produces the expected value. Online resources provide a logarithmic function of U[0,1] as sampled values, but the method you provide works as well. I am trying to reconcile the two. – curryage Jun 21 '13 at 04:31
  • 1
    @curryage The `Sample()` function doesn't produce the expected value. What it produces is a sequence of values from a distribution (geometric) that Timothy Shields tuned to have the desired expected value. Expected value is a mathematical property of the underlying distribution, not a characteristic of a sample - once you've specified the distribution its expectation can be calculated without any sampling. However, it can be _estimated_ by taking the average of values generated by `Sample()`. As the sample size increases that average should converge towards the expected value. – pjs Jun 21 '13 at 17:28
  • @pjs : I guess what I am trying to wrap my head around is this: How would one "prove" that using the `Sample()` function generates samples from the geometric distribution ? Also, I was trying to relate it to how the `rand()` uniform distribution is typically used to sample from a discrete probability distribution. – curryage Jun 21 '13 at 19:19
  • 2
    @curryage You can't prove it, but you can confirm that it meets the assumptions for a geometric distribution: 1) the outcome of each trial is a "success" or a "failure" with constant probability per trial; 2) the trials are independent of each other; and 3) you keep going until you get a success. The geometric distribution then describes the probability that the first success occurs on the n-th trial, for all values of n from 1 to infinity. Now look at `Sample()`. Fixed probability of success: Check! Independent trials: Check! Keep going until success occurs: Check! It's geometric. – pjs Jun 21 '13 at 22:06
  • @curryage An alternate formulation is "how many failures did you see before your first success?" That formulation has zero as a possible outcome, rather than starting at one. – pjs Jun 21 '13 at 22:10
  • @pjs "You can't prove it." I think you did at some level. ;) – Timothy Shields Jun 21 '13 at 22:19
  • @pjs : If I have a function like so `bool RunExperiment(float p) { if ( rand0to1() < p) return true; else return false; }` and I call this function `N` times in a loop. If 'n' is the number of times I get a success and I obtain `n/N approximately equals p`, this means that `RunExperiment()` has probability 'p' of success. The following code verifies this: http://ideone.com/oXfEfG . But how is it that `rand0to1() < p` ensures this happens ? Specifically, why '`<` ? It would seem like `rand0to1() == p` could do the job, but it doesn't since it is meaningless for a continuous distribution. – curryage Jun 22 '13 at 10:55
  • 1
    @curryage `rand0to1()` produce values uniformly between 0 and 1. For any interval `[a,b]` with `0 <= a < b <= 1` the long-run proportion of `rand0to1()` outcomes that fall in that range is `b-a`, the length of the range. If you test `rand0to1() <= p`, you're checking for an outcome in the range `[0,p]`, which has length `p-0=p` and therefore occurs proportion p of the time. It would have been equally valid to check `rand0to1() >= 1-p`, or any other range between 0 and 1 of length p. `rand0to1() <= p` is easy. Using < instead of <= introduces a very tiny computational error. – pjs Jun 22 '13 at 16:03
  • @pjs : Thanks a lot for your patient and illuminating explanation. – curryage Jun 23 '13 at 03:28