My question is about the sample code in the 'Algorithm R' section of https://en.m.wikipedia.org/wiki/Reservoir_sampling

I copied the code snippet below from that section. Why does this code replace elements with gradually decreasing probability? According to the problem statement, each item in the input should have the same probability of being selected, right?

for i = k+1 to n
    j := random(1, i) 
    if j <= k
        R[j] := S[i]

For example, compare the call to random for the three inputs below, with a reservoir size of 10:

  • random(1, 15): the chance of getting a number no greater than 10 is high (10/15)
  • random(1, 100): the chance of getting a number no greater than 10 is low (10/100)
  • random(1, 1000): the chance of getting a number no greater than 10 is very low (10/1000)

So the chance of replacing an item becomes very small as the input grows. How, then, can we say that reservoir sampling selects random samples with equal probability for each item? Maybe I am missing something; please explain.

hippietrail
Pradeep

1 Answer

It is explained in the paragraph after the algorithm, but the key observation is this: a sample candidate in R can be overwritten multiple times, but you'll only see the result of the last write.

So when i is small, you have a higher chance of replacing a sample with a new one, but for the same reason the chance of that new sample still being there when you reach the end of the loop is small.

Whereas if i gets closer to n, the chance of a value making it into R is smaller, but if it gets there, it probably won't be overwritten later.

And if you multiply the probability of an element getting into R by the probability of it surviving every later step, you get k/n for every element.
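To see this empirically, here is a minimal simulation sketch in Python. It is a 0-based rendering of the pseudocode from the question, not a reference implementation; the function names `reservoir_sample` and `inclusion_frequencies`, the seed, and the values of n, k and the trial count are all arbitrary choices made for illustration.

```python
import random
from collections import Counter


def reservoir_sample(stream, k, rng):
    """Algorithm R with 0-based indices: return k items drawn from stream."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Uniform over 0..i inclusive, so j < k happens with probability k/(i+1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


def inclusion_frequencies(n, k, trials, seed=0):
    """Fraction of trials in which each of the items 0..n-1 ends up in the reservoir."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        counts.update(reservoir_sample(range(n), k, rng))
    return [counts[x] / trials for x in range(n)]


if __name__ == "__main__":
    freqs = inclusion_frequencies(n=20, k=5, trials=100_000)
    # Early and late items alike should land near k/n = 0.25.
    print(min(freqs), max(freqs))
```

The smallest and largest observed frequencies should both sit close to k/n, whether the item arrived early (likely to enter, likely to be overwritten) or late (unlikely to enter, unlikely to be overwritten).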

biziclop
  • The probability of the last items being selected into the reservoir is small, right? So how can we say that each item has the same probability of being selected into the reservoir? And how does each item get a probability of k/n? Please explain. – Pradeep Aug 03 '16 at 12:50
  • @Pradeep Only the probability of it **still being selected at the end** is equal. The probability of the last item **being selected** is `k/n`. The probability of the last-but-one item being selected is `k/(n-1)`, but there is a `1/n` chance that the last item might be selected to the exact same position. So the chance of the (n-1)th item being **selected and not overwritten** is `k/(n-1) * (n-1)/n = k/n`. And so on, for all the others. – biziclop Aug 03 '16 at 12:57
  • Thanks for your patience. Can you please explain why the probability of not being overwritten is `(n-1)/n` in your example? I assumed it would be `1-1/k`, because the probability of selecting a given item from the reservoir is `1/k` and the probability of not selecting it is `1-1/k`. Maybe I am missing some context here; please explain. – Pradeep Aug 03 '16 at 14:36
  • @Pradeep The algorithm is not clear to you. Take a look at it again and you will see why the probability is not `1-1/k`. In particular, @biziclop is right that the probability of the next-to-last item (n-1) being replaced by the last item (n) is `1/n`: when deciding about the last item, there are `n` numbers to choose from, and only one of them points at the slot holding the next-to-last item, hence the probability is `1/n`. – Amin.A Nov 20 '20 at 07:06
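
The arithmetic in these comments can be checked exactly with rational numbers. The sketch below assumes 1-based item positions as in the pseudocode; `final_inclusion_probability` is a hypothetical helper written for this illustration, and k = 10, n = 1000 are taken from the example in the question.

```python
from fractions import Fraction


def final_inclusion_probability(i, k, n):
    """Exact probability that item i (1-based) is in the reservoir after all n items."""
    # Items 1..k start in the reservoir; item i > k enters with probability k/i.
    p = Fraction(1) if i <= k else Fraction(k, i)
    # Each later item m overwrites one particular slot with probability 1/m,
    # so an item already in the reservoir survives step m with probability (m-1)/m.
    for m in range(max(i, k) + 1, n + 1):
        p *= Fraction(m - 1, m)
    return p


if __name__ == "__main__":
    k, n = 10, 1000
    # Collect the distinct probabilities over all items; there should be only one.
    print({final_inclusion_probability(i, k, n) for i in range(1, n + 1)})
```

The survival factors telescope: for an item i > k the product is `k/i * i/(i+1) * ... * (n-1)/n = k/n`, which is why the set printed at the end contains a single value.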