This link "http://had00b.blogspot.com/2013/07/random-subset-in-mapreduce.html" talks about how one can implement reservoir sampling using map reduce framework. I feel their solution is complicated and the following simpler approach would work.
Problem: Given very large number of samples, generate a set of size k such that each sample has equal probability of being present in the set.
Proposed solution:
- Map operation: For each input number n, output (i, n) where i is randomly chosen in range 0 to k-1.
- Reduce operation: Among all numbers with same key, pick one randomly.
Claim: Probability of any number being in set of k size is k/n (where n is total number of samples)
Proof intuition:
Since map operation randomly assigned each input sample to bucket number i (0 <= i <= k-1), size of each bucket would be n/k. Now each number is present only in one bucket, suppose bucket i. The probability that it gets picked in reduce operation for bucket i is 1/(n/k) = k/n
I would appreciate any second thoughts on my solution, whether it seems correct or not.