I have a table with roughly 4*10^8 records, and I want to draw a sample of exactly 4*10^6 records from it.

However, the sample has to be drawn in a particular way:

  1. Select 1 record from the 4*10^8 records uniformly at random (every record has the same probability of being selected).
  2. Repeat step 1 4*10^6 times (it does not matter if a record is selected multiple times).

I came up with the following method:

  1. Generate a table A(num int) in which every record holds a single number: a random integer from 1 to n (n is the size of my original table, roughly 4*10^8 as mentioned above).
  2. Load table A as a resource file in every map, and if the ordinal number of the record currently being processed is in table A, output that record; otherwise discard it.
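
For reference, the two steps above can be sketched on a single machine as follows. This is only an illustration under assumptions: the function name `sample_by_index_table` and its parameters are hypothetical, `n` is assumed to be known in advance, and the table of drawn positions is held in memory as a `Counter` so that a position drawn several times produces the record several times.

```python
import random
from collections import Counter


def sample_by_index_table(records, n, k, rng=random):
    """Sample k records with replacement using a table of pre-drawn positions.

    records: iterable over the n original records
    n: number of records (assumed known in advance)
    k: number of draws
    """
    # "Table A": how many times each 1-based position was drawn.
    draws = Counter(rng.randint(1, n) for _ in range(k))
    out = []
    for position, record in enumerate(records, start=1):
        # Emit the record once per draw of its position, so repeats are kept.
        out.extend([record] * draws.get(position, 0))
    return out
```

The output comes back in table order rather than draw order, and the memory held by `draws` grows with k, which is the limitation described next.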

I don't think my method is very good, because if I want to sample more records from the original table, table A will become very large and can no longer be loaded as a resource file.

So, could anyone please suggest an elegant algorithm?

Sayakiss

1 Answer

I'm not sure what "elegant" means, but perhaps you're interested in something analogous to reservoir sampling. Let k be the size of the sample and initialize a k-element array with nulls. The elements from which we are sampling arrive one by one. When the jth (counting from 1) element arrives, we iterate through the array and, for each cell, replace its contents by the current element independently with probability 1/j.
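
A minimal sketch of this naive procedure, assuming the stream is any Python iterable (the name `naive_sample_with_replacement` and the `rng` parameter are made up for illustration):

```python
import random


def naive_sample_with_replacement(stream, k, rng=random):
    """Naive O(k * n) reservoir-style sampling with replacement."""
    sample = [None] * k
    for j, x in enumerate(stream, start=1):
        for cell in range(k):
            # Each cell independently takes the j-th element with probability 1/j.
            # For j == 1 this always fires, so the initial nulls are overwritten.
            if rng.random() < 1.0 / j:
                sample[cell] = x
    return sample
```

Each cell on its own behaves like a size-1 reservoir, so after n elements every cell holds a uniformly random element, independently of the other cells.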

Naively, the running time is pretty bad -- to sample k elements from n with replacement costs O(k n). The number of writes into the array, however, is O(k log n) in expectation, because later elements in the stream rarely result in writes. Here's an efficient method based on the exponential distribution (warning: lightly tested Python ahead). The running time is O(n + k log n).

import math
import random


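def sample_from(population, k):
    """Draw k items from an iterable uniformly at random, with replacement."""
    sample = []  # returned as-is if the population is empty
    for i, x in enumerate(population):
        if i == 0:
            # The first element fills every slot.
            sample = [x] * k
        else:
            # Element i + 1 must overwrite each slot independently with
            # probability 1 / (i + 1). Start t at k * log(1 - 1 / (i + 1)),
            # which is negative, and write to a random slot for every
            # exponential gap that still leaves t below zero.
            t = float(k) * math.log(1.0 - 1.0 / float(i + 1))
            while True:
                t -= math.log(1.0 - random.random())  # add an Exp(1) gap
                if t >= 0.0:
                    break
                sample[random.randrange(k)] = x
    return sample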
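
One way to see why the exponential-gap loop reproduces the 1/j replacement rule is the short derivation below. It is a reading of the code rather than something spelled out in the answer; the symbols λ_j and W_j are introduced here for illustration, with j = i + 1 the 1-based index of the arriving element.

```latex
% j >= 2 is the index of the arriving element, k is the sample size.
% The loop starts at t = k \ln(1 - 1/j) < 0 and adds Exp(1) gaps until t >= 0,
% so the number of writes W_j is Poisson with mean \lambda_j:
\lambda_j = -k \ln\!\left(1 - \frac{1}{j}\right), \qquad W_j \sim \mathrm{Poisson}(\lambda_j)

% Each write lands in a uniformly random cell, so by Poisson thinning the writes
% into a fixed cell c are Poisson(\lambda_j / k), independently across cells:
\Pr[\text{cell } c \text{ is overwritten by element } j]
  = 1 - e^{-\lambda_j / k}
  = 1 - \left(1 - \frac{1}{j}\right)
  = \frac{1}{j}

% The expected number of writes after the first element telescopes:
\sum_{j=2}^{n} \lambda_j = k \sum_{j=2}^{n} \ln\frac{j}{j-1} = k \ln n
```

The last line matches the O(k log n) bound on writes, and the O(n) term comes from touching each element of the stream once.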
David Eisenstat