Python: downsample a population using frequency data

Question

Given a data series representing frequencies of elements in a population, what would be the easiest way to downsample it ?

The following population: pop = ['a', 'b', 'a', 'c', 'c', 'd', 'c', 'a', 'a', 'b', 'a']

Can be summeriezed as: freq = {'a': 5, 'c': 3, 'b': 2, 'd': 1}

Using the simple: from collections import Counter; Counter(pop)

To randomly downsample that population to 5 individuals I can do:

>>> from random import sample
>>> from collections import Counter
>>> pop = ['a', 'b', 'a', 'c', 'c', 'd', 'c', 'a', 'a', 'b', 'a']
>>> smaller_pop = sample(pop, 5)
>>> smaller_freq = Counter(smaller_pop)
>>> print smaller_freq
Counter({'a': 3, 'c': 1, 'b': 1})

But I'm searching for a way to do this directly from the freq information without having to build the pop list. You will agree that proceeding like this should not be necessary:

>>> from random import sample
>>> from collections import Counter
>>> flatten = lambda x: [item for sublist in x for item in sublist]
>>> freq = {'a': 5, 'c': 3, 'b': 2, 'd': 1}
>>> pop = flatten([[k]*v for k,v in freq.items()])
>>> smaller_pop = sample(pop, 5)
>>> smaller_freq = Counter(smaller_pop)
>>> print smaller_freq
Counter({'a': 2, 'c': 2, 'd': 1})

For memory considerations and speed requirements, I would like to avoid placing in memory the pop list. This can surely be done using some type of weighted random generator.

Would you be interested in an approximate solution? Since you are interested in downsampling, presumably the values in `freq` are very large. If so, you could floordiv (`//`) them all by some factor, then enumerate the smaller `freq` dict, and take a sample of that. — unutbu, Jun 13 '13 at 14:20

unutbu · Accepted Answer · 2013-06-14T10:22:57.517

Here is a basic algorithm that downsamples frequencies:

import random
import bisect
import collections

def downsample(freq, n):
    cumsums = []
    total = 0
    choices, weights = zip(*freq.items())
    for weight in weights:
        total += weight
        cumsums.append(total)
    assert 0 <= n <= total
    result = collections.Counter()
    for _ in range(n):
        rnd = random.uniform(0, total)
        i = bisect.bisect(cumsums, rnd)
        result[choices[i]] += 1
        cumsums = [c if idx<i else c-1 for idx, c in enumerate(cumsums)]
        total -= 1
    return result

freq = {'a': 5, 'c': 3, 'b': 2, 'd': 1}
print(downsample(freq, 5))

which prints results like

Counter({'c': 2, 'a': 1, 'b': 1, 'd': 1})

That's it, using the bisect. That seems logical now that I see it. Thanks. — xApple, Jun 13 '13 at 16:27

Python: downsample a population using frequency data

1 Answers1