
Essentially what I'm trying to do is randomly select items from a list while maintaining the internal distribution. See the following example.

a = 17%
b = 12%
c = 4%
etc.

"a" has 1700 items in the list. "b" has 1200 items in the list. "c" has 400 items in the list.

Instead of using all information, I want a sample that mimics the distribution of a, b, c, etc.

So the goal would be to end up with,

170 randomly selected items from "a"
120 randomly selected items from "b"
40 randomly selected items from "c"

I know how to randomly select information from the list, but I haven't been able to figure out how to randomly select while forcing the outcome to have the same distribution.

Stats_kid
  • You can't force the sample to resemble the population, it's random. – Patrick Haugh Apr 03 '17 at 19:42
  • Can you please clarify this? You have three lists, or you want to sub-divide a single sample into three lists randomly? – roganjosh Apr 03 '17 at 19:43
    For example [`numpy.random.choice`](https://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.random.choice.html) allows you to pass a probability parameter (a list of probabilities), but I'm struggling to understand what you're trying to do, so I don't know if it's appropriate. – roganjosh Apr 03 '17 at 19:46
  • I have to force it to resemble the population. There is a very specific reason for doing so. I understand that it is not "random" per se, but I still want a random selection of items within each category (a, b, c, etc). – Stats_kid Apr 03 '17 at 19:46
  • I want to randomly sample from a population, but I want to do so where each item in my list has a relative probability associated with it. So "a" would have a probability of .17 of being selected. If I can input probabilities like that, then the sample will resemble the population. – Stats_kid Apr 03 '17 at 19:48
    Possible duplicate of [Generating Discrete random variables with specified weights using SciPy or NumPy](http://stackoverflow.com/questions/11373192/generating-discrete-random-variables-with-specified-weights-using-scipy-or-numpy) – pjs Apr 05 '17 at 21:23
    See https://hips.seas.harvard.edu/blog/2013/03/03/the-alias-method-efficient-sampling-with-many-discrete-outcomes/ for a python implementation of the "alias method", which requires O(k) setup for a distribution with k outcomes, but is then O(1) per value to generate from. – pjs Apr 05 '17 at 21:25
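    For reference, a sketch of that alias method (Vose's variant) in plain Python — the function names are made up here, and the tables it builds match the O(k) setup / O(1) draw behavior described in the comment:

    ```python
    import random

    def build_alias(probs):
        """Vose's alias method: O(k) setup for a distribution with k outcomes."""
        k = len(probs)
        scaled = [p * k for p in probs]
        prob_table = [0.0] * k
        alias_table = [0] * k
        small = [i for i, p in enumerate(scaled) if p < 1.0]
        large = [i for i, p in enumerate(scaled) if p >= 1.0]
        while small and large:
            s, l = small.pop(), large.pop()
            prob_table[s] = scaled[s]
            alias_table[s] = l
            scaled[l] += scaled[s] - 1.0  # donate mass from the large column
            (small if scaled[l] < 1.0 else large).append(l)
        for i in large + small:  # leftovers are (numerically) exactly 1
            prob_table[i] = 1.0
        return prob_table, alias_table

    def alias_draw(prob_table, alias_table):
        """O(1) per draw: pick a column uniformly, then flip a biased coin."""
        i = random.randrange(len(prob_table))
        return i if random.random() < prob_table[i] else alias_table[i]
    ```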

6 Answers

5

If your lists aren't humongous and if memory isn't a problem, you could use this simple method.

To get n elements from a, b and c, you could concatenate the three lists together and pick random elements from the resulting list with random.choice:

import random

n = 50
a = ['a'] * 170
b = ['b'] * 120
c = ['c'] * 40
big_list = a + b + c
random_elements = [random.choice(big_list) for i in range(n)]
# ['a', 'c', 'a', 'a', 'a', 'b', 'a', 'c', 'b', 'a', 'c', 'a',
# 'a', 'a', 'a', 'b', 'b', 'a', 'a', 'a', 'a', 'a', 'c', 'a',
# 'c', 'a', 'b', 'a', 'a', 'c', 'a', 'b', 'a', 'c', 'b', 'a',
# 'a', 'b', 'a', 'b', 'a', 'a', 'c', 'a', 'c', 'a', 'b', 'c',
# 'b', 'b']

For each element, you'll get a len(a) / len(a + b + c) probability to get an element from a.

You might get the same element multiple times though. If you don't want this to happen, you could use random.shuffle.
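For instance, random.sample draws n distinct positions from the combined list, which is equivalent to shuffling and taking the first n (list sizes here follow the example above):

```python
import random

a = ['a'] * 170
b = ['b'] * 120
c = ['c'] * 40
big_list = a + b + c

# Each of the 330 positions is picked at most once, so no element is reused.
sample_without_replacement = random.sample(big_list, 50)
```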

Eric Duminil
    This is simple and might be enough here. But if performance is somehow important this naive approach is not spectacular. The kind of time-memory-tradeoff made here can be bad in regards to caching-behaviour in practice (and uses much more memory than needed; a lot of redundancy). – sascha Apr 03 '17 at 23:01
0

From my understanding, you have three distinct populations and you want to sample from these populations randomly, but with a skewed probability of picking certain populations. In this case, it's easier to first generate a list of indices randomly that correspond to each population (as I combined them into a single 2D array called combined).

Then you can traverse the list of randomly generated indices, which gives you the population you're going to choose from, and then randomly pick from that data using np.random.choice().

import numpy as np

sample_a = np.arange(1, 1000)
sample_b = np.arange(1001, 2000)
sample_c = np.arange(2001, 3000)

combined = np.vstack((sample_a, sample_b, sample_c))

distributions = [0.7, 0.2, 0.1] # The skewed probability distribution for sampling

sample = np.random.choice([0, 1, 2], size=10, p=distributions) # Choose indices with skewed probability

combined_pool = []

for arr in sample:
    combined_pool.append(np.random.choice(combined[arr]))
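As a quick sanity check (a sketch with a larger draw count, using NumPy's newer Generator API), the frequency of each population index should approach distributions:

```python
import numpy as np

rng = np.random.default_rng(0)
distributions = [0.7, 0.2, 0.1]
big_sample = rng.choice([0, 1, 2], size=100_000, p=distributions)

# Observed frequencies should be close to [0.7, 0.2, 0.1].
observed = np.bincount(big_sample, minlength=3) / big_sample.size
```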
roganjosh
0

One way to "mimic" such a distribution in your selection would be to simply combine the lists into one and then select the total needed number of items from that list. If the total number of items that needs to be selected is large, then this approximation will be good.

Note that it does not guarantee that exactly those quantities from each list will be selected. However, if the lists are large and there are many runs of this routine, the average should be good.

import random

total = a + b + c  # + ... any further lists
samples = []
number = len(total) // 10  # sample 10% of the combined list
for i in range(number):
    samples.append(total[random.randint(0, len(total) - 1)])
MadPhysicist
0

It's pretty easy to do this manually. Let's store your data in a list of (value, probability) objects:

data = [(a, 0.17), (b, 0.12), (c, 0.04), ...]

This is the function that will help you select random values that follow the probability distribution:

import random
def select_random_element(data):
    sample_proba = random.uniform(0, 1)
    total_proba = 0
    for (value, proba) in data:
        total_proba += proba
        if total_proba >= sample_proba:
            return value

Finally, this is how we select N random items:

random_items = [select_random_element(data) for _ in range(0, N)]

This does not require any additional memory. However, the time complexity is O(len(data)*N). This can be improved by sorting the data list by decreasing probability beforehand:

data = sorted(data, key=lambda i: i[1], reverse=True)
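A further speed-up (a sketch, not part of the answer above) is to precompute the cumulative probabilities once and binary-search them with bisect, making each draw O(log len(data)) instead of O(len(data)). The 'rest' entry is invented filler so the example weights sum to 1:

```python
import bisect
import itertools
import random

data = [('a', 0.17), ('b', 0.12), ('c', 0.04), ('rest', 0.67)]  # example weights

# Cumulative sums of the probabilities, computed once up front.
cumulative = list(itertools.accumulate(p for _, p in data))

def select_random_element_fast(data, cumulative):
    r = random.uniform(0, cumulative[-1])  # also handles totals != 1
    return data[bisect.bisect_left(cumulative, r)][0]
```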

Note that I assumed that the total probability of your data is 1. If not, you should write random.uniform(0, total_probability) instead of random.uniform(0, 1) in the above code, with:

total_probability = sum([i[1] for i in data])
Régis B.
0

A pandas Series/DataFrame has a `.sample()` method that accepts a `weights` series.

If you're using a DataFrame, the weights can be a column adjacent to the data.

Make your category totals that weights column, specify it in your `.sample()` call, and you're done.

https://pandas.pydata.org/docs/reference/api/pandas.Series.sample.html
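A minimal sketch of this (variable names invented for illustration), drawing category labels with replacement, weighted by the population counts from the question:

```python
import pandas as pd

counts = pd.Series({'a': 1700, 'b': 1200, 'c': 400})
labels = counts.index.to_series()

# `weights` aligns with `labels` by index; larger counts are drawn more often.
sample = labels.sample(n=330, replace=True, weights=counts, random_state=0)

print(sample.value_counts(normalize=True))  # roughly proportional to 1700:1200:400
```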

dave campbell
-1

Just use shuffle on your list, and take the first n elements.

Binyamin Even
  • On which list? OP has at least 3. Note: I didn't downvote. `shuffle` is an interesting idea because it would avoid getting duplicate elements. – Eric Duminil Apr 03 '17 at 20:52