Why does random sampling scale with the dataset not the sample size? (pandas .sample() example)

Question

When sampling randomly from distributions of varying sizes I was surprised to observe that execution time seems to scale mostly with the size of the dataset being sampled from, not the number of values being sampled. Example:

import pandas as pd
import numpy as np
import time as tm

#generate a small and a large dataset
testSeriesSmall = pd.Series(np.random.randn(10000))
testSeriesLarge = pd.Series(np.random.randn(10000000))

sampleSize = 10
tStart = tm.time()
currSample = testSeriesLarge.sample(n=sampleSize).values
print('sample %d from %d values: %.5f s' % (sampleSize, len(testSeriesLarge), (tm.time() - tStart)))

tStart = tm.time()
currSample = testSeriesSmall.sample(n=sampleSize).values
print('sample %d from %d values: %.5f s' % (sampleSize, len(testSeriesSmall), (tm.time() - tStart)))

sampleSize = 1000
tStart = tm.time()
currSample = testSeriesLarge.sample(n=sampleSize).values
print('sample %d from %d values: %.5f s' % (sampleSize, len(testSeriesLarge), (tm.time() - tStart)))

tStart = tm.time()
currSample = testSeriesSmall.sample(n=sampleSize).values
print('sample %d from %d values: %.5f s' % (sampleSize, len(testSeriesSmall), (tm.time() - tStart)))

The output is:

sample 10 from 10000 values: 0.00126 s
sample 10 from 10000000 values: 1.10504 s
sample 1000 from 10000 values: 0.00122 s
sample 1000 from 10000000 values: 1.15000 s

This seems counterintuitive. Maybe I'm dense, but the problem seems similar to generating a list of random indices, and I would have expected the number of values sampled to matter and the size of the dataset not to matter much. I've tried another implementation or two with similar results and it's beginning to feel like I'm just missing a fundamental issue, though.

My questions are twofold: (1) Is this a fundamental issue or a quirk of implementation in pandas? (2) Is there a significantly faster approach that can be taken to randomly sample from large datasets in this way?

John Zwinck · Accepted Answer · 2017-03-25T02:22:54.803

pandas.Series.sample() in your case boils down to this:

rs = np.random.RandomState()
locs = rs.choice(axis_length, size=n, replace=False)
return self.take(locs)

The slow part is rs.choice():

%timeit rs.choice(100000000, size=1, replace=False)
1 loop, best of 3: 9.43 s per loop

It takes about 10 seconds to generate a single random number! If you divide the first argument by 10, it takes about 1 second. That's slow!

If you use replace=True it's super fast. So that's one workaround for you, if you don't mind having duplicate entries in your results.

The NumPy documentation for choice(replace=False) says:

This is equivalent to np.random.permutation(np.arange(5))[:3]

Which pretty much explains the problem--it generates a huge array of possible values, shuffles them, then takes the first N. This is the root cause of your performance problem, and has already been reported as an issue in NumPy here: https://github.com/numpy/numpy/pull/5158

It is apparently difficult to fix in NumPy, because people rely on the result of choice() not changing (between versions of NumPy) when using the same random seed value.

Since your use case is quite narrow, you can do something like this:

def sample(series, n):
    locs = np.random.randint(0, len(series), n*2)
    locs = np.unique(locs)[:n]
    assert len(locs) == n, "sample() assumes n << len(series)"
    return series.take(locs)

That gives much faster times:

sample 10 from 10000 values: 0.00735 s
sample 10 from 1000000 values: 0.00944 s
sample 10 from 100000000 values: 1.44148 s
sample 1000 from 10000 values: 0.00319 s
sample 1000 from 1000000 values: 0.00802 s
sample 1000 from 100000000 values: 0.01989 s
sample 100000 from 1000000 values: 0.05178 s
sample 100000 from 100000000 values: 0.93336 s

Oh, geeze. I would have expected "equivalent to" to just be describing the semantics, but apparently it's [actually implemented that way](https://github.com/numpy/numpy/blob/3e297ba7e47c949469744ac67cf296b4315ceea9/numpy/random/mtrand/mtrand.pyx#L1174). That is slow as hell. — user2357112, Mar 25 '17 at 02:11

Ted Petrou · Answer 2 · 2017-03-25T02:13:37.850

This looks to be an internal numpy issue. I believe the pandas sample method calls numpy.random.choice. Let's take a look at how numpy performs with various array sizes and sample sizes.

Create arrays

large = np.arange(1000000)
small = np.arange(1000)

Time the sample without replacement

%timeit np.random.choice(large, 10, replace=False)
10 loops, best of 3: 27.4 ms per loop

%timeit np.random.choice(small, 10, replace=False)
10000 loops, best of 3: 41.4 µs per loop

Time the sample with replacement

%timeit np.random.choice(large, 10, replace=True)
100000 loops, best of 3: 11.7 µs per loop

%timeit np.random.choice(small, 10, replace=True)
100000 loops, best of 3: 12.2 µs per loop

Very interestingly, when doing the sample without replacement, the large array is taking nearly 3 orders of magnitude longer and it is exactly three orders of magnitude as large. This suggests to me that numpy is randomly sorting the array and then taking the first 10 items.

When sampling with replacement, each value is independently chosen so the timings are identical.

Why does random sampling scale with the dataset not the sample size? (pandas .sample() example)

2 Answers2