I created a test data set with 6 million rows but only 2 columns and timed a few sampling methods (the two you posted plus df.sample with the n parameter). The sampling took a little more than 200 ms for each of the methods, which I think is reasonably fast. My data set has considerably fewer columns, which could be one reason. Nevertheless, I think your problem is caused not by the sampling itself but by loading the data and keeping it in memory.
I have not used Dask before, but I assume it uses some logic to cache data from disk or network storage. Depending on the access pattern, the caching may not work well, so chunks of the data might have to be loaded from potentially slow storage for every sample drawn.
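One way to check this would be to time the load step and the sampling step separately. The sketch below uses plain pandas on a small temporary CSV; the file name and the sizes are placeholders, not your setup. With Dask the load is lazy, so you would have to time the point where the computation is actually triggered.

```python
import os
import random
import tempfile
import time

import pandas as pd

# Hypothetical stand-in for the real data: a small CSV in a temp directory.
csv_path = os.path.join(tempfile.mkdtemp(), "timing_demo.csv")
n_total = 100_000
pd.DataFrame({
    "s_val": ["item #" + str(i) for i in range(n_total)],
    "f_val": [random.random() for _ in range(n_total)],
}).to_csv(csv_path, index=False)

# Step 1: time loading the data from disk.
t0 = time.perf_counter()
df_loaded = pd.read_csv(csv_path)
t_load = time.perf_counter() - t0

# Step 2: time drawing the sample from the in-memory frame.
t0 = time.perf_counter()
sample = df_loaded.sample(n=5000, random_state=10)
t_sample = time.perf_counter() - t0

print("load:   {:.4f} s".format(t_load))
print("sample: {:.4f} s".format(t_sample))
```

If the load time dominates, tuning the sampling call itself will not help much.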
If supported by Dask, a possible solution could be to draw indices of sampled data set entries (as in your second method) before actually loading the whole data set and to only load the sampled entries. Is that an option?
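A minimal sketch of that idea, using plain pandas and a CSV file rather than Dask (both are assumptions, since I do not know your storage format): draw the row positions first, then let read_csv skip every other row through a callable skiprows. The helper read_sampled_rows, the file name demo.csv and the sizes are made up for illustration.

```python
import os
import random
import tempfile

import pandas as pd


def read_sampled_rows(path, n_rows, n_samples, seed=None):
    """Read only n_samples randomly chosen data rows from a CSV file.

    n_rows is the number of data rows (header excluded); it has to be
    known up front, e.g. from metadata or a cheap line count.
    """
    rng = random.Random(seed)
    # File line 0 is the header, so data row k lives on file line k + 1.
    keep = set(rng.sample(range(1, n_rows + 1), n_samples))
    keep.add(0)  # always keep the header line
    return pd.read_csv(path, skiprows=lambda line: line not in keep)


# Hypothetical demo data: a small CSV written to a temporary directory.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "demo.csv")
n_total = 10_000
pd.DataFrame({
    "s_val": ["item #" + str(i) for i in range(n_total)],
    "f_val": [random.random() for _ in range(n_total)],
}).to_csv(csv_path, index=False)

df_s = read_sampled_rows(csv_path, n_rows=n_total, n_samples=50, seed=10)
print(df_s.shape)
```

The file is still scanned once, but only the sampled rows are parsed into the DataFrame, so the full data set never has to sit in memory.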
Update:
Dask claims that row-wise selections such as df[df.x > 0] can be computed quickly and in parallel (https://docs.dask.org/en/latest/dataframe.html). Maybe you can try something like this:
sampled_indices = random.sample(range(len(df)), NSAMPLES)
# "df.index in sampled_indices" does not build a row mask; isin() does
df_s = df[df.index.isin(sampled_indices)]
Here is the code I used for timing and some results:
import numpy as np
import pandas as pd
import random
import timeit

# Build a test frame with 6 million rows and 2 columns
data = {
    's_val': list(),
    'f_val': list()}
for i in range(int(6e6)):
    data['s_val'].append('item #' + str(i))
    data['f_val'].append(random.random())
df = pd.DataFrame(data)

NSAMPLES = 5000
NRUNS = 50

# The three sampling methods to compare
methods = [
    lambda: df.sample(n=NSAMPLES, replace=False, random_state=10),
    lambda: df.sample(frac=NSAMPLES/len(df), replace=False, random_state=10),
    lambda: df.loc[np.random.choice(df.index, size=NSAMPLES, replace=False)],
]

# Report the average wall time per call over NRUNS runs
for i, f in enumerate(methods):
    print('avg. time method {}: {} s'.format(
        i, timeit.timeit(f, number=NRUNS) / NRUNS))

Example results:
avg. time method 0: 0.21715480241997284 s
avg. time method 1: 0.21541569983994122 s
avg. time method 2: 0.21495854450011392 s