1

I have a huge file that I read with Dask (Python). The file is around 6 million rows and 550 columns. I would like to select a random sample of 5000 records (without replacement). Here are the 2 methods that I tried, but it takes a huge amount of time to run (I stopped after more than 13 hours):


df_s=df.sample(frac=5000/len(df), replace=None, random_state=10)


NSAMPLES=5000
samples = np.random.choice(df.index, size=NSAMPLES, replace=False)
df_s=df.loc[samples]

I am not sure that these are appropriate methods for Dask data frames. Is there a faster way to select records randomly for huge data frames?

Falco
  • This is the only SO post I could find about this topic. Perhaps trying some slightly different code per the accepted answer will help: https://stackoverflow.com/questions/39175963/sampling-n-2000-from-a-dask-dataframe-of-len-18000-generates-error-cannot-take – David Erickson Jul 14 '20 at 08:34
  • I found something on the *Dask* website and updated my answer. – Stefan Scheller Jul 14 '20 at 08:48
  • @Falco Did you find a solution for this? If yes, can you please post it? I'm looking for the same and haven't found anything. – Xyz Feb 25 '21 at 22:37
  • @LoneWalker unfortunately I have not found any solution for this...I hope someone else can help! – Falco Mar 02 '21 at 14:31

2 Answers

1

I created a test data set with 6 million rows but only 2 columns and timed a few sampling methods (the two you posted plus df.sample with the n parameter). The sampling took a little more than 200 ms for each of the methods, which I think is reasonably fast. My data set has considerably fewer columns, so that could be part of it; nevertheless, I think your problem is not caused by the sampling itself but rather by loading the data and keeping it in memory.

I have not used Dask before, but I assume it uses some logic to cache the data from disk or network storage. Depending on the access pattern, the caching may not work well, and chunks of the data may have to be loaded from potentially slow storage for every drawn sample.

If Dask supports it, a possible solution could be to draw the indices of the sampled entries (as in your second method) before actually loading the whole data set, and then load only those entries. Is that an option?
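
For instance, if the source were a single CSV file, something along these lines could keep only the sampled rows while parsing. This is just a sketch with plain pandas, assuming a hypothetical file data.csv with one header row and a known number of data rows:

import random
import pandas as pd

N_ROWS = 6_000_000   # assumed known number of data rows in the file
NSAMPLES = 5000

# Row 0 is the header, data rows are 1..N_ROWS; pick 5000 of them without replacement.
keep = set(random.sample(range(1, N_ROWS + 1), NSAMPLES))

# skiprows accepts a callable, so pandas discards every non-sampled row while
# parsing and only the 5000 selected records are ever held in memory.
df_s = pd.read_csv('data.csv', skiprows=lambda i: i != 0 and i not in keep)

This still scans the whole file once, but it never keeps all 6 million rows in memory at the same time.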


Update:

Dask claims that row-wise selections like df[df.x > 0] can be computed fast and in parallel (https://docs.dask.org/en/latest/dataframe.html). Maybe you can try something like this:

sampled_indices = random.sample(range(len(df)), NSAMPLES)
df_s = df[df.index.isin(sampled_indices)]
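
If df is a Dask dataframe, note that this selection is lazy, so a final df_s = df_s.compute() (or df_s.persist()) would be needed to actually materialise the 5000 sampled rows.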

Here is the code I used for timing and some results:

import numpy as np
import pandas as pd
import random
import timeit

# Build a test frame with 6 million rows and two columns (one string, one float).
data = {
    's_val': list(),
    'f_val': list()}
for i in range(int(6e6)):
    data['s_val'].append('item #' + str(i))
    data['f_val'].append(random.random())
df = pd.DataFrame(data)

NSAMPLES = 5000
NRUNS = 50
# The two methods from the question plus df.sample with the n parameter.
methods = [
    lambda: df.sample(n=NSAMPLES, replace=None, random_state=10),
    lambda: df.sample(frac=NSAMPLES/len(df), replace=None, random_state=10),
    lambda: df.loc[np.random.choice(df.index, size=NSAMPLES, replace=False)],
    ]
for i, f in enumerate(methods):
    print('avg. time method {}: {} s'.format(
        i, timeit.timeit(f, number=NRUNS) / NRUNS))


Example results:

avg. time method 0: 0.21715480241997284 s
avg. time method 1: 0.21541569983994122 s
avg. time method 2: 0.21495854450011392 s

Stefan Scheller
  • Thank you for your answer! I don't know why it is so slow. When I do row-wise selections (like df[df.x > 0]), merging, etc. it is really fast, but it is very slow for other operations like len(df) (this takes a while with Dask even though it is very fast with Pandas). But I cannot convert my file into a Pandas DataFrame because it is too big for memory. – Falco Jul 14 '20 at 09:32
  • @Pauline I think that is caused by how *Dask* handles the data partitions. Maybe you can accelerate your code by applying some of the Best Practices: https://docs.dask.org/en/latest/best-practices.html – Stefan Scheller Jul 14 '20 at 09:36
  • @Falco, are you doing any operations before the len(df)? Because then Dask will need to execute all those before it can determine the length of df. Try doing a df = df.persist() before the len(df) and see if it still takes so long. – Danferno Sep 26 '22 at 08:51
1

I think the problem might be coming from the len(df) in your first example.

When len is called on the Dask dataframe, it tries to compute the total number of rows, which I think might be what's slowing you down.

If you know the length of the dataframe is 6M rows, then I'd suggest changing your first example to something like this, so that len(df) never has to be computed on the Dask dataframe:

num_rows = 6_000_000  # known row count, avoids triggering a full count
df_s = df.sample(frac=5000/num_rows, replace=None, random_state=10)

or simply

df_s = df.sample(frac=5000/6_000_000, replace=None, random_state=10)

If you're absolutely sure you want to use len(df), you might want to consider how you're loading up the dask dataframe in the first place.

For example, if you're reading a single CSV file on disk, it'll take a fairly long time, since the data you'll be working with (assuming all numerical, 64-bit float/int data for the sake of argument) comes to 6 million rows * 550 columns * 8 bytes = 26.4 GB.

This is because dask is forced to read all of the data when it's in CSV format. The problem gets even worse with str or other data types, and then you also have to factor in the disk read time.

In comparison, working with Parquet becomes much easier, since Parquet stores file metadata, which generally speeds things up, and I believe much less data needs to be read.
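
A rough sketch of that workflow (hypothetical paths data.csv and data_parquet/; writing Parquet from Dask requires pyarrow or fastparquet to be installed):

import dask.dataframe as dd

# One-off conversion: read the CSV lazily and write it back out as Parquet.
df = dd.read_csv('data.csv')
df.to_parquet('data_parquet/')

# Later runs read the Parquet metadata instead of re-parsing the whole CSV,
# and the sample stays lazy until .compute() is called.
df = dd.read_parquet('data_parquet/')
df_s = df.sample(frac=5000 / 6_000_000, random_state=10).compute()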

Ali Abbas