Sampling n= 2000 from a Dask Dataframe of len 18000 generates error Cannot take a larger sample than population when 'replace=False'

Question

I have a Dask DataFrame created from a CSV file. len(daskdf) returns 18000, but when I run ddSample = daskdf.sample(2000) I get the error:

ValueError: Cannot take a larger sample than population when 'replace=False'

Can I sample without replacement if the dataframe is larger than the sample size?

score 16 · Accepted Answer · edited Jun 30 '22 at 09:39

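Dask's sample method only supports the frac= keyword argument, so the 2000 in daskdf.sample(2000) is treated as a fraction (frac=2000) rather than as a row count. See the API documentation.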

The error that you're getting is from Pandas, not Dask.

In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [1]})
In [3]: df.sample(frac=2000, replace=False)
ValueError: Cannot take a larger sample than population when 'replace=False'

Solution 1

As the Pandas error suggests, consider sampling with replacement:

In [4]: df.sample(frac=2, replace=True)
Out[4]: 
   x
0  1
0  1

In [5]: import dask.dataframe as dd
In [6]: ddf = dd.from_pandas(df, npartitions=1)
In [7]: ddf.sample(frac=2, replace=True).compute()
Out[7]: 
   x
0  1
0  1

Solution 2

This may help someone.

I found this somewhere and cannot remember where.

It returns the expected results without raising the error. (This is for pandas; I don't know whether it works in Dask.)

import pandas as pd

df = pd.DataFrame({'a': [1,2,3,4,5,6,7],
                   'b': [1,1,1,2,2,3,3]})

# Fixed sample size: raises ValueError if any group has fewer rows than n
df.groupby('b').apply(pd.DataFrame.sample, n=1)

# Flexible size via min(): never raises, returns at most 3 rows per group
df.groupby(['b'], as_index=False, group_keys=False).apply(
    lambda x: x.sample(min(3, len(x)))
)
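
The same min() idea also covers the ungrouped case raised in the comments (at most 5 rows, no repeats, fewer if the frame is smaller). This is only a sketch for plain pandas; sample_at_most is a hypothetical helper name, not a pandas or Dask API.

```python
import pandas as pd

def sample_at_most(df, n, random_state=None):
    """Return up to n distinct rows (hypothetical helper, not a pandas API)."""
    # Cap n at the number of available rows so replace=False never raises.
    return df.sample(n=min(n, len(df)), replace=False, random_state=random_state)

small = pd.DataFrame({"a": [1]})
large = pd.DataFrame({"a": range(7)})

print(len(sample_at_most(small, 5)))  # 1
print(len(sample_at_most(large, 5)))  # 5
```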

answered Aug 26 '16 at 23:44 by MRocklin · edited Jun 30 '22 at 09:39 by ihightower

Sorry, but I'm still having trouble understanding it. `sampledf = df.sample(frac=2000)` still generates the error when I call `sampledf.head()`. – mobcdi Aug 27 '16 at 00:07

Yes, pandas can't generate a larger sample than the population. Frac needs to be less than one, like `df.sample(frac=0.10)`. Alternatively perhaps you want to set `replace=True`? – MRocklin Aug 27 '16 at 00:38

Is there any way to use `n=5` with `replace=False` so that pandas picks a random sample with a maximum of 5 rows? If the dataframe has only 1 record, it should just return that 1 record instead of 5. I don't want to use replace=True, which would repeat the record 5 times; I don't want repeats. I don't want to use frac either, since I know I don't want more than 5. – ihightower Feb 11 '22 at 16:11
@ihightower I am facing the same problem. If you found a solution, can you please share it? – 10sha25 Apr 21 '22 at 14:45
@10sha25 Hi, I have posted my answer further below, and it has also been added as Solution 2 above. Hope this helps. – ihightower Jun 30 '22 at 09:37
The `frac` argument should be given a value between 0 and 1. The answer would be better if it used a fraction that gives the size you want, for example 2000/18000. – t T s Sep 22 '22 at 19:43

score 1 · Answer 2 · answered Jun 30 '22 at 09:35

I found this somewhere and cannot remember where.

It returns the expected results without raising the error. (This is for pandas; I don't know whether it works in Dask.)

import pandas as pd

df = pd.DataFrame({'a': [1,2,3,4,5,6,7],
                   'b': [1,1,1,2,2,3,3]})

# Fixed sample size: raises ValueError if any group has fewer rows than n
df.groupby('b').apply(pd.DataFrame.sample, n=1)

# Flexible size via min(): never raises, returns at most 3 rows per group
df.groupby(['b'], as_index=False, group_keys=False).apply(
    lambda x: x.sample(min(3, len(x)))
)

score 0 · Answer 3 · answered Aug 18 '21 at 14:01

In the sample method, set the replace parameter to True.

df.sample(n=samples, replace=True)

The question indicates that the DataFrame is larger than the number of samples needed, so this is only a temporary workaround.

answered Aug 18 '21 at 14:01 by Sathiamoorthy

score 0 · Answer 4 · answered Sep 13 '21 at 20:43

Maybe the point is that he wants to extract a sample of rows from the original dataframe, so imho I think you should specify axis=0 to sample from rows.

answered Sep 13 '21 at 20:43 by Luis Correia
