random sample per group, with min_rows

Question

I have a dataframe that I want to sample randomly. However, the sample must contain at least one row for every value in a column. I also want the original distribution to have an effect (e.g. values with more rows in the original df get more rows in the sampled df).

Similar to this and this question, but with minimum sample size per group.

Let's say this is my df:

import pandas as pd

df = pd.DataFrame(columns=['class'])
df['class'] = [0,0,0,0,0,0,0,0,0,0,0,0,0,1,2]
df_sample = df.sample(n=4)

And when I sample this I want the df_sample to look like:

Thank you.

How about using set to get all the unique items, and then using sample(n=len(set) - prev_n) to sample from the data? — Ando, Sep 09 '20 at 12:15

score 0 · Answer 1 · answered Sep 09 '20 at 15:11

As suggested by @YukiShioriii, you could:

1 - sample one row of each group of values

2 - randomly sample over the remaining rows regardless of the values
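The two steps above could be sketched roughly as follows. This is only a minimal sketch: the helper name sample_with_min_per_group and its parameters are made up for illustration, and it assumes pandas 1.1 or later (for groupby(...).sample) and a dataframe with a unique index.

```python
import pandas as pd


def sample_with_min_per_group(df, column, n, random_state=None):
    """Hypothetical helper: return n rows with at least one row per value of `column`.

    Assumes df has a unique index, since the rows already picked are
    dropped by index label.
    """
    n_groups = df[column].nunique()
    if n < n_groups:
        raise ValueError(f"n={n} is smaller than the number of groups ({n_groups})")

    # Step 1: one random row from each group of values
    first = df.groupby(column).sample(n=1, random_state=random_state)

    # Step 2: fill up from the rows not picked yet, regardless of the values,
    # so frequent values are more likely to be drawn
    rest = df.drop(first.index).sample(n=n - len(first), random_state=random_state)

    return pd.concat([first, rest]).sort_index()


df = pd.DataFrame({"class": [0] * 13 + [1, 2]})
df_sample = sample_with_min_per_group(df, "class", n=4, random_state=1)
```

With the df from the question, the classes 1 and 2 each have a single row, so the result always contains both of them plus two rows of class 0; only which rows of class 0 are drawn depends on random_state.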

Mathieu

score 0 · Answer 2 · answered Sep 09 '20 at 15:24

Following YukiShioriii's and mprouveur's suggestion:

import pandas as pd

sample_size = 4  # total number of rows wanted (n=4 in the question)

# random_state for reproducibility, remove in production code
sample = df.groupby('class').sample(1, random_state=1)

# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
sample = pd.concat([
    sample,
    df[~df.index.isin(sample.index)]  # only rows that have not been selected
    .sample(n=sample_size - sample.shape[0]),  # sample more rows as needed
]).sort_index()

Output
