I have a dataframe that I want to sample randomly. However, the sample should contain at least one row for every distinct value in a column. I also want the original distribution to carry over (e.g., values that have more rows in the original should have more rows in the sampled df).

Similar to this and this question, but with minimum sample size per group.

Let's say this is my df:

import pandas as pd

df = pd.DataFrame(columns=['class'])
df['class'] = [0,0,0,0,0,0,0,0,0,0,0,0,0,1,2]
df_sample = df.sample(n=4)

When I sample this, I want df_sample to look like:

    class
        0
        0
        1
        2

Thank you.

RichieV
s900n
  • How about using set to get all the unique items and then using sample(n=len(set) -prev_n) to sample from the data? – Ando Sep 09 '20 at 12:15

2 Answers

As suggested by @YukiShioriii, you could:

1 - sample one row of each group of values

2 - randomly sample over the remaining rows regardless of the values
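
The two steps above can be sketched as a small helper. This is only an illustration, not a definitive implementation: the function name sample_with_min_one and its parameters are hypothetical, it assumes the dataframe has a unique index, and it relies on GroupBy.sample, which needs pandas 1.1 or later. Because step 2 draws from all remaining rows, values that are more frequent in the original tend to be more frequent in the sample.

```python
import pandas as pd


def sample_with_min_one(df, column, n, random_state=None):
    """Sample n rows, keeping at least one row per distinct value in `column`.

    Hypothetical helper for illustration; assumes df has a unique index.
    """
    n_groups = df[column].nunique()
    if n < n_groups:
        raise ValueError(f"n={n} is smaller than the number of groups ({n_groups})")
    if n > len(df):
        raise ValueError(f"n={n} is larger than the number of rows ({len(df)})")

    # Step 1: one random row from each group of values.
    first = df.groupby(column).sample(n=1, random_state=random_state)

    # Step 2: fill up to n from the rows not chosen yet, regardless of value.
    rest = df.drop(first.index).sample(n=n - len(first), random_state=random_state)

    return pd.concat([first, rest]).sort_index()


df = pd.DataFrame({"class": [0] * 13 + [1, 2]})
result = sample_with_min_one(df, "class", n=4, random_state=1)
print(result)
```

With the example data, classes 1 and 2 each have a single row, so both are always picked in step 1 and the one extra row necessarily comes from class 0.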

Mathieu

Following YukiShioriii's and mprouveur's suggestions:

sample_size = 4  # total number of rows wanted

# random_state for reproducibility, remove in production code
sample = df.groupby('class').sample(1, random_state=1)

# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
sample = pd.concat([
    sample,
    df[~df.index.isin(sample.index)]  # only rows that have not been selected
    .sample(n=sample_size - sample.shape[0])  # sample more rows as needed
]).sort_index()

Output (the second sampling step is not seeded, so the extra class-0 row varies between runs)

    class
2       0
4       0
13      1
14      2
RichieV