Sampling with conditions using Pandas

Question

Here is my df:

scenario	month	id	type
A	2023-01	A01	HR
A	2023-02	A02	LR
A	2023-04	A04	HR
A	2023-04	A06	HR
B	2023-01	B01	LR
B	2023-02	B02	LR
B	2023-03	B03	HR
B	2023-03	B04	LR
B	2023-03	B05	HR
B	2023-03	B06	HR
B	2023-04	B07	HR

scenario	sample_num
A	2
B	4

I want to take samples based on the scenarios, with the number of samples from each 'month' and 'type' should be equal (or close to each other).

If the required sample size is less than the total number of unique values of 'month', 'month' doesn't matter as long as condition on 'type' is met.

The desired result should be like this:

scenario	month	id	type
A	2023-01	A01	HR
A	2023-02	A02	LR
B	2023-01	B01	LR
B	2023-02	B02	LR
B	2023-03	B03	HR
B	2023-04	B07	HR

I have thought of many solutions, but none really solves the problem.

mozway · Answer 1 · 2023-05-09T09:33:07.513

general logic

Make a mapping Series, and use groupby.apply and sample:

mapper = df2.set_index('scenario')['sample_num']
out = (df.groupby('scenario', group_keys=False)
         .apply(lambda g: g.sample(n=mapper[g.name]))
      )

Or by shuffling the whole DataFrame, then selecting with a groupby.cumcount:

mapper = df2.set_index('scenario')['sample_num']

out = (
 df.sample(frac=1)
   .loc[lambda d: d.groupby('scenario').cumcount().lt(d['scenario'].map(mapper))]
)

Example output:

   scenario    month   id type
2         A  2023-04  A04   HR
3         A  2023-04  A06   HR
10        B  2023-04  B07   HR
6         B  2023-03  B03   HR
7         B  2023-03  B04   LR
5         B  2023-02  B02   LR

taking into account the type

Now keep the same logic and sample 1/n rows where n is the number of types:

n_types = df['type'].nunique()
mapper = df2.set_index('scenario')['sample_num']//n_types

out = (df.groupby(['scenario', 'type'], group_keys=False)
         .apply(lambda g: g.sample(n=mapper[g.name[0]]))
       )

Or:

n_types = df['type'].nunique()
mapper = df2.set_index('scenario')['sample_num']//n_types

(df.sample(frac=1)
   .loc[lambda d: d.groupby(['scenario', 'type'])
                   .cumcount().lt(d['scenario'].map(mapper))]
)

Or with a custom function if the types should be considered independently per group:

mapper = df2.set_index('scenario')['sample_num']

def sample_equal(df, n):
    n_types = df['type'].nunique()
    return df.groupby('type').sample(n=n//n_types)

out = (df.groupby('scenario', group_keys=False)
         .apply(lambda g: sample_equal(g, n=mapper[g.name]))
      )

Example output:

  scenario    month   id type
2        A  2023-04  A04   HR
1        A  2023-02  A02   LR
6        B  2023-03  B03   HR
8        B  2023-03  B05   HR
7        B  2023-03  B04   LR
4        B  2023-01  B01   LR

`I want to take samples based on the scenarios, with the number of samples from each 'month' and 'type' should be equal (or close to each other).` — jezrael, May 09 '23 at 09:19
`n_types = df['type'].nunique()` should be `n_types = df.groupby('scenario')['type'].nunique()` ? — jezrael, May 09 '23 at 09:27
@jezrael I don't think so, unless OP really has very different types, in which case this could be added in the `apply` easily. I added an example. — mozway, May 09 '23 at 09:28

Sampling with conditions using Pandas

1 Answers1

general logic

taking into account the type