0

Here is my df:

scenario month id type
A 2023-01 A01 HR
A 2023-02 A02 LR
A 2023-04 A04 HR
A 2023-04 A06 HR
B 2023-01 B01 LR
B 2023-02 B02 LR
B 2023-03 B03 HR
B 2023-03 B04 LR
B 2023-03 B05 HR
B 2023-03 B06 HR
B 2023-04 B07 HR
scenario sample_num
A 2
B 4

I want to take samples based on the scenarios, with the number of samples from each 'month' and 'type' should be equal (or close to each other).

If the required sample size is less than the total number of unique values of 'month', 'month' doesn't matter as long as condition on 'type' is met.

The desired result should be like this:

scenario month id type
A 2023-01 A01 HR
A 2023-02 A02 LR
B 2023-01 B01 LR
B 2023-02 B02 LR
B 2023-03 B03 HR
B 2023-04 B07 HR

I have thought of many solutions, but none really solves the problem.

1 Answers1

0

general logic

Make a mapping Series, and use groupby.apply and sample:

mapper = df2.set_index('scenario')['sample_num']
out = (df.groupby('scenario', group_keys=False)
         .apply(lambda g: g.sample(n=mapper[g.name]))
      )

Or by shuffling the whole DataFrame, then selecting with a groupby.cumcount:

mapper = df2.set_index('scenario')['sample_num']

out = (
 df.sample(frac=1)
   .loc[lambda d: d.groupby('scenario').cumcount().lt(d['scenario'].map(mapper))]
)

Example output:

   scenario    month   id type
2         A  2023-04  A04   HR
3         A  2023-04  A06   HR
10        B  2023-04  B07   HR
6         B  2023-03  B03   HR
7         B  2023-03  B04   LR
5         B  2023-02  B02   LR

taking into account the type

Now keep the same logic and sample 1/n rows where n is the number of types:

n_types = df['type'].nunique()
mapper = df2.set_index('scenario')['sample_num']//n_types

out = (df.groupby(['scenario', 'type'], group_keys=False)
         .apply(lambda g: g.sample(n=mapper[g.name[0]]))
       )

Or:

n_types = df['type'].nunique()
mapper = df2.set_index('scenario')['sample_num']//n_types

(df.sample(frac=1)
   .loc[lambda d: d.groupby(['scenario', 'type'])
                   .cumcount().lt(d['scenario'].map(mapper))]
)

Or with a custom function if the types should be considered independently per group:

mapper = df2.set_index('scenario')['sample_num']

def sample_equal(df, n):
    n_types = df['type'].nunique()
    return df.groupby('type').sample(n=n//n_types)

out = (df.groupby('scenario', group_keys=False)
         .apply(lambda g: sample_equal(g, n=mapper[g.name]))
      )

Example output:

  scenario    month   id type
2        A  2023-04  A04   HR
1        A  2023-02  A02   LR
6        B  2023-03  B03   HR
8        B  2023-03  B05   HR
7        B  2023-03  B04   LR
4        B  2023-01  B01   LR
mozway
  • 194,879
  • 13
  • 39
  • 75
  • `I want to take samples based on the scenarios, with the number of samples from each 'month' and 'type' should be equal (or close to each other).` – jezrael May 09 '23 at 09:19
  • `n_types = df['type'].nunique()` should be `n_types = df.groupby('scenario')['type'].nunique()` ? – jezrael May 09 '23 at 09:27
  • @jezrael I don't think so, unless OP really has very different types, in which case this could be added in the `apply` easily. I added an example. – mozway May 09 '23 at 09:28