Here is my df:
scenario | month | id | type |
---|---|---|---|
A | 2023-01 | A01 | HR |
A | 2023-02 | A02 | LR |
A | 2023-04 | A04 | HR |
A | 2023-04 | A06 | HR |
B | 2023-01 | B01 | LR |
B | 2023-02 | B02 | LR |
B | 2023-03 | B03 | HR |
B | 2023-03 | B04 | LR |
B | 2023-03 | B05 | HR |
B | 2023-03 | B06 | HR |
B | 2023-04 | B07 | HR |

And here is the number of samples required for each scenario:

scenario | sample_num |
---|---|
A | 2 |
B | 4 |

I want to sample rows from each scenario so that the number of samples per 'month' and per 'type' is equal (or as close to equal as possible).
If the required sample size is smaller than the number of unique 'month' values, 'month' doesn't matter as long as the condition on 'type' is met.
The desired result should be like this:
scenario | month | id | type |
---|---|---|---|
A | 2023-01 | A01 | HR |
A | 2023-02 | A02 | LR |
B | 2023-01 | B01 | LR |
B | 2023-02 | B02 | LR |
B | 2023-03 | B03 | HR |
B | 2023-04 | B07 | HR |

I have thought of many solutions, but none of them really solves the problem.
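
To make the requirement concrete, here is a rough sketch of one greedy idea, not a definitive solution. It assumes the two tables are loaded as `df` and `sample_sizes` (names chosen here for illustration), that balancing 'type' takes priority over 'month', and that ties are broken by row order. The helper `balanced_pick` is hypothetical, not a pandas function.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({
    "scenario": ["A"] * 4 + ["B"] * 7,
    "month": ["2023-01", "2023-02", "2023-04", "2023-04",
              "2023-01", "2023-02", "2023-03", "2023-03",
              "2023-03", "2023-03", "2023-04"],
    "id": ["A01", "A02", "A04", "A06",
           "B01", "B02", "B03", "B04", "B05", "B06", "B07"],
    "type": ["HR", "LR", "HR", "HR",
             "LR", "LR", "HR", "LR", "HR", "HR", "HR"],
})
sample_sizes = pd.DataFrame({"scenario": ["A", "B"], "sample_num": [2, 4]})


def balanced_pick(group: pd.DataFrame, n: int) -> pd.DataFrame:
    """Greedily pick n rows, keeping 'type' counts (first) and
    'month' counts (second) as even as possible."""
    remaining = list(group.index)
    type_counts, month_counts = Counter(), Counter()
    picked = []
    for _ in range(min(n, len(remaining))):
        # Choose the row whose type and month have been picked least so far.
        # min() returns the first minimal element, so ties follow row order.
        best = min(
            remaining,
            key=lambda i: (type_counts[group.at[i, "type"]],
                           month_counts[group.at[i, "month"]]),
        )
        picked.append(best)
        remaining.remove(best)
        type_counts[group.at[best, "type"]] += 1
        month_counts[group.at[best, "month"]] += 1
    return group.loc[sorted(picked)]


# Assumes every scenario in df also appears in sample_sizes.
n_by_scenario = sample_sizes.set_index("scenario")["sample_num"]
result = pd.concat(
    [balanced_pick(g, int(n_by_scenario[name]))
     for name, g in df.groupby("scenario")]
)
print(result)
```

On the example data this selects A01, A02, B01, B02, B03 and B07, which matches the desired result, but the greedy choice is deterministic rather than random and I am not sure it stays balanced for every possible input.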