3
import pandas as pd

df = pd.DataFrame({'a': [1,2,3,4,5,6,7],
                   'b': [1,1,1,0,0,0,0],
})

grouped = df.groupby('b')

Now I want to sample from each group, e.g., 30% from group b = 1 and 20% from group b = 0. How should I do that? And if I want 150% for some group, can I do that?

M-Chen-3
double
  • 3
    what do you mean by 20% and 30%? – U13-Forward Dec 21 '20 at 03:01
  • Do you mean you want to get a random 20% of the items from group 0 and 30% from group 1? You can do that but since your groups are so small, for this sample data it will only be one item from each group. – BrenBarn Dec 21 '20 at 04:02

2 Answers

2

You can return a random sample with a different percentage defined per group. This works for percentages below 100% (see example 1) and above 100% (see example 2, which passes replace=True):

  1. Using np.select, create a new column c that holds, for each row, the fraction (20%, 40%, etc.) to be sampled randomly from that row's group.
  2. From there, sample each group according to its value in c. Take the .index of the sampled rows and use .loc to filter the original dataframe for those rows and for columns 'a','b'. The code grouped.apply(lambda x: x['c'].sample(frac=x['c'].iloc[0])) creates a series containing the rows you are looking for, but it requires some cleanup. This is why I find it easier to grab the .index and filter the original dataframe with .loc than to clean up the series.

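import numpy as np
# Per-row sampling fraction: 40% for group b == 0, 20% for group b == 1.
df['c'] = np.select([df['b'].eq(0), df['b'].eq(1)], [0.4, 0.2])
grouped = df.groupby('b', group_keys=False)  # group after 'c' exists
# Sample each group by its own fraction, then select those rows from df by index.
df.loc[grouped.apply(lambda x: x['c'].sample(frac=x['c'].iloc[0])).index, ['a','b']]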
Out[1]: 
   a  b
6  7  0
8  9  0
3  4  1
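As an alternative sketch of the same idea (not the code from this answer), the per-group fractions can be kept in a plain dict and each group sampled in a loop, which avoids the helper column. The name `fracs` and the fixed `random_state` are assumptions made for illustration, and the dataframe is the one from the question.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
                   'b': [1, 1, 1, 0, 0, 0, 0]})

# Hypothetical mapping: group key -> fraction of that group to sample.
fracs = {1: 0.3, 0: 0.2}

# Sample every group with its own fraction, then stitch the pieces together.
parts = [
    group.sample(frac=fracs[key], random_state=0)  # random_state only for repeatability
    for key, group in df.groupby('b')
]
sampled = pd.concat(parts)
print(sampled)
```

With groups this small, each fraction rounds to a single row, so `sampled` holds one row per group.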

If you would like to return a larger random sample that contains duplicates of the existing values, simply pass replace=True. Then, do some cleanup to get the output.

v = df['b'].value_counts()
# sample's frac parameter may not accept frac > 1 (depending on the pandas version and on replace),
# so we calculate the integer number of rows to be sampled per group.
df['c'] = np.select([df['b'].eq(0), df['b'].eq(1)],
                    [int(v.loc[0] * 1.2), int(v.loc[1] * 2)])
grouped = df.groupby('b', group_keys=False)  # group after 'c' exists
(grouped.apply(lambda x: x['b'].sample(x['c'].iloc[0], replace=True))
        .reset_index()
        .rename({'index' : 'a'}, axis=1))
Out[2]: 
    a  b
0   7  0
1   8  0
2   9  0
3   7  0
4   7  0
5   8  0
6   1  1
7   3  1
8   3  1
9   1  1
10  0  1
11  0  1
12  4  1
13  2  1
14  3  1
15  0  1
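A minimal sketch of the same oversampling idea that returns the original rows directly, so no index cleanup is needed. It is an illustrative variant, not the answer's code: the `fracs` dict and `random_state` are assumptions, and the row counts are computed as integers in the same way as above.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
                   'b': [1, 1, 1, 0, 0, 0, 0]})

# Hypothetical mapping: group key -> oversampling factor (120% and 200%).
fracs = {0: 1.2, 1: 2.0}

parts = []
for key, group in df.groupby('b'):
    # Truncate to a whole number of rows: 4 * 1.2 -> 4 rows, 3 * 2.0 -> 6 rows.
    n = int(len(group) * fracs[key])
    # replace=True allows the same row to be drawn more than once.
    parts.append(group.sample(n=n, replace=True, random_state=0))

oversampled = pd.concat(parts).reset_index(drop=True)
print(oversampled)
```

Because whole rows are sampled, columns a and b stay aligned and duplicates appear as repeated rows.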
David Erickson
  • Not sure there is much reason to use `np.select`. You can just do `df.loc[df.b == 0, 'c'] = 0.3` and similar for the other group. No need to convert the fraction to integer as `sample` accepts a `frac` argument that can take a proportion. – BrenBarn Dec 21 '20 at 04:49
  • @BrenBarn `np.select` is because there are multiple conditions. `.loc` is one way but `np.select` is generally better for multiple conditions. You would have to write `.loc` on multiple lines of code. There is nothing wrong with that. It is just a different approach. I like your `.get_group(0)` code though. – David Erickson Dec 21 '20 at 04:52
  • @BrenBarn Nvmd, I see what you mean now with `frac`. Thanks. This doesn't work though if you specify a `frac` > 1. I think this now also makes the `np.select` a little bit cleaner than `.loc` in my opinion. – David Erickson Dec 21 '20 at 05:15
  • What does `.reset_index().rename({'index' : 'a'}, axis=1)` do? @DavidErickson – double Dec 21 '20 at 06:07
  • @user14862671 that cleans up the dataframe. `.reset_index()` brings in column `a` to the dataframe, but it is called `index`, so you also have to use `rename`. Passing `axis=1` means you are making name changes on columns rather than rows. – David Erickson Dec 21 '20 at 06:10
  • The code works! A quick follow-up: I got a new column 'level_1' generated after I applied it to my dataframe. Any idea what this means? @DavidErickson – double Dec 21 '20 at 06:57
  • 1
    @user14862671 that is what happens when you use `reset_index()`. Try: `.rename({'level_1' : 'a'}, axis=1)` instead of `.rename({'index' : 'a'}, axis=1)` – David Erickson Dec 21 '20 at 07:14
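The `index` versus `level_1` column names discussed in the comments above can be reproduced with a small sketch. The series values and the name `val` are made up for illustration; the point is only how `reset_index()` names columns that come from unnamed index levels.

```python
import pandas as pd

# A Series with a single, unnamed index: reset_index() calls the new column 'index'.
s = pd.Series([10, 20, 30], name='val')
flat = s.reset_index()
print(list(flat.columns))

# A Series with a two-level index whose second level is unnamed:
# reset_index() calls that level 'level_1' (its position in the MultiIndex).
idx = pd.MultiIndex.from_tuples([(0, 5), (0, 6), (1, 2)], names=['b', None])
s2 = pd.Series([10, 20, 30], index=idx, name='val')
multi = s2.reset_index()
print(list(multi.columns))
```

So a `level_1` column suggests that the sampled result carried a two-level index, for example the group key plus the original row label, and the `rename` key has to match that name.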
1

You can get a DataFrame from the GroupBy object with, e.g. grouped.get_group(0). If you want to sample from that you can use the .sample method. For instance grouped.get_group(0).sample(frac=0.2) gives:

   a
5  6

For the example you give, both samples will only give one element, because the groups have 4 and 3 elements, and 0.2*4 = 0.8 and 0.3*3 = 0.9 both round to 1.
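The rounding described above can be checked with a short sketch on the question's dataframe. The variable names `n0` and `n1` are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
                   'b': [1, 1, 1, 0, 0, 0, 0]})
grouped = df.groupby('b')

# Group 0 has 4 rows: 0.2 * 4 = 0.8, which rounds to 1 row.
n0 = len(grouped.get_group(0).sample(frac=0.2))
# Group 1 has 3 rows: 0.3 * 3 = 0.9, which also rounds to 1 row.
n1 = len(grouped.get_group(1).sample(frac=0.3))
print(n0, n1)
```

Which row is picked varies from run to run, but the number of rows per group does not.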

BrenBarn