
I know this must have been answered somewhere, but I just could not find it.

Problem: sample from each group after a groupby operation.

import pandas as pd

df = pd.DataFrame({'a': [1,2,3,4,5,6,7],
                   'b': [1,1,1,0,0,0,0]})

grouped = df.groupby('b')

# now sample from each group, e.g., I want 30% of each group
cs95
gongzhitaao
  • From pandas 1.1, you can just do `df.groupby('b').sample()`. [Relevant docs](https://pandas.pydata.org/docs/dev/reference/api/pandas.core.groupby.DataFrameGroupBy.sample.html) – cs95 Jul 29 '20 at 10:27

2 Answers


Apply a lambda that calls `sample` with the `frac` parameter:

In [2]:
df = pd.DataFrame({'a': [1,2,3,4,5,6,7],
                   'b': [1,1,1,0,0,0,0]})

grouped = df.groupby('b')
grouped.apply(lambda x: x.sample(frac=0.3))

Out[2]:
     a  b
b        
0 6  7  0
1 2  3  1
EdChum

pandas >= 1.1: GroupBy.sample

This works like magic:

# np.random.seed(0)
df.groupby('b').sample(frac=.3) 

   a  b
5  6  0
0  1  1
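
For a reproducible variant of the call above, here is a minimal sketch that passes `random_state` and also shows the `n` argument for a fixed number of rows per group. It assumes pandas 1.1 or later; the variable names `by_frac` and `by_count` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
                   'b': [1, 1, 1, 0, 0, 0, 0]})

# Fraction per group: with frac=0.3, the 3-row and 4-row groups
# each contribute one row (0.9 and 1.2 both round to 1).
by_frac = df.groupby('b').sample(frac=0.3, random_state=0)

# Fixed count per group: n rows are drawn from every group.
by_count = df.groupby('b').sample(n=2, random_state=0)

print(by_frac)
print(by_count)
```

Fixing `random_state` makes repeated runs return the same rows, which helps when the sample feeds a test or a train/validation split.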

pandas <= 1.0.X

You can use GroupBy.apply with sample. You do not need to use a lambda; apply accepts keyword arguments:

df.groupby('b', group_keys=False).apply(pd.DataFrame.sample, frac=.3)

   a  b
5  6  0
0  1  1
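
The same keyword-forwarding idea also works for `random_state`, since `apply` passes extra keyword arguments through to the function. A small sketch, assuming only that reproducibility is wanted; the name `sampled` is illustrative:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
                   'b': [1, 1, 1, 0, 0, 0, 0]})

# group_keys=False keeps the original row labels instead of adding
# the group key as an extra index level.
sampled = (
    df.groupby('b', group_keys=False)
      .apply(pd.DataFrame.sample, frac=0.3, random_state=0)
)

# Look up the group of each sampled row via the original frame.
print(df.loc[sampled.index])
```

Note that the same seed is applied to every group here, which is usually fine for a quick sample but is worth knowing if the groups must be sampled independently.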
cs95
  • `df.sample(frac=1).groupby('b').head(2)` is not the same. `sample` draws rows uniformly, whereas this one takes the first rows of each group. Which one to use depends on the task, but the `head` version depends on the sort order, whereas `sample` does not. – melgor89 Jul 17 '20 at 06:47
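
To try the two idioms from this comment side by side, here is a rough sketch. It assumes pandas 1.1 or later for `GroupBy.sample`; the names `shuffled_head` and `group_sample` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7],
                   'b': [1, 1, 1, 0, 0, 0, 0]})

# Shuffle the whole frame, then keep the first 2 rows of each group
# in that shuffled order.
shuffled_head = df.sample(frac=1, random_state=0).groupby('b').head(2)

# Draw 2 rows directly from each group.
group_sample = df.groupby('b').sample(n=2, random_state=0)

print(shuffled_head['b'].value_counts().to_dict())
print(group_sample['b'].value_counts().to_dict())
```

Both return a fixed number of rows per group, so neither is a drop-in replacement for `frac=0.3`, which scales with group size.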