Select sample random groups after groupby in pandas?

Question

I have a very large DataFrame that looks like this example df:

df = 

col1    col2     col3 
apple   red      2.99 
apple   red      2.99 
apple   red      1.99 
apple   pink     1.99 
apple   pink     1.99 
apple   pink     2.99 
...     ....      ...
pear    green     .99 
pear    green     .99 
pear    green    1.29

I am grouping by 2 columns like this:

g = df.groupby(['col1', 'col2'])

Now I want to select say 3 random groups. So my expected output is this:

col1    col2     col3 
apple   red      2.99 
apple   red      2.99 
apple   red      1.99 
pear    green     .99 
pear    green     .99 
pear    green    1.29
lemon   yellow    .99 
lemon   yellow    .99 
lemon   yellow   1.99

(Let's pretend the three groups above were picked at random from df.) How can I achieve this? I have tried this, but it did not help in my case.

You want only 3 groups, or only 3 items per group? Or both? – cs95 Apr 24 '18 at 15:12

score 12 · Accepted Answer · answered Apr 24 '18 at 15:13

You can do this with shuffle and ngroup:

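import numpy as np

g = df.groupby(['col1', 'col2'])

# Shuffle the group numbers 0 .. ngroups - 1
a = np.arange(g.ngroups)
np.random.shuffle(a)

# Keep the rows whose group number is among the first 2 shuffled values
df[g.ngroup().isin(a[:2])]  # change 2 to the number of groups you need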

BENY

When I used this, I got the error "TypeError: unsupported operand type(s) for -: 'dict' and 'int'". Do you know why? – Hana Apr 24 '18 at 15:37
@Hana In your code, change `a = np.arange(g.groups)` to `a = np.arange(g.ngroups)`. – BENY Apr 24 '18 at 15:41
The group sampling could be done more succinctly (without shuffling a full list) by using `numpy.random.choice` – ie. `df[g.ngroup().isin(choice(g.ngroups, 2, replace=False)]`. – jstol Jun 10 '20 at 17:15
@jstol, I like your solution, it's just missing a closing parenthesis: `df[g.ngroup().isin(choice(g.ngroups, 2, replace=False))]` and for those (like me that didn't spot it initially) that `import numpy as np`, the line should read `df[g.ngroup().isin(np.random.choice(g.ngroups, 2, replace=False))]` – s_pike Jan 06 '21 at 12:36
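
The `ngroup` approach and the `np.random.choice` variant from the comments can be combined into a small self-contained sketch. The helper name `sample_groups`, its `seed` parameter, and the sample data (including the lemon rows) are assumptions made for illustration; it uses NumPy's `default_rng` generator rather than the global `np.random` functions so the draw can be made repeatable.

```python
import numpy as np
import pandas as pd


def sample_groups(df, keys, n, seed=None):
    """Return all rows of n randomly chosen groups (hypothetical helper)."""
    g = df.groupby(keys)
    rng = np.random.default_rng(seed)
    # Draw n distinct group numbers from 0 .. ngroups - 1.
    picked = rng.choice(g.ngroups, size=n, replace=False)
    # ngroup() labels every row with its group number; keep the picked ones.
    return df[g.ngroup().isin(picked)]


# Illustrative data modelled on the question: 4 groups of 3 rows each.
df = pd.DataFrame({
    "col1": ["apple"] * 6 + ["pear"] * 3 + ["lemon"] * 3,
    "col2": ["red"] * 3 + ["pink"] * 3 + ["green"] * 3 + ["yellow"] * 3,
    "col3": [2.99, 2.99, 1.99, 1.99, 1.99, 2.99,
             0.99, 0.99, 1.29, 0.99, 0.99, 1.99],
})

result = sample_groups(df, ["col1", "col2"], n=3, seed=0)
print(result)
```

The returned rows keep their original index and order, since the selection is a boolean mask over `df`.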

cs95 · Answer 2 · 2018-04-24T15:20:47.627

Shuffle your dataframe using sample, and then perform a non-sorting groupby:

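# Shuffle the rows
df = df.sample(frac=1)
# sort=False keeps the groups in shuffled order, so the first 3 are a random pick
df2 = pd.concat(
    [g for _, g in df.groupby(['col1', 'col2'], sort=False, as_index=False)][:3],
    ignore_index=True
)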

If you also need only the first 3 rows of each group, use head(3) on each group:

df2 = pd.concat(
    [g.head(3) for _, g in df.groupby(['col1', 'col2'], sort=False, as_index=False)][:3],
    ignore_index=True 
)
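
As a runnable sketch of the shuffle-then-group idea, the snippet below uses made-up data modelled on the question and a fixed `random_state` (an assumption, added only so the run is repeatable). One thing worth checking for your use case: because group order follows the shuffled rows, groups with more rows tend to appear earlier, so the pick is not uniform over groups when group sizes differ.

```python
import pandas as pd

# Illustrative data: 4 groups of 3 rows each.
df = pd.DataFrame({
    "col1": ["apple"] * 6 + ["pear"] * 3 + ["lemon"] * 3,
    "col2": ["red"] * 3 + ["pink"] * 3 + ["green"] * 3 + ["yellow"] * 3,
    "col3": [2.99, 2.99, 1.99, 1.99, 1.99, 2.99,
             0.99, 0.99, 1.29, 0.99, 0.99, 1.99],
})
keys = ["col1", "col2"]

# Shuffle the rows; random_state is only here to make the sketch repeatable.
shuffled = df.sample(frac=1, random_state=0)

# sort=False yields groups in order of first appearance in the shuffled frame.
groups = [grp for _, grp in shuffled.groupby(keys, sort=False)]

# Keep the first 3 groups and glue them back into one frame.
picked = pd.concat(groups[:3], ignore_index=True)
print(picked)
```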

score 2 · Answer 3 · answered Sep 20 '18 at 14:47

In cases where you need to do this type of sampling in only one column, this is also an alternative:

df.loc[df['col1'].isin(pd.Series(df['col1'].unique()).sample(2))]

A longer example:

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'col1':['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
                      'col2': np.random.randint(5, size=9),
                      'col3': np.random.randint(5, size=9)
                     })
>>> df
  col1  col2  col3
0    a     4     3
1    a     3     0
2    a     4     0
3    b     4     4
4    b     4     1
5    b     1     3
6    c     4     4
7    c     3     2
8    c     3     1
>>> sample = pd.Series(df['col1'].unique()).sample(2)
>>> sample
1    b
2    c
dtype: object
>>> df.loc[df['col1'].isin(sample)]
  col1  col2  col3
3    b     4     4
4    b     4     1
5    b     1     3
6    c     4     4
7    c     3     2
8    c     3     1

score 1 · Answer 4 · answered Apr 24 '18 at 15:01

This is one way:

from io import StringIO
import pandas as pd
import numpy as np

np.random.seed(100)

data = """
col1    col2     col3
apple   red      2.99
apple   red      2.99
apple   red      1.99
apple   pink     1.99
apple   pink     1.99
apple   pink     2.99
pear    green     .99
pear    green     .99
pear    green    1.29
"""
# Number of groups
K = 2

df = pd.read_table(StringIO(data), sep=' ', skip_blank_lines=True, skipinitialspace=True)
# Use columns as indices
df2 = df.set_index(['col1', 'col2'])
# Choose random sample of indices
idx = np.random.choice(df2.index.unique(), K, replace=False)
# Select
selection = df2.loc[idx].reset_index(drop=False)
print(selection)

Output:

    col1   col2  col3
0  apple   pink  1.99
1  apple   pink  1.99
2  apple   pink  2.99
3   pear  green  0.99
4   pear  green  0.99
5   pear  green  1.29

score 1 · Answer 5 · answered Jul 15 '21 at 05:02

A simple solution in the spirit of this answer:

n_groups = 2    
df.merge(df[['col1','col2']].drop_duplicates().sample(n=n_groups))
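
A self-contained sketch of the merge approach, assuming illustrative data modelled on the question and a fixed `random_state` for repeatability (both are assumptions, not part of the answer above):

```python
import pandas as pd

# Illustrative data: 4 groups of 3 rows each.
df = pd.DataFrame({
    "col1": ["apple"] * 6 + ["pear"] * 3 + ["lemon"] * 3,
    "col2": ["red"] * 3 + ["pink"] * 3 + ["green"] * 3 + ["yellow"] * 3,
    "col3": [2.99, 2.99, 1.99, 1.99, 1.99, 2.99,
             0.99, 0.99, 1.29, 0.99, 0.99, 1.99],
})
keys = ["col1", "col2"]
n_groups = 2

# One row per distinct (col1, col2) pair, then a random subset of those pairs.
picked_keys = df[keys].drop_duplicates().sample(n=n_groups, random_state=0)

# An inner merge on the key columns keeps only rows from the picked groups.
result = df.merge(picked_keys, on=keys)
print(result)
```

Note that `merge` returns a frame with a fresh integer index, so the original row labels are not preserved.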

itamar kanter

score 0 · Answer 6 · answered Feb 17 '21 at 10:06

I turned @Arvid Baarnhielm's answer into a simple function:

def sampleCluster(df:pd.DataFrame, columnCluster:str, fraction) -> pd.DataFrame:
    return df.loc[df[columnCluster].isin(pd.Series(df[columnCluster].unique()).sample(frac=fraction))]
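
A short usage sketch for the function, with made-up data. Note that `fraction` applies to the distinct values of the column, not to the rows: with four distinct values and `fraction=0.5`, two whole clusters are kept.

```python
import pandas as pd


def sampleCluster(df: pd.DataFrame, columnCluster: str, fraction) -> pd.DataFrame:
    # Sample a fraction of the distinct values, then keep every row that has one.
    unique_values = pd.Series(df[columnCluster].unique())
    return df.loc[df[columnCluster].isin(unique_values.sample(frac=fraction))]


# Illustrative data: 4 clusters ('a'..'d') of 2 rows each.
df = pd.DataFrame({"col1": list("aabbccdd"), "col3": range(8)})

half = sampleCluster(df, "col1", 0.5)
print(half)
```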

Danferno
