5

I have a very large DataFrame that looks like this example df:

df = 

col1    col2     col3 
apple   red      2.99 
apple   red      2.99 
apple   red      1.99 
apple   pink     1.99 
apple   pink     1.99 
apple   pink     2.99 
...     ....      ...
pear    green     .99 
pear    green     .99 
pear    green    1.29

I am grouping by 2 columns like this:

g = df.groupby(['col1', 'col2'])

Now I want to select, say, 3 random groups. My expected output is this:

col1    col2     col3 
apple   red      2.99 
apple   red      2.99 
apple   red      1.99 
pear    green     .99 
pear    green     .99 
pear    green    1.29
lemon   yellow    .99 
lemon   yellow    .99 
lemon   yellow   1.99 

(Let's pretend the three groups above were picked at random from df.) How can I achieve this? I have tried this, but it did not help in my case.

Ishara Madhawa
Hana

6 Answers

12

You can do this with shuffle and ngroup:

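import numpy as np

g = df.groupby(['col1', 'col2'])

# Shuffle the group numbers 0 .. ngroups-1
a = np.arange(g.ngroups)
np.random.shuffle(a)

# Keep the rows whose group number is among the first 2 shuffled ones
df[g.ngroup().isin(a[:2])]  # change 2 to the number of groups you need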
BENY
  • When I used this, I got the error "TypeError: unsupported operand type(s) for -: 'dict' and 'int'". Do you know why? – Hana Apr 24 '18 at 15:37
  • @Hana, change `a = np.arange(g.groups)` to `a = np.arange(g.ngroups)`. – BENY Apr 24 '18 at 15:41
  • The group sampling could be done more succinctly (without shuffling a full list) by using `numpy.random.choice`, i.e. `df[g.ngroup().isin(choice(g.ngroups, 2, replace=False)]`. – jstol Jun 10 '20 at 17:15
  • @jstol, I like your solution; it's just missing a closing parenthesis: `df[g.ngroup().isin(choice(g.ngroups, 2, replace=False))]`. For those who `import numpy as np` (like me, who didn't spot it initially), the line should read `df[g.ngroup().isin(np.random.choice(g.ngroups, 2, replace=False))]`. – s_pike Jan 06 '21 at 12:36
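
For anyone who wants to try the ngroup idea end to end, here is a minimal self-contained sketch. The sample frame is typed in from the question, while the name `n_groups` and the seeded generator are assumptions added only so the run is repeatable; it uses NumPy's `Generator.choice` rather than the `np.random.choice` call quoted in the comments.

```python
import numpy as np
import pandas as pd

# Small frame modelled on the question's example (four groups of three rows)
df = pd.DataFrame({
    "col1": ["apple"] * 6 + ["pear"] * 3 + ["lemon"] * 3,
    "col2": ["red"] * 3 + ["pink"] * 3 + ["green"] * 3 + ["yellow"] * 3,
    "col3": [2.99, 2.99, 1.99, 1.99, 1.99, 2.99,
             0.99, 0.99, 1.29, 0.99, 0.99, 1.99],
})

n_groups = 3                    # hypothetical name: how many groups to keep
rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

g = df.groupby(["col1", "col2"])
# Draw n_groups distinct group numbers from 0 .. g.ngroups - 1
picked = rng.choice(g.ngroups, size=n_groups, replace=False)
# ngroup() labels every row with its group number; keep rows of picked groups
sampled = df[g.ngroup().isin(picked)]
print(sampled)
```

Because `replace=False` draws distinct group numbers, each selected group comes back whole and in its original row order.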
4

Shuffle your dataframe using sample, and then perform a non-sorting groupby:

df = df.sample(frac=1)
df2 = pd.concat(
    [g for _, g in df.groupby(['col1', 'col2'], sort=False, as_index=False)][:3],
    ignore_index=True 
)  

If you only need the first 3 rows of each selected group, call head(3) on each group:

df2 = pd.concat(
    [g.head(3) for _, g in df.groupby(['col1', 'col2'], sort=False, as_index=False)][:3],
    ignore_index=True 
)     
cs95
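
One thing to keep in mind with the shuffle-then-groupby approach: the group that shows up first in a shuffled frame is more likely to be a large one, so the pick is weighted by group size. If every group should be equally likely, a possible variant (a sketch, not part of the answer above) is to sample the group keys themselves and pull each group with get_group. The sample frame and the seed here are assumptions for illustration.

```python
import random
import pandas as pd

# Small frame modelled on the question's example (four groups of three rows)
df = pd.DataFrame({
    "col1": ["apple"] * 6 + ["pear"] * 3 + ["lemon"] * 3,
    "col2": ["red"] * 3 + ["pink"] * 3 + ["green"] * 3 + ["yellow"] * 3,
    "col3": [2.99, 2.99, 1.99, 1.99, 1.99, 2.99,
             0.99, 0.99, 1.29, 0.99, 0.99, 1.99],
})

g = df.groupby(["col1", "col2"], sort=False)
keys = list(g.groups)            # one (col1, col2) tuple per group

random.seed(0)                   # seeded only to make the sketch repeatable
chosen = random.sample(keys, 3)  # 3 distinct keys, each equally likely

# Stitch the chosen groups back together into one frame
df2 = pd.concat([g.get_group(k) for k in chosen], ignore_index=True)
print(df2)
```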
2

If you only need to sample groups defined by a single column, this is an alternative:

df.loc[df['col1'].isin(pd.Series(df['col1'].unique()).sample(2))]

A longer example:

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'col1':['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
                      'col2': np.random.randint(5, size=9),
                      'col3': np.random.randint(5, size=9)
                     })
>>> df
  col1  col2  col3
0    a     4     3
1    a     3     0
2    a     4     0
3    b     4     4
4    b     4     1
5    b     1     3
6    c     4     4
7    c     3     2
8    c     3     1
>>> sample = pd.Series(df['col1'].unique()).sample(2)
>>> sample
1    b
2    c
dtype: object
>>> df.loc[df['col1'].isin(sample)]
  col1  col2  col3
3    b     4     4
4    b     4     1
5    b     1     3
6    c     4     4
7    c     3     2
8    c     3     1
1

This is one way:

from io import StringIO
import pandas as pd
import numpy as np

np.random.seed(100)

data = """
col1    col2     col3
apple   red      2.99
apple   red      2.99
apple   red      1.99
apple   pink     1.99
apple   pink     1.99
apple   pink     2.99
pear    green     .99
pear    green     .99
pear    green    1.29
"""
# Number of groups
K = 2

df = pd.read_table(StringIO(data), sep=' ', skip_blank_lines=True, skipinitialspace=True)
# Use columns as indices
df2 = df.set_index(['col1', 'col2'])
# Choose random sample of indices
idx = np.random.choice(df2.index.unique(), K, replace=False)
# Select
selection = df2.loc[idx].reset_index(drop=False)
print(selection)

Output:

    col1   col2  col3
0  apple   pink  1.99
1  apple   pink  1.99
2  apple   pink  2.99
3   pear  green  0.99
4   pear  green  0.99
5   pear  green  1.29
jdehesa
1

A simple solution in the spirit of this answer:

n_groups = 2    
df.merge(df[['col1','col2']].drop_duplicates().sample(n=n_groups))
itamar kanter
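
To see why the merge works, here is a small runnable sketch. The sample frame, the intermediate name `keys` and the `random_state` value are assumptions added for illustration. With no `on` argument, merge joins on the columns the two frames share (col1 and col2 here) and defaults to an inner join, so only rows from the sampled groups survive.

```python
import pandas as pd

# Small frame modelled on the question's example (four groups of three rows)
df = pd.DataFrame({
    "col1": ["apple"] * 6 + ["pear"] * 3 + ["lemon"] * 3,
    "col2": ["red"] * 3 + ["pink"] * 3 + ["green"] * 3 + ["yellow"] * 3,
    "col3": [2.99, 2.99, 1.99, 1.99, 1.99, 2.99,
             0.99, 0.99, 1.29, 0.99, 0.99, 1.99],
})

n_groups = 2
# One row per distinct (col1, col2) pair, then a random subset of those pairs
keys = df[["col1", "col2"]].drop_duplicates().sample(n=n_groups, random_state=0)
# Inner join on the shared columns keeps only rows of the sampled groups
sampled = df.merge(keys)
print(sampled)
```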
0

I turned @Arvid Baarnhielm's answer into a simple function:

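import pandas as pd

def sampleCluster(df: pd.DataFrame, columnCluster: str, fraction: float) -> pd.DataFrame:
    # Keep all rows whose columnCluster value falls in a random fraction of its unique values
    return df.loc[df[columnCluster].isin(pd.Series(df[columnCluster].unique()).sample(frac=fraction))]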
Danferno
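
A quick usage sketch of the function, with a made-up frame of four clusters (the data and the name `half` are assumptions for illustration). The function is repeated here only so the snippet runs on its own.

```python
import pandas as pd

def sampleCluster(df: pd.DataFrame, columnCluster: str, fraction: float) -> pd.DataFrame:
    # Keep all rows whose columnCluster value falls in a random fraction of its unique values
    return df.loc[df[columnCluster].isin(pd.Series(df[columnCluster].unique()).sample(frac=fraction))]

# Four clusters (a, b, c, d) with three rows each
df = pd.DataFrame({"col1": list("aaabbbcccddd"), "col3": range(12)})

# fraction=0.5 of 4 unique values keeps 2 whole clusters, i.e. 6 rows
half = sampleCluster(df, "col1", 0.5)
print(half)
```

Note that the fraction applies to the number of distinct clusters, not to the number of rows.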