
It's similar to this question, but with an additional level of complexity.
In my case, I have the following dataframe:

import pandas as pd
df = pd.DataFrame({'col1': list('aaabbbabababbaaa'), 'col2': list('cdddccdsssssddcd'), 'val': range(0, 16)})

output:

   col1 col2  val
0     a    c    0
1     a    d    1
2     a    d    2
3     b    d    3
4     b    c    4
5     b    c    5
6     a    d    6
7     b    s    7
8     a    s    8
9     b    s    9
10    a    s   10
11    b    s   11
12    b    d   12
13    a    d   13
14    a    c   14
15    a    d   15

My goal is to randomly select one group from groupby(['col1', 'col2']) for each value of col1, so that each value of col1 appears in exactly one selected group. This can be done with the following code:

import numpy as np

g = df.groupby('col1')
indexes = []
for _, group in g:
    # within each col1 group, shuffle the col2 sub-groups and keep one at random
    g_ = group.groupby('col2')
    a = np.arange(g_.ngroups)
    np.random.shuffle(a)
    indexes.extend(group[g_.ngroup().isin(a[:1])].index.tolist())

output:

print(df[df.index.isin(indexes)])
   col1 col2  val
4     b    c    4
5     b    c    5
8     a    s    8
10    a    s   10

However, I'm looking for a more concise and pythonic way to solve this.

itamar kanter

1 Answer

Another option is to shuffle your two columns with sample(frac=1) and drop_duplicates by col1, so that you keep only one (col1, col2) pair per col1 value. Then merge the result back to df to select all the rows with these pairs (with no `on` argument, merge performs an inner join on the shared columns, here col1 and col2).

print(df.merge(df[['col1','col2']].sample(frac=1).drop_duplicates('col1')))
  col1 col2  val
0    b    s    7
1    b    s    9
2    b    s   11
3    a    s    8
4    a    s   10

Or, with groupby and sample, much the same idea, but selecting only one row per col1 value directly and merging after:

df.merge(df[['col1','col2']].groupby('col1').sample(n=1))
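Note that GroupBy.sample requires pandas 1.1 or newer. For reproducibility while testing, both approaches accept a seed via `random_state`; a minimal sketch (the value 42 is an arbitrary choice):

# fix the seed so repeated runs select the same groups
df.merge(df[['col1','col2']].groupby('col1').sample(n=1, random_state=42))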

EDIT: to get both the selected rows and the remaining rows, you can use the indicator parameter of merge and do a left merge, then query each subset separately:

m = df.merge(df[['col1','col2']].groupby('col1').sample(1), how='left', indicator=True)
print(m)
# rows whose (col1, col2) pair was sampled
select_ = m.query('_merge=="both"')[df.columns]
print(select_)
# the complementary rows
comp_ = m.query('_merge=="left_only"')[df.columns]
print(comp_)
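Equivalently, the split can be done with a single boolean mask instead of two query calls; a small sketch using the same m as above:

# True for rows whose (col1, col2) pair was among the sampled pairs
mask = m['_merge'] == 'both'
select_ = m.loc[mask, df.columns]
comp_ = m.loc[~mask, df.columns]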
Ben.T
  • Why do you need `sample(frac=1)`? I think `df.merge(df[['col1','col2']].drop_duplicates('col1'))` will also work – itamar kanter Jul 14 '21 at 15:29
  • @itamarkanter using `frac=1` shuffles all the rows: in the first case, the condition of keeping only one row per value in col1 is applied afterwards with drop_duplicates, so you need to shuffle all the rows first; while in the second method, since you groupby col1, you can directly keep only one row per col1 value – Ben.T Jul 14 '21 at 15:32
  • I like the use of `df.merge`. Original thinking! – itamar kanter Jul 14 '21 at 15:33
  • @itamarkanter `df.merge(df[['col1','col2']].drop_duplicates('col1'))` will work to select only one group per value of col1, but it is not a random choice, as you will always keep the first row for each value in col1 – Ben.T Jul 14 '21 at 15:34
  • I didn't mention this in the question but actually, I'm also interested in the complementary _dataframe_, any idea how can I do it? – itamar kanter Jul 14 '21 at 15:42