It's similar to this question, but with an additional level of complexity.
In my case, I have the following dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame({'col1': list('aaabbbabababbaaa'), 'col2': list('cdddccdsssssddcd'), 'val': range(0, 16)})
output:
   col1 col2  val
0     a    c    0
1     a    d    1
2     a    d    2
3     b    d    3
4     b    c    4
5     b    c    5
6     a    d    6
7     b    s    7
8     a    s    8
9     b    s    9
10    a    s   10
11    b    s   11
12    b    d   12
13    a    d   13
14    a    c   14
15    a    d   15
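My goal is to select random groups from groupby(['col1', 'col2']) such that each value of col1 is selected only once. In other words, for each value of col1 I want to pick one of its col2 groups at random and keep all rows of that group.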
This can be executed by the following code:
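g = df.groupby('col1')
indexes = []
for _, group in g:
    # Sub-group the rows of this col1 value by col2
    g_ = group.groupby('col2')
    # Shuffle the sub-group numbers and keep the rows of the first one
    a = np.arange(g_.ngroups)
    np.random.shuffle(a)
    indexes.extend(group[g_.ngroup().isin(a[:1])].index.tolist())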
print(df[df.index.isin(indexes)])
output (one possible result, since the selection is random):
   col1 col2  val
4     b    c    4
5     b    c    5
8     a    s    8
10    a    s   10
However, I'm looking for a more concise and pythonic way to solve this.
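For reference, here is a rough sketch of the kind of loop-free version I have in mind. It is only one possible approach, not a confirmed solution: the function name pick_one_group_per_col1 and its random_state parameter are names I made up for illustration, and it assumes pandas >= 1.1 so that groupby(...).sample is available. The idea is to reduce the frame to its unique (col1, col2) pairs, draw one pair per col1 value, and keep the rows that match a drawn pair.

```python
import pandas as pd


def pick_one_group_per_col1(df, random_state=None):
    """Keep all rows of one randomly chosen col2 group for each col1 value.

    Hypothetical helper, written only to illustrate the idea.
    """
    # Unique (col1, col2) pairs: one row per group of groupby(['col1', 'col2']).
    pairs = df[['col1', 'col2']].drop_duplicates()
    # Draw one pair per col1 value (assumes pandas >= 1.1 for groupby.sample).
    chosen = pairs.groupby('col1').sample(n=1, random_state=random_state)
    # Boolean mask of the rows whose (col1, col2) pair was drawn.
    keys = pd.MultiIndex.from_frame(df[['col1', 'col2']])
    mask = keys.isin(list(chosen.itertuples(index=False, name=None)))
    return df[mask]


df = pd.DataFrame({'col1': list('aaabbbabababbaaa'),
                   'col2': list('cdddccdsssssddcd'),
                   'val': range(0, 16)})

# random_state is fixed here only to make the draw repeatable.
result = pick_one_group_per_col1(df, random_state=0)
print(result)
```

Because the draw is made over unique pairs, each col2 group of a given col1 value should be equally likely regardless of how many rows it has, which matches the shuffle in the loop above. The original index is preserved, since the rows are selected with a boolean mask rather than a merge.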