23

In the following snippet, data is a pandas.DataFrame and indices is a set of columns of the data. After grouping the data with groupby, I am interested in the ids of the groups, but only those with a size greater than a threshold (say: 3).

group_ids=data.groupby(list(data.columns[list(indices)])).grouper.group_info[0]

Now, how can I find which group has a size greater than or equal 3 knowing the id of the group? I only want ids of groups with a certain size.

#TODO: filter out ids from group_ids which correspond to groups with sizes < 3 
cottontail
  • 10,268
  • 18
  • 50
  • 51
piokuc
  • 25,594
  • 11
  • 72
  • 102

2 Answers2

40

One way is to use the size method of the groupby:

g = data.groupby(...)
size = g.size()
size[size > 3]

For example, here there is only one group of size > 1:

In [11]: df = pd.DataFrame([[1, 2], [3, 4], [1,6]], columns=['A', 'B'])

In [12]: df
Out[12]:
   A  B
0  1  2
1  3  4
2  1  6 

In [13]: g = df.groupby('A')

In [14]: size = g.size()

In [15]: size[size > 1]
Out[15]:
A
1    2
dtype: int64

If you were interested in just restricting the DataFrame to those in large groups you could use the filter method:

In [21]: g.filter(lambda x: len(x) > 1)
Out[21]:
   A  B
0  1  2
2  1  6
Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
0

You can also call value_counts() here to get size of groups.

df = pd.DataFrame({'L1': list('ZXYXYXY'), 'L2': [1, 0, 1, 0, 0, 0, 1]})


   L1  L2
0   Z   1
1   X   0
2   Y   1
3   X   0
4   Y   0
5   X   0
6   Y   1

The basic idea is to get the size of each group and filter the groupers (grp below) that correspond are at least size 3.

grp = 'L1'
size = df.value_counts(grp)
size.index[size>=3]             # Index(['X', 'Y'], dtype='object', name='L1')

If we want to use the group_ids, then numpy.unique() could be useful. The basic idea is to count the unique grouper ids and filter the ones that have at least 3 values. This will give the grouper ids that are at least of size 3.

If we want to look at the group keys that correspond to these values, we can use these indices to filter the group_keys_seq attribute, which is equal to the index filtered using value_counts() above.1

grp = 'L1'
g = df.groupby(grp).grouper
# count unique grouper ids
u, c = np.unique(g.group_info[0], return_counts=True)
idx = u[c >= 3]                 # array([0, 1], dtype=int64)
g.group_keys_seq[idx]           # Index(['X', 'Y'], dtype='object', name='L1')

1 If the groupby is done on multiple columns, then group_keys_seq returns a list of tuples and it can't be indexed like g.group_keys_seq[idx]. In that case, use pd.MultiIndex.from_tuples(g.group_keys_seq)[idx] instead.

cottontail
  • 10,268
  • 18
  • 50
  • 51