3

I'd like to find, not the single largest index, but the n largest indexes for a pandas groupby.agg operation.

My data frame looks like so:

>>> dat = [{"date": datetime.date(2021,1,1), "category": "cats", "var": 1},
       {"date": datetime.date(2021,1,1), "category": "dogs", "var": 0},
       {"date": datetime.date(2021,2,1), "category": "cats", "var": -1},
       {"date": datetime.date(2021,2,1), "category": "dogs", "var": 3},
       ]
>>> df = pd.DataFrame(dat)

>>> df.groupby(["date", "category"]).agg({'var': ['sum']})
                    var
                    sum
date       category    
2021-01-01 cats       1
           dogs       0
2021-02-01 cats      -1
           dogs       3

Get the index where the value of var is largest:

>>> df.groupby(["date", "category"]).agg({'var': ['sum']}).idxmax()

var  sum    (2021-02-01, dogs)
dtype: object

Confirmed, that is correct. I can see that the index, (2021-02-01, dogs) where the value of var is a a max.

So now, I'd like to see not just the largest index, but the nlargest indexes. Here, n=2.

However, the obvious solution to this, nlargest(2, variable_name) seems to fail:

df.groupby(["date", "category"]).agg({'var': ['sum']}).nlargest(2, 'sum')

fails, in what seems to be a multindex failure:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'sum'

How can I get the idxmax for the n largest indexes? I assume there is a multiindex I can pass to nlargest or a level, but I don't see it.

Mittenchops
  • 18,633
  • 33
  • 128
  • 246

2 Answers2

4

How about:

(df.groupby(['date','category'])['var'].sum()
   .nlargest(2).index
)

Output:

MultiIndex([('2021-1-1', 'dogs'),
            ('2021-1-1', 'cats')],
           names=['date', 'category'])
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
3

Following your example:

x = (
    df.groupby(["date", "category"])
    .agg({"var": ["sum"]})
    .nlargest(2, ("var", "sum"))
    .index
)
print(x)

Prints:

MultiIndex([(2021-02-01, 'dogs'),
            (2021-01-01, 'cats')],
           names=['date', 'category'])

But you can also do:

x = (
    df.groupby(["date", "category"])
    .agg({"var": "sum"})             # <--- omit the [ ]
    .nlargest(2, "var")
    .index
)
print(x)

Prints:

MultiIndex([(2021-02-01, 'dogs'),
            (2021-01-01, 'cats')],
           names=['date', 'category'])
Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91