1

I have a large multi-index dataframe with about 3 million data points. The first index is the date, the second is the name of the object. Then, each object has a value like so:

df = pd.DataFrame({'year': [2012, 2012, 2012, 2013, 2013, 2013],
'item':['apples', 'metals', 'water', 'apples', 'metals', 'water'],
'value': [10, 14, 16, 17, 13, 14]})

df.set_index(['year', 'item'], inplace = True)

This is a smaller example of the dataframe I'm working with. I want to return a dataframe that keeps the exact same format, but only has the top two values from each year.

So the final dataframe should output metals and water from 2012 with their value, and apples and water from 2013 with their respective value, and in the same format as the original dataframe.

mdawg
  • 49
  • 5

2 Answers2

4

IIUC sort_values before groupby tail

df.sort_values('value').groupby(level=0).tail(2).sort_index()
Out[258]: 
             value
year item         
2012 metals     14
     water      16
2013 apples     17
     water      14
BENY
  • 317,841
  • 20
  • 164
  • 234
2

Applying nlargest

pd.concat([d.nlargest(2, columns=['value']) for _, d in df.groupby('year')])

             value
year item         
2012 water      16
     metals     14
2013 apples     17
     water      14
piRSquared
  • 285,575
  • 57
  • 475
  • 624