1

With Pandas, for the following data set

author1,category1,10.00
author1,category2,15.00
author1,category3,12.00
author2,category1,5.00
author2,category2,6.00
author2,category3,4.00
author2,category4,9.00
author3,category1,7.00
author3,category2,4.00
author3,category3,7.00

I would like to get the highest value for each author

author1,category2,15.00
author2,category4,9.00
author3,category1,7.00
author3,category3,7.00

(Apologies, I'm a pandas noob.)

Mike
  • 633
  • 3
  • 9
  • 18

2 Answers2

6
import pandas as pd

df = pd.read_csv("in.csv", names=("Author","Cat","Val"))

print(df.groupby(['Author'])['Val'].max())

To get the df:

inds = df.groupby(['Author'])['Val'].transform(max) == df['Val']
df = df[inds]
df.reset_index(drop=True, inplace=True)
print(df)
    Author        Cat  Val
0  author1  category2   15
1  author2  category4    9
2  author3  category1    7
3  author3  category3    7
Padraic Cunningham
  • 176,452
  • 29
  • 245
  • 321
  • Thanks a lot, this indeed works, but looks more arcane to me than what Jianxun Li proposed: I don't understand why it works (the "To get the df" bit). – Mike Jul 12 '15 at 12:00
2

Since you want to retrieve category column as well, a standard .agg on column val won't give you what you want. (also, since there are two values in author3 are 7, the approach by @Padraic Cunningham using.max() will only return one instance instead of both) You can define a customized apply function to accomplish your task.

import pandas as pd

# your data, assume columns names are: author, cat, val
# ===============================
print(df)


    author        cat  val
0  author1  category1   10
1  author1  category2   15
2  author1  category3   12
3  author2  category1    5
4  author2  category2    6
5  author2  category3    4
6  author2  category4    9
7  author3  category1    7
8  author3  category2    4
9  author3  category3    7

# processing
# ====================================
def func(group):
    return group.loc[group['val'] == group['val'].max()]

df.groupby('author', as_index=False).apply(func).reset_index(drop=True)


    author        cat  val
0  author1  category2   15
1  author2  category4    9
2  author3  category1    7
3  author3  category3    7    
Jianxun Li
  • 24,004
  • 10
  • 58
  • 76
  • Thank you, Jianxun Li. I will need to ask a follow-up question about how to get only the rows where the author only has one most expensive category. – Mike Jul 12 '15 at 12:13
  • @Mike by most expensive, do you mean the category with highest value? If that's the case, I think you just need to select `author` and `cat` columns from the last dataframe. – Jianxun Li Jul 12 '15 at 12:16
  • No, not quite. I have posted my other question here: http://stackoverflow.com/questions/31367672/with-pandas-in-python-select-only-the-rows-where-group-by-group-count-is-1 – Mike Jul 12 '15 at 13:18