5

I have a Dataframe like this:

name phase value
BOB 1 .9
BOB 2 .05
BOB 3 .05
JOHN 2 .45
JOHN 3 .45
JOHN 4 .05
FRANK 1 .4
FRANK 3 .6

I want to find which entry in column 'phase' has the maximum value in column 'value'.
If more than one share the same maximum value keep the first or a random value for 'phase'.


Desired result table:

name phase value
BOB 1 .9
JOHN 2 .45
FRANK 3 .6

my approach was:

df.groupby(['name'])[['phase','value']].max() 

but it returned incorrect values.

Daraan
  • 1,797
  • 13
  • 24

4 Answers4

4

You don't need to use groupby. Sort values by value and phase (adjust the order if necessary) and drop duplicates by name:

out = (df.sort_values(['value', 'phase'], ascending=[False, True])
         .drop_duplicates('name')
         .sort_index(ignore_index=True))
print(out)

# Output
    name  phase  value
0    BOB      1   0.90
1   JOHN      2   0.45
2  FRANK      3   0.60
Corralien
  • 109,409
  • 8
  • 28
  • 52
3

Try to sort the dataframe first:

df = df.sort_values(
    by=["name", "value", "phase"], ascending=[True, False, True]
)

x = df.groupby("name", as_index=False).first()
print(x)

Prints:

    name  phase  value
0    BOB      1   0.90
1  FRANK      1   0.60
2   JOHN      1   0.45
Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91
1

A possible solution, that could avoid sorting is with groupby:

df.loc[df.groupby('name', sort = False).value.idxmax()]

    name  phase  value
0    BOB      1   0.90
3   JOHN      2   0.45
7  FRANK      3   0.60
sammywemmy
  • 27,093
  • 4
  • 17
  • 31
1

You may check

out = df.sort_values('value',ascending=False).drop_duplicates('name').sort_index()
Out[434]: 
    name  phase  value
0    BOB      1   0.90
3   JOHN      2   0.45
7  FRANK      3   0.60
BENY
  • 317,841
  • 20
  • 164
  • 234