1

I'm working with Python Pandas trying to sort some student testing data. On occasion, students will test twice during the same testing window, and I want to save only the highest of the two tests. Here's an example of my dataset.

Name Score
Alice 32
Alice 75
John 89
Mark 40
Mark 70
Amy 60

Any ideas of how can I save only the higher score for each student?

Ms.Rog
  • 21
  • 2

4 Answers4

0

You can do this with groupby. It works like this:

df.groupby('Name').agg({'Score': 'max'})

It results in:

       Score
Name        
Alice     75
Amy       60
John      89
Mark      70

Btw. in that special setup, you could also use drop_duplicates to make the name unique after sorting on the score. This would yield the same result, but would not be extensible (e.g. if you later would like to add the average score etc). It would look like this:

df.sort_values(['Name', 'Score']).drop_duplicates(['Name'], keep='last')

From the test data you posted:

import pandas as pd
from io import StringIO

sio= StringIO("""Name Score
Alice 32
Alice 75
John 89
Mark 40
Mark 70
Amy 60 """)

df= pd.read_csv(sio, sep='\s+')
jottbe
  • 4,228
  • 1
  • 15
  • 31
  • I appreciate your edit, but it is, except for the sorting order, exactly what my answer proposes. :) – JE_Muc Nov 28 '20 at 17:18
  • Maybe this is because it is straightforward. We probably won't be nominated for a nobel prize, will we? Btw. Haven't seen your post. Where is it? – jottbe Nov 28 '20 at 17:20
  • Take it easy. This was not meant to be an offense. :) And considering your last question: I have no idea how fast you can type. :) – JE_Muc Nov 28 '20 at 17:24
0

This question has already been answered here on StackOverflow.

You can merge two pandas data frames and after that calculate the maximum number in each row. df1 and df2 are the pandas of students score:

import pandas as pd
df1 = pd.DataFrame({'Alice': 3,
                    'John': 8,
                    'Mark': 7.5,
                    'Amy': 0},
                    index=[0])

df2 = pd.DataFrame({'Alice': 7,
                    'Mark': 7},
                    index=[0])

result = pd.concat([df1, df2], sort=True)
result = result.T
result["maxvalue"] = result.max(axis=1)
Timus
  • 10,974
  • 5
  • 14
  • 28
ZahraRezaei
  • 251
  • 2
  • 14
0

If your data is in the dataframe df, you can sort by the score in descencing order and drop duplicate names, keeping the first:

df.sort_values(by='Score', ascending=False).drop_duplicates(subset='Name', keep='first')
JE_Muc
  • 5,403
  • 2
  • 26
  • 41
0

There are multiple ways to do that, two of them are:

In [8]: df = pd.DataFrame({"Score" : [32, 75, 89, 40, 70, 60],
   ...:             "Name" : ["Alice", "Alice", "John", "Mark", "Mark", "Amy"]})
   ...: df
Out[8]: 
   Score   Name
0     32  Alice
1     75  Alice
2     89   John
3     40   Mark
4     70   Mark
5     60    Amy

In [13]: %time df.groupby("Name").max()
CPU times: user 2.26 ms, sys: 286 µs, total: 2.54 ms
Wall time: 2.11 ms
Out[13]: 
       Score
Name        
Alice     75
Amy       60
John      89
Mark      70

In [14]: %time df.sort_values("Name").drop_duplicates(subset="Name", keep="last")
CPU times: user 2.25 ms, sys: 0 ns, total: 2.25 ms
Wall time: 1.89 ms
Out[14]: 
   Score   Name
1     75  Alice
5     60    Amy
2     89   John
4     70   Mark


Amir saleem
  • 1,404
  • 1
  • 8
  • 11