using duplicates values from one column to remove entire row in pandas dataframe

Question

I have the data in the .csv file uploaded in the following link

In this file, I have the following columns:

Team    Group    Model   SimStage  Points  GpWinner  GpRunnerup 3rd   4th

There will be duplicates in the columns Team. Another column is SimStage. Simstage has a series containing data from 0 to N (in this case 0 to 4)

I would like to keep row for each team at each Simstage value (i.e. the rest will be removed). when we remove, the duplicates row with lower value in the column Points will be removed for each Team and SimStage. Since it is slightly difficult to explain using words alone, I attached a picture here.

In this picture, the row with highlighted in red boxes will be be removed.

I used df.duplicates() but it does not work.

Hi @Dillion, Thanks for the reply. I am not removing first 3 rows. only teh row with highlighted in red box will be removed. it is not necessary to be the first row. the columns **points** with lower value will be removed. — Zephyr, Jun 22 '18 at 07:29

Dillon · Accepted Answer · 2018-06-22T07:56:16.703

It looks like you want to only keep the highest value from the 'Points' column. Therefore, use the first aggregation function in pandas

Create the dataframe and call it df

data = {'Team': {0: 'Brazil',  1: 'Brazil',  2: 'Brazil',  3: 'Brazil',  4: 'Brazil',  5: 'Brazil',  6: 'Brazil',  7: 'Brazil',  8: 'Brazil',  9: 'Brazil'},
 'Group': {0: 'Group E',  1: 'Group E',  2: 'Group E',  3: 'Group E',  4: 'Group E',  5: 'Group E',  6: 'Group E',  7: 'Group E',  8: 'Group E',  9: 'Group E'},
 'Model': {0: 'ELO',  1: 'ELO',  2: 'ELO',  3: 'ELO',  4: 'ELO',  5: 'ELO',  6: 'ELO',  7: 'ELO',  8: 'ELO',  9: 'ELO'},
 'SimStage': {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3, 8: 4, 9: 4},
 'Points': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4, 5: 1, 6: 2, 7: 4, 8: 4, 9: 1},
 'GpWinner': {0: 0.2,  1: 0.2,  2: 0.2,  3: 0.2,  4: 0.2,  5: 0.0,  6: 0.2,  7: 0.2,  8: 0.2,  9: 0.0},
 'GpRunnerup': {0: 0.0,  1: 0.0,  2: 0.0,  3: 0.0,  4: 0.0,  5: 0.2,  6: 0.0,  7: 0.0,  8: 0.0,  9: 0.2},
 '3rd': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0},
 '4th': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0}}

df = pd.DataFrame(data)

# To be able to output the dataframe in your original order
columns_order = ['Team', 'Group', 'Model', 'SimStage', 'Points', 'GpWinner', 'GpRunnerup', '3rd', '4th']

Method 1

# Sort the values by 'Points' descending and 'SimStage' ascending
df = df.sort_values('Points', ascending=False)
df = df.sort_values('SimStage')

# Group the columns by the necessary columns
df = df.groupby(['Team', 'SimStage'], as_index=False).agg('first')

# Output the dataframe in the orginal order
df[columns_order]

Out[]: 
     Team    Group Model  SimStage  Points  GpWinner  GpRunnerup  3rd  4th
0  Brazil  Group E   ELO         0       4       0.2         0.0    0    0
1  Brazil  Group E   ELO         1       4       0.2         0.0    0    0
2  Brazil  Group E   ELO         2       4       0.2         0.0    0    0
3  Brazil  Group E   ELO         3       4       0.2         0.0    0    0
4  Brazil  Group E   ELO         4       4       0.2         0.0    0    0

Method 2

df.sort_values('Points', ascending=False).drop_duplicates(['Team', 'SimStage'])[columns_order]
Out[]: 
     Team    Group Model  SimStage  Points  GpWinner  GpRunnerup  3rd  4th
0  Brazil  Group E   ELO         0       4       0.2         0.0    0    0
2  Brazil  Group E   ELO         1       4       0.2         0.0    0    0
4  Brazil  Group E   ELO         2       4       0.2         0.0    0    0
7  Brazil  Group E   ELO         3       4       0.2         0.0    0    0
8  Brazil  Group E   ELO         4       4       0.2         0.0    0    0

Hi @Dillion, it removed the last 4 columns (GpWinner, GpRunnerUp, 3rd and 4th ) as well, I want to keep those columns too. How can i do that? Thanks for the reply. I have been working out this for hours — Zephyr, Jun 22 '18 at 07:39
The updated solution should work now. Note the change near the end to: `.agg('first')` — Dillon, Jun 22 '18 at 07:55
Thanks Dillion. I really appreciate your help. it saved me quite a few hours of time spending on it how to solve. — Zephyr, Jun 22 '18 at 15:19

Ankur Sinha · Answer 2 · 2018-06-22T07:45:15.943

I am just creating a mini-dataset based on your dataset here with Team, SimStage and Points.

import pandas as pd

namesDf = pd.DataFrame() 
namesDf['Team'] = ['Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil']
namesDf['SimStage'] = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
namesDf['Points'] = [4, 4, 4, 4, 4, 1, 2, 4, 4, 1]

Now, for each Sim Stage, you want the highest Point value. So, I first group them by Team and Sim Stage and then Sort them by Points.

namesDf = namesDf.groupby(['Team', 'SimStage'], as_index = False).apply(lambda x: x.sort_values(['Points'], ascending = False)).reset_index(drop = True)

This will make my dataframe look like this, notice the change in Sim Stage with value 3:

     Team  SimStage  Points
0  Brazil         0       4
1  Brazil         0       4
2  Brazil         1       4
3  Brazil         1       4
4  Brazil         2       4
5  Brazil         2       1
6  Brazil         3       4
7  Brazil         3       2
8  Brazil         4       4
9  Brazil         4       1

And now I remove the duplicates by keeping the first instance of every team and sim stage.

namesDf = namesDf.drop_duplicates(subset=['Team', 'SimStage'], keep = 'first')

Final result:

     Team  SimStage  Points
0  Brazil         0       4
2  Brazil         1       4
4  Brazil         2       4
6  Brazil         3       4
8  Brazil         4       4

Thanks for your help and I really appreciate it. – Zephyr Jun 22 '18 at 15:18 — Zephyr, Jun 22 '18 at 15:18

using duplicates values from one column to remove entire row in pandas dataframe

2 Answers2

Method 1

Method 2