1

I have the data in the .csv file uploaded in the following link

Click here for the data

In this file, I have the following columns:

Team    Group    Model   SimStage  Points  GpWinner  GpRunnerup 3rd   4th

There will be duplicates in the columns Team. Another column is SimStage. Simstage has a series containing data from 0 to N (in this case 0 to 4)

I would like to keep row for each team at each Simstage value (i.e. the rest will be removed). when we remove, the duplicates row with lower value in the column Points will be removed for each Team and SimStage. Since it is slightly difficult to explain using words alone, I attached a picture here.

Sample pic that shows which rows to remove

In this picture, the row with highlighted in red boxes will be be removed.

I used df.duplicates() but it does not work.

Seanny123
  • 8,776
  • 13
  • 68
  • 124
Zephyr
  • 1,332
  • 2
  • 13
  • 31
  • Hi @Dillion, Thanks for the reply. I am not removing first 3 rows. only teh row with highlighted in red box will be removed. it is not necessary to be the first row. the columns **points** with lower value will be removed. – Zephyr Jun 22 '18 at 07:29

2 Answers2

2

It looks like you want to only keep the highest value from the 'Points' column. Therefore, use the first aggregation function in pandas

Create the dataframe and call it df

data = {'Team': {0: 'Brazil',  1: 'Brazil',  2: 'Brazil',  3: 'Brazil',  4: 'Brazil',  5: 'Brazil',  6: 'Brazil',  7: 'Brazil',  8: 'Brazil',  9: 'Brazil'},
 'Group': {0: 'Group E',  1: 'Group E',  2: 'Group E',  3: 'Group E',  4: 'Group E',  5: 'Group E',  6: 'Group E',  7: 'Group E',  8: 'Group E',  9: 'Group E'},
 'Model': {0: 'ELO',  1: 'ELO',  2: 'ELO',  3: 'ELO',  4: 'ELO',  5: 'ELO',  6: 'ELO',  7: 'ELO',  8: 'ELO',  9: 'ELO'},
 'SimStage': {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3, 8: 4, 9: 4},
 'Points': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4, 5: 1, 6: 2, 7: 4, 8: 4, 9: 1},
 'GpWinner': {0: 0.2,  1: 0.2,  2: 0.2,  3: 0.2,  4: 0.2,  5: 0.0,  6: 0.2,  7: 0.2,  8: 0.2,  9: 0.0},
 'GpRunnerup': {0: 0.0,  1: 0.0,  2: 0.0,  3: 0.0,  4: 0.0,  5: 0.2,  6: 0.0,  7: 0.0,  8: 0.0,  9: 0.2},
 '3rd': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0},
 '4th': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0}}

df = pd.DataFrame(data)

# To be able to output the dataframe in your original order
columns_order = ['Team', 'Group', 'Model', 'SimStage', 'Points', 'GpWinner', 'GpRunnerup', '3rd', '4th']

Method 1

# Sort the values by 'Points' descending and 'SimStage' ascending
df = df.sort_values('Points', ascending=False)
df = df.sort_values('SimStage')

# Group the columns by the necessary columns
df = df.groupby(['Team', 'SimStage'], as_index=False).agg('first')

# Output the dataframe in the orginal order
df[columns_order]

Out[]: 
     Team    Group Model  SimStage  Points  GpWinner  GpRunnerup  3rd  4th
0  Brazil  Group E   ELO         0       4       0.2         0.0    0    0
1  Brazil  Group E   ELO         1       4       0.2         0.0    0    0
2  Brazil  Group E   ELO         2       4       0.2         0.0    0    0
3  Brazil  Group E   ELO         3       4       0.2         0.0    0    0
4  Brazil  Group E   ELO         4       4       0.2         0.0    0    0

Method 2

df.sort_values('Points', ascending=False).drop_duplicates(['Team', 'SimStage'])[columns_order]
Out[]: 
     Team    Group Model  SimStage  Points  GpWinner  GpRunnerup  3rd  4th
0  Brazil  Group E   ELO         0       4       0.2         0.0    0    0
2  Brazil  Group E   ELO         1       4       0.2         0.0    0    0
4  Brazil  Group E   ELO         2       4       0.2         0.0    0    0
7  Brazil  Group E   ELO         3       4       0.2         0.0    0    0
8  Brazil  Group E   ELO         4       4       0.2         0.0    0    0
Dillon
  • 997
  • 4
  • 13
  • Hi @Dillion, it removed the last 4 columns (GpWinner, GpRunnerUp, 3rd and 4th ) as well, I want to keep those columns too. How can i do that? Thanks for the reply. I have been working out this for hours – Zephyr Jun 22 '18 at 07:39
  • The updated solution should work now. Note the change near the end to: `.agg('first')` – Dillon Jun 22 '18 at 07:55
  • Thanks Dillion. I really appreciate your help. it saved me quite a few hours of time spending on it how to solve. – Zephyr Jun 22 '18 at 15:19
1

I am just creating a mini-dataset based on your dataset here with Team, SimStage and Points.

import pandas as pd

namesDf = pd.DataFrame() 
namesDf['Team'] = ['Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil']
namesDf['SimStage'] = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
namesDf['Points'] = [4, 4, 4, 4, 4, 1, 2, 4, 4, 1]

Now, for each Sim Stage, you want the highest Point value. So, I first group them by Team and Sim Stage and then Sort them by Points.

namesDf = namesDf.groupby(['Team', 'SimStage'], as_index = False).apply(lambda x: x.sort_values(['Points'], ascending = False)).reset_index(drop = True)

This will make my dataframe look like this, notice the change in Sim Stage with value 3:

     Team  SimStage  Points
0  Brazil         0       4
1  Brazil         0       4
2  Brazil         1       4
3  Brazil         1       4
4  Brazil         2       4
5  Brazil         2       1
6  Brazil         3       4
7  Brazil         3       2
8  Brazil         4       4
9  Brazil         4       1

And now I remove the duplicates by keeping the first instance of every team and sim stage.

namesDf = namesDf.drop_duplicates(subset=['Team', 'SimStage'], keep = 'first')

Final result:

     Team  SimStage  Points
0  Brazil         0       4
2  Brazil         1       4
4  Brazil         2       4
6  Brazil         3       4
8  Brazil         4       4
Ankur Sinha
  • 6,473
  • 7
  • 42
  • 73