0

I have a somewhat big pandas dataframe (100,000x9). The first two columns are a combination of names associated with a value (in both sides). I want to delete the lower value associated with a given combination.

I haven't tried anything yet, because I'm not sure how to tackle this problem. My first impression is that I need to use the apply function over the data frame, but I need to select each combination of 'first' and 'second', compare them and then delete that row.

df = pd.DataFrame(np.array([['John','Mary',5],['John','Mark',1],  ['Mary','John',2], ['Mary','Mark',1], ['Mark','John',3], ['Mark','Mary',5]]), columns=['first','second','third'])

df

first   second  third
0   John    Mary    5
1   John    Mark    1
2   Mary    John    2
3   Mary    Mark    1
4   Mark    John    3
5   Mark    Mary    5

My objective is to get this data frame

df_clean = pd.DataFrame(np.array([['John','Mary',5], ['Mark','John',3], ['Mark','Mary',5]]), columns=['first','second','third'])

df_clean

first   second  third
0   John    Mary    5
1   Mark    John    3
2   Mark    Mary    5

Any ideas?

MatGut
  • 35
  • 4
  • https://stackoverflow.com/questions/51182228/python-delete-duplicates-in-a-dataframe-based-on-two-columns-combinations/51182428#51182428 – BENY Apr 03 '19 at 22:26

1 Answers1

2

First we use np.sort to sort horizontally, then we use groupby with max function to get the highest value per unique value of first, second:

df[['first', 'second']] = np.sort(df[['first', 'second']], axis=1)

print(df.groupby(['first', 'second']).third.max().reset_index())
  first second third
0  John   Mark     3
1  John   Mary     5
2  Mark   Mary     5
Erfan
  • 40,971
  • 8
  • 66
  • 78