How to drop duplicate rows based on values of two columns?

Question

I have a data frame like this:

Category Date_1       Score_1    Date_2           Score_2
  A      13/11/2019    5        13/11/2019        10
  A      13/11/2019    5        14/11/2019        55
  A      13/11/2019    5        15/11/2019        45
  A      13/11/2019    5        16/11/2019        80
  A      14/11/2019    3        13/11/2019        10
  A      14/11/2019    3        14/11/2019        55
  A      14/11/2019    3        15/11/2019        45
  A      14/11/2019    3        16/11/2019        80
  A      15/11/2019    7        13/11/2019        10
  A      15/11/2019    7        14/11/2019        55
  A      15/11/2019    7        15/11/2019        45
  A      15/11/2019    7        16/11/2019        80
  B      13/11/2019    4        13/11/2019        18
  B      13/11/2019    4        14/11/2019        65
  B      13/11/2019    4        15/11/2019        75
  B      13/11/2019    4        16/11/2019        89
  B      14/11/2019    9        13/11/2019        18
  B      14/11/2019    9        14/11/2019        65
  B      14/11/2019    9        15/11/2019        75
  B      14/11/2019    9        16/11/2019        89
  B      15/11/2019    8        13/11/2019        18
  B      15/11/2019    8        14/11/2019        65
  B      15/11/2019    8        15/11/2019        75
  B      15/11/2019    8        16/11/2019        89

I want to keep the rows where both dates are same. I was doing this:

df.drop_duplicates(subset=['Date_1', 'Date_2'])

But it do not work. Can`t figure out how to drop those extra rows?

score 5 · Accepted Answer · answered Nov 21 '19 at 06:50

5

Use boolean indexing with compare both columns:

df1 = df[df['Date_1'] == df['Date_2'])

Or DataFrame.query:

df1 = df.query("Date_1 == Date_2")

answered Nov 21 '19 at 06:50

jezrael

822,522
95
1,334
1,252

Stoner · Answer 2 · 2019-11-21T08:59:23.837

You can use duplicated with the parameter subset for specifying columns to be checked with keep=False, for all duplicates for masking and filtering by boolean indexing. The following should work:

df = df[df.duplicated(subset=['Date_1', 'Date_2'], keep=False)]

Remark: Initially, I may have misread that OP wanted to drop duplicates, with answers below:

df = df.drop_duplicates(subset=['Date_1', 'Date_2'], keep='last')

You should always assign the df back like above to apply the changes. Otherwise you can add inplace=True to apply the changes without assignment:

df.drop_duplicates(subset=['Date_1', 'Date_2'], keep='last', inplace=True)

If this doesn't work as well as the solutions provided by others, you may want to check whether there is any difference in data types for those columns which are in subset, for example datetime formatting.

Actually, that answer does not answer asked question, as it does not "keep the rows where both dates are same." — Artyom Akselrod, Nov 21 '19 at 08:41
@ArtyomAkselrod Thanks for pointing that out. Initially I misread that OP wanted to drop duplicates. Give me a sec, I'll update the answer. — Stoner, Nov 21 '19 at 08:55

How to drop duplicate rows based on values of two columns?

2 Answers2