I'm conducting an A/B test and looking for an effective way to delete duplicate user IDs (visitorId column) that appear in both groups, the experiment and the control.

Here is an example:

visitorId date group
4256040402 2019-08-31 A
4256040402 2019-08-31 B
4256040402 2019-08-27 A
4256040402 2019-08-20 B

And the desired result:

visitorId date group
4256040402 2019-08-31 A
4256040402 2019-08-27 A
4256040402 2019-08-20 B

I'm looking for an efficient way that takes the date (date column) into account and deletes a duplicate only when the same visitor appears in both groups on the same day.

1 Answer

Try drop_duplicates(subset=['visitorId', 'date']). By default (keep='first') it keeps the first row for each visitorId/date pair and drops the rest:

print(df)

    visitorId        date group
0  4256040402  2019-08-31     A
1  4256040402  2019-08-31     B
2  4256040402  2019-08-27     A
3  4256040402  2019-08-20     B

df = df.drop_duplicates(subset=['visitorId', 'date'])

print(df)

    visitorId        date group
0  4256040402  2019-08-31     A
2  4256040402  2019-08-27     A
3  4256040402  2019-08-20     B
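
Note that drop_duplicates on visitorId and date alone would also drop repeated rows within the same group on the same day. If the condition "in both groups on the same day" needs to be enforced strictly, one possible sketch is below. It rebuilds the sample frame from the question; the names groups_per_day, is_repeat and result are illustrative only, and keeping the first row of each conflicting pair is an assumption taken from the desired output.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "visitorId": [4256040402, 4256040402, 4256040402, 4256040402],
    "date": ["2019-08-31", "2019-08-31", "2019-08-27", "2019-08-20"],
    "group": ["A", "B", "A", "B"],
})

# Number of distinct groups each visitor appears in on each day.
groups_per_day = df.groupby(["visitorId", "date"])["group"].transform("nunique")

# Rows that repeat an earlier (visitorId, date) pair.
is_repeat = df.duplicated(subset=["visitorId", "date"], keep="first")

# Drop a repeated row only when that visitor/day spans more than one group.
result = df[~((groups_per_day > 1) & is_repeat)]
print(result)
```

For the sample data this keeps the same rows as the drop_duplicates call above; the two approaches differ only when a visitor has several rows in a single group on the same day.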

dnyll