I used pandas to get a list of all Email duplicates, but not every duplicate email is actually a duplicate contact. For example, a small company may have all of its employees share the same email address.

Email FirstName LastName Phone Mobile Company
a@company-a.com John Doe 12342 65464 Company_a
a@company-a.com John Doe 43214 45645 Comp_ny A
a@company-a.com Adam Smith 34223 46456 Company A
b@company-b.com Bill Gates 23423 63453 Company B
b@company-b.com Bill Gates 32421 43244 Comp B
b@company-b.com Elon Musk 42342 34234 Company B

That's why I came up with the following condition to narrow down my list of Email duplicates:

I want to extract all rows where Email, FirstName and LastName are all equal, because that almost certainly means the row is a real duplicate. The extracted dataframe should look like this in the end:

Email FirstName LastName Phone Mobile Company
a@company-a.com John Doe 12342 65464 Company_a
a@company-a.com John Doe 43214 45645 Comp_ny A
b@company-b.com Bill Gates 23423 63453 Company B
b@company-b.com Bill Gates 32421 43244 Comp B

How can I get there? Is it possible to check several columns for equality at once?

I would appreciate any feedback regarding the best practices.

Thank you!

Epsi95
Lekü
  • Does this answer your question? [Grouping by multiple columns to find duplicate rows pandas](https://stackoverflow.com/questions/46640945/grouping-by-multiple-columns-to-find-duplicate-rows-pandas) – busybear Jan 27 '21 at 16:11

1 Answer

Use DataFrame.drop_duplicates to keep only the first row of each (Email, FirstName, LastName) combination:

df.drop_duplicates(subset=['Email', 'FirstName', 'LastName'], keep='first')

output

Email   FirstName   LastName    Phone   Mobile  Company
0   a@company-a.com John    Doe 12342   65464   Company_a
2   a@company-a.com Adam    Smith   34223   46456   Company A
3   b@company-b.com Bill    Gates   23423   63453   Company B
5   b@company-b.com Elon    Musk    42342   34234   Company B

To get the duplicates (every row after the first occurrence of each combination):

df[~df.index.isin(df.drop_duplicates(subset=['Email', 'FirstName', 'LastName'], keep='first').index)]

output


Email   FirstName   LastName    Phone   Mobile  Company
1   a@company-a.com John    Doe 43214   45645   Comp_ny A
4   b@company-b.com Bill    Gates   32421   43244   Comp B
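
Note that the result above contains only the repeated rows (index 1 and 4), not the first occurrences. If the goal is the table from the question, where every member of a duplicate group is kept, one possible approach is DataFrame.duplicated with keep=False. The following is a minimal sketch rather than a definitive solution: the dataframe is rebuilt from the sample data in the question, and the names key_columns, is_duplicate and real_duplicates are only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {
        "Email": ["a@company-a.com"] * 3 + ["b@company-b.com"] * 3,
        "FirstName": ["John", "John", "Adam", "Bill", "Bill", "Elon"],
        "LastName": ["Doe", "Doe", "Smith", "Gates", "Gates", "Musk"],
        "Phone": [12342, 43214, 34223, 23423, 32421, 42342],
        "Mobile": [65464, 45645, 46456, 63453, 43244, 34234],
        "Company": [
            "Company_a", "Comp_ny A", "Company A",
            "Company B", "Comp B", "Company B",
        ],
    }
)

# Columns that must all match for two rows to count as the same contact.
key_columns = ["Email", "FirstName", "LastName"]

# keep=False marks every row of a duplicate group, including the first one.
is_duplicate = df.duplicated(subset=key_columns, keep=False)

# Boolean indexing keeps only the rows flagged as duplicates.
real_duplicates = df[is_duplicate]
print(real_duplicates)
```

With the sample data this should keep the rows with index 0, 1, 3 and 4, i.e. both John Doe rows and both Bill Gates rows.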
Epsi95