Finding Duplicates based on equal values in multiple columns

Question

I used pandas to get a list of all duplicated emails, but not every duplicated email is actually a duplicate contact: at a small company, for example, all employees may share the same email address.

Email	FirstName	LastName	Phone	Mobile	Company
a@company-a.com	John	Doe	12342	65464	Company_a
a@company-a.com	John	Doe	43214	45645	Comp_ny A
a@company-a.com	Adam	Smith	34223	46456	Company A
b@company-b.com	Bill	Gates	23423	63453	Company B
b@company-b.com	Bill	Gates	32421	43244	Comp B
b@company-b.com	Elon	Musk	42342	34234	Company B

That's why I came up with the following condition to narrow down my list of email duplicates:

I want to extract all rows where Email, FirstName and LastName are all equal, because that almost certainly means the row is a real duplicate. The extracted dataframe should look like this in the end:

Email	FirstName	LastName	Phone	Mobile	Company
a@company-a.com	John	Doe	12342	65464	Company_a
a@company-a.com	John	Doe	43214	45645	Comp_ny A
b@company-b.com	Bill	Gates	23423	63453	Company B
b@company-b.com	Bill	Gates	32421	43244	Comp B

How can I get there? Is it possible to check for multiple equal conditions?

I would appreciate any feedback regarding the best practices.

Thank you!

Does this answer your question? [Grouping by multiple columns to find duplicate rows pandas](https://stackoverflow.com/questions/46640945/grouping-by-multiple-columns-to-find-duplicate-rows-pandas) — busybear, Jan 27 '21 at 16:11

Epsi95 · Answer 1 · 2021-01-27T16:23:56.590

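Use DataFrame.drop_duplicates with a subset of columns. This keeps only the first row of each (Email, FirstName, LastName) combination: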

df.drop_duplicates(subset=['Email', 'FirstName', 'LastName'], keep='first')

Output:

             Email  FirstName  LastName  Phone  Mobile    Company
0  a@company-a.com       John       Doe  12342   65464  Company_a
2  a@company-a.com       Adam     Smith  34223   46456  Company A
3  b@company-b.com       Bill     Gates  23423   63453  Company B
5  b@company-b.com       Elon      Musk  42342   34234  Company B

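To get the rows that drop_duplicates removed, i.e. the second and later occurrences of each combination (the first occurrence is not included):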

df[~df.index.isin(df.drop_duplicates(subset=['Email', 'FirstName', 'LastName'], keep='first').index)]

Output:

             Email  FirstName  LastName  Phone  Mobile    Company
1  a@company-a.com       John       Doe  43214   45645  Comp_ny A
4  b@company-b.com       Bill     Gates  32421   43244     Comp B
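
The expected result in the question contains every row of each duplicate group, including the first occurrence. One way to get that, sketched here as an alternative to the index-based filter above, is DataFrame.duplicated with keep=False. The sample frame is rebuilt from the question's table; the names key_cols, all_duplicates and later_duplicates are only illustrative.

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "Email": ["a@company-a.com"] * 3 + ["b@company-b.com"] * 3,
    "FirstName": ["John", "John", "Adam", "Bill", "Bill", "Elon"],
    "LastName": ["Doe", "Doe", "Smith", "Gates", "Gates", "Musk"],
    "Phone": [12342, 43214, 34223, 23423, 32421, 42342],
    "Mobile": [65464, 45645, 46456, 63453, 43244, 34234],
    "Company": ["Company_a", "Comp_ny A", "Company A",
                "Company B", "Comp B", "Company B"],
})

# Columns that must all match for two rows to count as the same contact.
key_cols = ["Email", "FirstName", "LastName"]

# keep=False flags every member of a duplicate group, first occurrence included.
all_duplicates = df[df.duplicated(subset=key_cols, keep=False)]

# keep="first" (the default) flags only the second and later occurrences,
# which matches the index-based filter shown above.
later_duplicates = df[df.duplicated(subset=key_cols, keep="first")]

print(all_duplicates)
```

With the sample data, all_duplicates holds rows 0, 1, 3 and 4 (both John Doe rows and both Bill Gates rows), while later_duplicates holds only rows 1 and 4.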
