I use pandas.DataFrame.drop_duplicates() to drop rows where all column values are identical. However, for data quality analysis, I also need a DataFrame containing the duplicate rows that were dropped. How can I identify which rows will be dropped? One option is to compare the original DataFrame against the deduplicated one and find the indexes that are missing, but is there a better way to do this?

Example:

import pandas as pd

data =[[1,'A'],[2,'B'],[3,'C'],[1,'A'],[1,'A']]

df = pd.DataFrame(data,columns=['Numbers','Letters'])

df.drop_duplicates(keep='first',inplace=True) # This will drop rows 3 and 4

# Now how to create a dataframe with the duplicate records dropped only?
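For context, this is roughly the index-comparison workaround I have in mind (deduped and dropped are just illustrative names):

import pandas as pd

data = [[1, 'A'], [2, 'B'], [3, 'C'], [1, 'A'], [1, 'A']]

df = pd.DataFrame(data, columns=['Numbers', 'Letters'])

# Deduplicate into a new frame instead of in place,
# then keep the rows whose index is missing from the deduplicated result.
deduped = df.drop_duplicates(keep='first')

dropped = df.loc[df.index.difference(deduped.index)]  # rows at index 3 and 4

It works, but it needs an extra copy of the DataFrame, which is why I am asking whether there is a more direct way.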

1 Answer

import pandas as pd

data =[[1,'A'],[2,'B'],[3,'C'],[1,'A'],[1,'A']]

df = pd.DataFrame(data,columns=['Numbers','Letters'])


df.drop_duplicates()  # keeps the first occurrence of each duplicated row and drops the rest

Output

    Numbers Letters
0   1       A
1   2       B
2   3       C

and, to get only the duplicate rows that would be dropped:

df.loc[df.duplicated()]  # duplicated() marks every occurrence after the first as True

Output

    Numbers Letters
3   1       A
4   1       A
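
As a side note (not part of the question, but standard pandas behavior): duplicated() accepts the same keep argument as drop_duplicates(), so keep=False flags every member of a duplicated group, including the first occurrence.

df.loc[df.duplicated(keep=False)]  # all rows that have at least one duplicate

Output

    Numbers Letters
0   1       A
3   1       A
4   1       A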