How to select duplicate rows with pandas?

Question

I have a dataframe like this:

import pandas as pd
dic = {'A':[100,200,250,300],
       'B':['ci','ci','po','pa'],
       'C':['s','t','p','w']}
df = pd.DataFrame(dic)

My goal is to split the rows into two dataframes:

df1 = contains all the rows whose value in column B is not repeated (unique rows).
df2 = contains only the rows whose value in column B is repeated.

The result should look like this:

df1 =      A  B C         df2 =     A  B C
      0  250 po p               0  100 ci s
      1  300 pa w               1  200 ci t

Note:

- The dataframes can be very large and may have many repeated values in column B, so the answer should be as generic as possible.
- If there are no duplicates, df2 should be empty and all the rows should be in df1.

score 29 · Accepted Answer · edited Jul 13 '17 at 10:27

You can use Series.duplicated with the parameter keep=False to build a boolean mask that marks every row whose value in B occurs more than once, then select rows with boolean indexing. Use ~ to invert the mask:

mask = df.B.duplicated(keep=False)
print (mask)
0     True
1     True
2    False
3    False
Name: B, dtype: bool

print (df[mask])
     A   B  C
0  100  ci  s
1  200  ci  t

print (df[~mask])
     A   B  C
2  250  po  p
3  300  pa  w
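
To tie the mask back to the df1 and df2 named in the question, here is a minimal sketch. The reset_index(drop=True) call is an assumption, added only because the expected output in the question is numbered from 0. The unique_df frame is a hypothetical example included to show the no-duplicates case.

```python
import pandas as pd

df = pd.DataFrame({'A': [100, 200, 250, 300],
                   'B': ['ci', 'ci', 'po', 'pa'],
                   'C': ['s', 't', 'p', 'w']})

# True for every row whose value in B appears more than once
mask = df['B'].duplicated(keep=False)

# Rows with a unique B value; the index is renumbered from 0 (assumption)
df1 = df[~mask].reset_index(drop=True)
# Rows whose B value is repeated
df2 = df[mask].reset_index(drop=True)

# Hypothetical frame with no repeated values in B
unique_df = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})
no_dup_mask = unique_df['B'].duplicated(keep=False)

# The mask is all False, so the "duplicates" frame is empty
# and every row lands in the "unique" frame
empty_df2 = unique_df[no_dup_mask]
full_df1 = unique_df[~no_dup_mask]
```

Because the mask is all False when nothing repeats, no special case is needed: df[mask] is simply an empty dataframe with the same columns.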

edited Jul 13 '17 at 10:27

SergiyKolesnikov

answered Dec 08 '16 at 15:29

jezrael

The answer is quite good but it is not generic enough since if there are no duplicates, I get df[mask] as full. I will update the question. – Federico Gentile Dec 08 '16 at 15:41
I don't understand what's your problem with this answer, even looking at your update to your original question – Julien Marrec Dec 08 '16 at 15:44
@FedericoGentile - Do you want to test whether the dataframe is empty? `print('empty' if df2.empty else 'not empty')` – jezrael Dec 09 '16 at 06:48
No problem, I already found my issue... the answer is perfect... I just called a variable with another name and I had weird results – Federico Gentile Dec 09 '16 at 08:09
How would you ignore `NaN` and `null` values in `duplicated`? – Superdooperhero Jan 26 '20 at 09:15
@Superdooperhero - Can you explain more? Something like `df.B.duplicated(keep=False) | df.B.isna()` ? – jezrael Jan 26 '20 at 09:18
@jezrael I don't want to know about `NaN` duplicates. So ignore rows where the column checked for duplicates has `NaN`. – Superdooperhero Jan 26 '20 at 09:21
@Superdooperhero - OK, then `df.B.duplicated(keep=False) & df.B.notna()` should work – jezrael Jan 26 '20 at 09:23
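
The NaN handling discussed in the comments above can be sketched as follows. The sample frame is hypothetical (it is not the data from the question). It relies on the fact that duplicated treats missing values as equal to each other, so repeated NaN entries are flagged unless they are masked out.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two repeated 'ci' values and two missing values in B
df = pd.DataFrame({'A': [100, 200, 250, 300, 400],
                   'B': ['ci', 'ci', np.nan, np.nan, 'pa']})

# Plain mask: the two NaN rows are flagged as duplicates of each other
plain_mask = df['B'].duplicated(keep=False)

# NaN-safe mask: keep only duplicates whose value in B is not missing
nan_safe_mask = plain_mask & df['B'].notna()

duplicates_without_nan = df[nan_safe_mask]
everything_else = df[~nan_safe_mask]
```

With this mask the NaN rows fall into the non-duplicate frame. Whether they belong there or should be dropped entirely depends on the use case.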
