
I would like to drop all values which are duplicates across a subset of two or more columns, without removing the entire row.

Dataframe:

    A   B   C
0   foo g   A
1   foo g   G
2   yes y   B
3   bar y   B

Desired result:

    A   B   C
0   foo g   A
1   NaN NaN G
2   yes y   B
3   bar NaN NaN

I have tried drop_duplicates() by splitting the data into new data frames by column and then re-appending them, but this had its own issues.

I have also tried this solution and this one, but still am stuck. Any guidance would be much appreciated.

(updated original question)

btroppo

3 Answers


Go through the code below; it clearly shows the difference between mask and where.

import pandas as pd
import numpy as np


df = pd.DataFrame({'A': ['foo', 'foo', 'yes', 'bar'],
                   'B': ['g', 'g', 'y', 'y'],
                   'C': ['A', 'G', 'B', 'B']})
print(df)
"""
     A  B  C
0  foo  g  A
1  foo  g  G
2  yes  y  B
3  bar  y  B

"""

aa = df.apply(pd.Series.duplicated)
print(aa)
"""
       A      B      C
0  False  False  False
1   True   True  False
2  False  False  False
3  False   True   True
"""
# where() keeps cells where the condition is True, so invert the duplicate mask with ~
using_where = df.where(~aa)
print(using_where)
"""
    A    B    C
0  foo    g    A
1  NaN  NaN    G
2  yes    y    B
3  bar  NaN  NaN

"""
# mask() replaces cells where the condition is True, so pass the duplicate mask directly
using_mask = df.mask(aa)
print(using_mask)

"""
     A    B    C
0  foo    g    A
1  NaN  NaN    G
2  yes    y    B
3  bar  NaN  NaN
"""
Soudipta Dutta
    It's a great answer. Can I ask you what the flag `~` does in where condition? – HSL Mar 17 '23 at 02:17

Without removing entire rows, you can replace the duplicated values with NaN.

import numpy as np

# df: your dataframe
for c_name in df.columns:
    duplicated = df.duplicated(c_name)
    df.loc[duplicated, c_name] = np.nan

print(df)

I referred to this.

HSL
  • Thanks for your answer @HSL. In the above example I wanted these removed because the duplicates occur across a subset of more than 1 column (2), not 1 column alone. Do you know how to change your code to only replace duplicates with NaN if they occur over >2 columns? (I have updated my original question) – btroppo Mar 17 '23 at 21:22
  • @btroppo I cannot understand what you say. What do you want to add to this? This table that you explains is 4 rows and 3 columns. You mean, check duplicates over more 2 columns, not 1? Because this code is checking just 1 column? – HSL Mar 18 '23 at 04:51
  • Sorry, yes that's right. Your code removes duplicates occurring in each column, but if possible I need duplicates removed only if they occur over >2 columns. For example, if I use your code on the dataframe from my example but column A row 0 held a different string, say "eg", then the output would still replace column B row 1 "g" with NaN, and I don't want that – btroppo Mar 19 '23 at 02:11
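Following up on the comment thread: pandas can flag rows that are duplicated across a *combination* of columns via the `subset` argument of `DataFrame.duplicated`, and then blank only those columns. A minimal sketch of that approach (assuming, as the comments suggest, that a cell should only be blanked when the whole (A, B) pair repeats):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['foo', 'foo', 'yes', 'bar'],
                   'B': ['g', 'g', 'y', 'y'],
                   'C': ['A', 'G', 'B', 'B']})

# Flag rows whose (A, B) pair has already appeared.
dup = df.duplicated(subset=['A', 'B'])

# Blank A and B only on those rows. Row 1 (foo, g) is a repeat, so it
# becomes NaN; row 3 keeps its 'y' because (bar, y) is a unique pair.
df.loc[dup, ['A', 'B']] = np.nan
print(df)
```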

try this:

result = df.mask(df.apply(pd.Series.duplicated))
print(result)
>>>
     A    B    C
0  foo    g    A
1  NaN  NaN    G
2  yes    y    B
3  bar  NaN  NaN
ziying35
  • Many thanks. I'm sorry, I should have been a bit clearer and have updated the original post: I want to keep data in certain columns when they occur as duplicates in only one column. In the above example I wanted these removed because the duplicates occur across a subset of more than 1 column (2), not one column alone. Does that make sense? – btroppo Mar 16 '23 at 20:18