import pandas as pd

data = [["John", "Alzheimer's", "Infection", "Alzheimer's"],
        ["Kevin", "Pneumonia", "Pneumonia", "Tuberculosis"]]
df = pd.DataFrame(data, columns=['Name', 'Problem1', 'Problem2', 'Problem3'])

    Name     Problem1   Problem2      Problem3
0   John  Alzheimer's  Infection   Alzheimer's
1  Kevin    Pneumonia  Pneumonia  Tuberculosis

In this data frame, I would like to read through each row and remove duplicates so that each person's problem is only reported once. This would mean removing "Alzheimer's" as a duplicate in row 1. I tried the drop_duplicates() function but this removes the entire row.

Any help would be appreciated!

dawgtor

4 Answers


First, recreate the example data:

import pandas as pd

data = [["John", "Alzheimer's", "Infection", "Alzheimer's"],
        ["Kevin", "Pneumonia", "Pneumonia", "Tuberculosis"]]
df = pd.DataFrame(data, columns=['Name', 'Problem1', 'Problem2', 'Problem3'])

df
    Name     Problem1   Problem2      Problem3
0   John  Alzheimer's  Infection   Alzheimer's
1  Kevin    Pneumonia  Pneumonia  Tuberculosis

Now replace each duplicate with a blank string:

df['Problem2'] = df.apply(lambda x: x['Problem2'] if x['Problem2'] != x['Problem1'] else ' ', axis=1)
df['Problem3'] = df.apply(lambda x: x['Problem3'] if x['Problem3'] not in (x['Problem1'], x['Problem2']) else ' ', axis=1)
df

(the same frame, now with John's duplicate "Alzheimer's" in Problem3 and Kevin's duplicate "Pneumonia" in Problem2 replaced by blanks)

Ran A
  • Thank you for your help! My actual data frame has about 900 columns. Would there be an easier way to do this for all of them? – dawgtor Aug 12 '22 at 22:15
  • @jroy1 - see my answer using apply and duplicated. It will apply to all columns simultaneously – Joe Carboni Aug 12 '22 at 23:04

You can try the df.duplicated function for this. It works much like df.drop_duplicates but returns a boolean series instead of removing the duplicates. You can then use that boolean series to index your initial dataframe and set the duplicated values to None.
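A minimal sketch of that idea, applying duplicated row-wise and masking the flagged cells (the data and column names follow the question's example):

```python
import pandas as pd

data = [["John", "Alzheimer's", "Infection", "Alzheimer's"],
        ["Kevin", "Pneumonia", "Pneumonia", "Tuberculosis"]]
df = pd.DataFrame(data, columns=['Name', 'Problem1', 'Problem2', 'Problem3'])

# Boolean mask: True where a value repeats earlier in the same row
dupe_mask = df.apply(lambda row: row.duplicated(), axis=1)

# mask() replaces the flagged cells with NaN
deduped = df.mask(dupe_mask)
print(deduped)
```

Using mask() keeps the frame's shape intact, which the plain drop_duplicates approach does not.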

MangoNrFive

Use apply and duplicated.

Make sure to pass axis=1 to apply so it operates on rows instead of columns. duplicated returns a boolean series in which the first occurrence of each value is False by default. Negating that series with ~ keeps the non-duplicated values and leaves out the duplicated ones.

Setup of example

import pandas as pd

data=[["John","Alzheimer's","Infection","Alzheimer's"],["Kevin","Pneumonia","Pneumonia","Tuberculosis"]]
df=pd.DataFrame(data,columns=['Name','Problem1','Problem2','Problem3'])

df
    Name     Problem1   Problem2      Problem3
0   John  Alzheimer's  Infection   Alzheimer's
1  Kevin    Pneumonia  Pneumonia  Tuberculosis

Deduplication

deduped_df = df.apply(lambda row: row[~row.duplicated()], axis=1)

output

>>> deduped_df
    Name     Problem1   Problem2      Problem3
0   John  Alzheimer's  Infection           NaN
1  Kevin    Pneumonia        NaN  Tuberculosis
Joe Carboni

I wouldn't keep a wide-style data frame here; I would convert it to long format. Thus:

import pandas as pd

data = [["John", "Alzheimer's", "Infection", "Alzheimer's"],
        ["Kevin", "Pneumonia", "Pneumonia", "Tuberculosis"]]
df = pd.DataFrame(data, columns=['Name', 'Problem1', 'Problem2', 'Problem3'])

df.rename(columns=str.lower, inplace=True)  # personal preference
long_df = pd.wide_to_long(df, 'problem', i='name', j='index').sort_index()

This yields a table that looks like this:

                  problem
name  index              
John  1       Alzheimer's
      2         Infection
      3       Alzheimer's
Kevin 1         Pneumonia
      2         Pneumonia
      3      Tuberculosis

Then you can just de-duplicate like normal:

>>> long_df.reset_index().drop_duplicates(['name', 'problem'])
    name  index       problem
0   John      1   Alzheimer's
1   John      2     Infection
3  Kevin      1     Pneumonia
5  Kevin      3  Tuberculosis

If you really do need the result in wide format, you can unstack() and rename the resulting columns to turn it back into a wide data frame with the same form as you started with.
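A sketch of that round trip, assuming long_df is built as above: de-duplicate in long form, unstack the index level back into columns, then restore the original column names.

```python
import pandas as pd

data = [["John", "Alzheimer's", "Infection", "Alzheimer's"],
        ["Kevin", "Pneumonia", "Pneumonia", "Tuberculosis"]]
df = pd.DataFrame(data, columns=['Name', 'Problem1', 'Problem2', 'Problem3'])
df.rename(columns=str.lower, inplace=True)
long_df = pd.wide_to_long(df, 'problem', i='name', j='index').sort_index()

# De-duplicate in long form, then pivot the 'index' level back into columns
deduped = long_df.reset_index().drop_duplicates(['name', 'problem'])
wide = deduped.set_index(['name', 'index'])['problem'].unstack()

# Restore the original column naming (problem1, problem2, ...)
wide.columns = [f'problem{c}' for c in wide.columns]
print(wide)
```

Cells whose values were dropped as duplicates come back as NaN, so the frame keeps its original shape.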

ifly6