import pandas as pd

data = [["John", "Alzheimer's", "Infection", "Alzheimer's"],
        ["Kevin", "Pneumonia", "Pneumonia", "Tuberculosis"]]
df = pd.DataFrame(data, columns=['Name', 'Problem1', 'Problem2', 'Problem3'])

    Name     Problem1   Problem2      Problem3
0   John  Alzheimer's  Infection   Alzheimer's
1  Kevin    Pneumonia  Pneumonia  Tuberculosis

In this data frame, I would like to read through each row and remove duplicates so that each person's problem is only reported once. This would mean removing "Alzheimer's" as a duplicate in row 1. I tried the drop_duplicates() function but this removes the entire row.

Any help would be appreciated!

dawgtor

4 Answers


First, recreate the example data:

import pandas as pd

data = [["John", "Alzheimer's", "Infection", "Alzheimer's"],
        ["Kevin", "Pneumonia", "Pneumonia", "Tuberculosis"]]
df = pd.DataFrame(data, columns=['Name', 'Problem1', 'Problem2', 'Problem3'])

df
    Name     Problem1   Problem2      Problem3
0   John  Alzheimer's  Infection   Alzheimer's
1  Kevin    Pneumonia  Pneumonia  Tuberculosis

Now replace each duplicate with a blank string:

df['Problem2'] = df.apply(lambda x: x['Problem2'] if x['Problem2'] != x['Problem1'] else ' ', axis=1)
df['Problem3'] = df.apply(lambda x: x['Problem3'] if x['Problem3'] not in (x['Problem1'], x['Problem2']) else ' ', axis=1)
df

(the same frame, now with John's duplicate "Alzheimer's" in Problem3 and Kevin's duplicate "Pneumonia" in Problem2 replaced by blanks)

Ran A
  • Thank you for your help! My actual data frame has about 900 columns. Would there be an easier way to do this for all of them? – dawgtor Aug 12 '22 at 22:15
  • @jroy1 - see my answer using apply and duplicated. It will apply to all columns simultaneously – Joe Carboni Aug 12 '22 at 23:04

You can try the df.duplicated function for this. It works much like df.drop_duplicates but returns a boolean series instead of removing the duplicates. You can then use that boolean series to index your initial dataframe and set the duplicated values to None.
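A minimal sketch of that idea, applying duplicated row-wise and masking the flagged cells (the data and column names follow the question's example):

```python
import pandas as pd

data = [["John", "Alzheimer's", "Infection", "Alzheimer's"],
        ["Kevin", "Pneumonia", "Pneumonia", "Tuberculosis"]]
df = pd.DataFrame(data, columns=['Name', 'Problem1', 'Problem2', 'Problem3'])

# Boolean mask: True where a value repeats earlier in the same row
dupe_mask = df.apply(lambda row: row.duplicated(), axis=1)

# mask() replaces the flagged cells with NaN
deduped = df.mask(dupe_mask)
print(deduped)
```

Using mask() keeps the frame's shape intact, which the plain drop_duplicates approach does not.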

MangoNrFive

Use apply and duplicated.

Make sure to pass axis=1 to apply so it operates on rows instead of columns. duplicated returns a boolean series in which the first occurrence of each value is False by default. Negating that series with ~ keeps the non-duplicated values and leaves out the duplicated ones.

Setup of example

import pandas as pd

data=[["John","Alzheimer's","Infection","Alzheimer's"],["Kevin","Pneumonia","Pneumonia","Tuberculosis"]]
df=pd.DataFrame(data,columns=['Name','Problem1','Problem2','Problem3'])

df
    Name     Problem1   Problem2      Problem3
0   John  Alzheimer's  Infection   Alzheimer's
1  Kevin    Pneumonia  Pneumonia  Tuberculosis

Deduplication

deduped_df = df.apply(lambda row: row[~row.duplicated()], axis=1)

output

>>> deduped_df
    Name     Problem1   Problem2      Problem3
0   John  Alzheimer's  Infection           NaN
1  Kevin    Pneumonia        NaN  Tuberculosis
Joe Carboni

I wouldn't keep a wide-style data frame here; I would convert it to long format. Thus:

import pandas as pd

data = [["John", "Alzheimer's", "Infection", "Alzheimer's"],
        ["Kevin", "Pneumonia", "Pneumonia", "Tuberculosis"]]
df = pd.DataFrame(data, columns=['Name', 'Problem1', 'Problem2', 'Problem3'])

df.rename(columns=str.lower, inplace=True)  # personal preference
long_df = pd.wide_to_long(df, 'problem', i='name', j='index').sort_index()

This yields a table that looks like this:

                  problem
name  index              
John  1       Alzheimer's
      2         Infection
      3       Alzheimer's
Kevin 1         Pneumonia
      2         Pneumonia
      3      Tuberculosis

Then you can just de-duplicate like normal:

>>> long_df.reset_index().drop_duplicates(['name', 'problem'])
    name  index       problem
0   John      1   Alzheimer's
1   John      2     Infection
3  Kevin      1     Pneumonia
5  Kevin      3  Tuberculosis

If you really do need the result in wide format, you can unstack() and rename the resulting columns to turn it back into a wide data frame with the same form as you started with.
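A sketch of that round trip, assuming long_df is built as above: de-duplicate in long form, unstack the index level back into columns, then restore the original column names.

```python
import pandas as pd

data = [["John", "Alzheimer's", "Infection", "Alzheimer's"],
        ["Kevin", "Pneumonia", "Pneumonia", "Tuberculosis"]]
df = pd.DataFrame(data, columns=['Name', 'Problem1', 'Problem2', 'Problem3'])
df.rename(columns=str.lower, inplace=True)
long_df = pd.wide_to_long(df, 'problem', i='name', j='index').sort_index()

# De-duplicate in long form, then pivot the 'index' level back into columns
deduped = long_df.reset_index().drop_duplicates(['name', 'problem'])
wide = deduped.set_index(['name', 'index'])['problem'].unstack()

# Restore the original column naming (problem1, problem2, ...)
wide.columns = [f'problem{c}' for c in wide.columns]
print(wide)
```

Cells whose values were dropped as duplicates come back as NaN, so the frame keeps its original shape.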

ifly6