
I need to remove duplicate rows, taking all of the columns into account.

My data:

id   country  publisher   weak     A        B        C
123    US        X          1     6.77      0        0
123    US        X          1       0      1.23     88.7
456    BZ        Y          2       0      56.87    9.65      
456    BZ        Y          2     2.76       0       0  
456    BZ        Y          2       0        0       0

I used drop_duplicates:

df1 = df.drop_duplicates()

But I need a condition that will keep all the values > 0 for each id.

Also, I have more columns than just 'A', 'B', 'C', so I'm looking for a solution that takes all the columns into account.

Here is an example of what I'm looking for:

id   country  publisher  weak     A       B        C
123    US        X        1     6.77     1.23     88.7
456    BZ        Y        2     2.76     56.87    9.65
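For reference, a minimal sketch that rebuilds the sample data above as a DataFrame, so any snippet can be run directly (column names and values are taken from the table; `df` is the same variable used above):

import pandas as pd

# Sample data from the table above
df = pd.DataFrame({
    'id':        [123, 123, 456, 456, 456],
    'country':   ['US', 'US', 'BZ', 'BZ', 'BZ'],
    'publisher': ['X', 'X', 'Y', 'Y', 'Y'],
    'weak':      [1, 1, 2, 2, 2],
    'A':         [6.77, 0, 0, 2.76, 0],
    'B':         [0, 1.23, 56.87, 0, 0],
    'C':         [0, 88.7, 9.65, 0, 0],
})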
Lili

2 Answers


This will give you your desired output:

groups = df.groupby(['id', 'country', 'publisher']).sum()

Arpit
  • I have other columns that contain numbers that I don't want to sum. I will add that to my data so it will be clear. – Lili Aug 12 '20 at 13:29
  • I thought you had only these columns. – Arpit Aug 12 '20 at 13:34
  • Then `groups = df.groupby(['id','country','publisher']).max()` will work. – Arpit Aug 12 '20 at 13:55
  • Thanks, it works. My problem now is that I can't drop the id column; I get the error "['id'] not found in axis". Any idea how to handle this? – Lili Aug 13 '20 at 08:13
  • Simply use `groups = groups.reset_index()` and then `groups.drop(['id'], axis=1, inplace=True)`. – Arpit Aug 13 '20 at 08:18
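Putting the comment thread together, a sketch of the complete max-based approach (it relies on each column holding at most one value greater than 0 per group, as in the sample data, so the per-group max picks out exactly that value):

# Per-group max keeps the single non-zero value in each column
groups = df.groupby(['id', 'country', 'publisher']).max()

# The group keys became the index; turn them back into ordinary columns
groups = groups.reset_index()

# 'id' is now a regular column again, so it can be dropped
groups.drop(['id'], axis=1, inplace=True)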

Try doing:

cols = ['A', 'B', 'C']  # extend this list to cover every value column in your data

def app_func(s):
    # Mask zeros as NaN, back-fill the gaps within the group,
    # drop rows that are still incomplete, then de-duplicate.
    return s[~s.eq(0)].bfill().dropna().drop_duplicates()

df.groupby(['id', 'country', 'publisher'])[cols].apply(app_func).reset_index()
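With cols set to every value column (['A', 'B', 'C'] for the sample data), this reproduces the two desired rows. Note that because app_func returns a DataFrame for each group, pandas keeps the original row labels as an extra index level, so reset_index() may leave a spare level_3 column that you can drop afterwards.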
Ayoub ZAROU