
I need to remove duplicate rows, taking all of the columns into account.

My data:

id   country  publisher   weak     A        B        C
123    US        X          1     6.77      0        0
123    US        X          1       0      1.23     88.7
456    BZ        Y          2       0      56.87    9.65      
456    BZ        Y          2     2.76       0       0  
456    BZ        Y          2       0        0       0

I used drop_duplicates:

df1 = df.drop_duplicates()

But I need a condition that will keep all the values > 0 for each id.

Also, I have more columns than just 'A', 'B', 'C', so I'm looking for a solution that takes all the columns into account.

Here is an example of what I'm looking for:

id   country  publisher  weak     A       B        C
123    US        X        1     6.77     1.23     88.7
456    BZ        Y        2     2.76     56.87    9.65
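For reference, a minimal sketch that rebuilds the sample data above as a DataFrame, so any snippet can be run directly (column names and values are taken from the table; `df` is the same variable used above):

import pandas as pd

# Sample data from the table above
df = pd.DataFrame({
    'id':        [123, 123, 456, 456, 456],
    'country':   ['US', 'US', 'BZ', 'BZ', 'BZ'],
    'publisher': ['X', 'X', 'Y', 'Y', 'Y'],
    'weak':      [1, 1, 2, 2, 2],
    'A':         [6.77, 0, 0, 2.76, 0],
    'B':         [0, 1.23, 56.87, 0, 0],
    'C':         [0, 88.7, 9.65, 0, 0],
})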
Lili

2 Answers


This will give you your desired output:

groups = df.groupby(['id', 'country', 'publisher']).sum()

Arpit
  • I have other columns that contain numbers that I don't want to sum. I will add that to my data so it will be clear. – Lili Aug 12 '20 at 13:29
  • I thought you had only these columns. – Arpit Aug 12 '20 at 13:34
  • Then `groups = df.groupby(['id','country','publisher']).max()` will work. – Arpit Aug 12 '20 at 13:55
  • Thanks, it works. My problem now is that I can't drop the id column; I get the error "['id'] not found in axis". Any idea how to handle this? – Lili Aug 13 '20 at 08:13
  • Simply use `groups = groups.reset_index()` and then `groups.drop(['id'], axis=1, inplace=True)`. – Arpit Aug 13 '20 at 08:18
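Putting the comment thread together, a sketch of the complete max-based approach (it relies on each column holding at most one value greater than 0 per group, as in the sample data, so the per-group max picks out exactly that value):

# Per-group max keeps the single non-zero value in each column
groups = df.groupby(['id', 'country', 'publisher']).max()

# The group keys became the index; turn them back into ordinary columns
groups = groups.reset_index()

# 'id' is now a regular column again, so it can be dropped
groups.drop(['id'], axis=1, inplace=True)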

Try doing:

cols = ['A', 'B', 'C']  # extend this list to cover every value column in your data

def app_func(s):
    # Mask zeros as NaN, back-fill the gaps within the group,
    # drop rows that are still incomplete, then de-duplicate.
    return s[~s.eq(0)].bfill().dropna().drop_duplicates()

df.groupby(['id', 'country', 'publisher'])[cols].apply(app_func).reset_index()
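With cols set to every value column (['A', 'B', 'C'] for the sample data), this reproduces the two desired rows. Note that because app_func returns a DataFrame for each group, pandas keeps the original row labels as an extra index level, so reset_index() may leave a spare level_3 column that you can drop afterwards.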
Ayoub ZAROU