1

I want to add the particular columns (C, D, E, F, G) based on the duplicate rows of column B. Whereas the remaining non-duplicate rows unchanged. The output of column A must be the first index of duplicate rows.

I have a dataframe as follows:

A     B     C  D  E  F  G
box1  0487  1     1     1
box2  0487     1     1
blue  0478  1  1        1
gray  0478  1  1  1  1
gray  0478        1  1  1
flat  8704  1  1        1
clay  8704        1     1
dark  8740  1  1  1  1  1
late  4087  1     1     1

I want the output as follows:

A     B     C   D  E  F  G
box1  0487  1   1  1  1  1
blue  0478  2   2  2  2  2
flat  8704  1   1  1     2
dark  8740  1   1  1  1  1
late  4087  1      1     1

I am pleased to hear some suggestions.

ali
  • 151
  • 1
  • 15

1 Answers1

7

Create dictionary of columns names with aggregate functions and pass to agg, also here is necessary min_count=1 to sum for avoid 0 for sum NaNs values:

L = ['C','D','E','F','G']
d = {**dict.fromkeys(L, lambda x: x.sum(min_count=1)), **{'A':'first'}}
df = df.groupby('B', as_index=False, sort=False).agg(d).reindex(columns=df.columns)
print (df)
      A     B    C    D    E    F    G
0  box1  0487  1.0  1.0  1.0  1.0  1.0
1  blue  0478  2.0  2.0  2.0  2.0  2.0
2  flat  8704  1.0  1.0  1.0  NaN  2.0
3  dark  8740  1.0  1.0  1.0  1.0  1.0
4  late  4087  1.0  NaN  1.0  NaN  1.0

d = {**dict.fromkeys(L, 'sum'), **{'A':'first'}}
df = df.groupby('B', as_index=False, sort=False).agg(d).reindex(columns=df.columns)
print (df)
      A     B    C    D    E    F    G
0  box1  0487  1.0  1.0  1.0  1.0  1.0
1  blue  0478  2.0  2.0  2.0  2.0  2.0
2  flat  8704  1.0  1.0  1.0  0.0  2.0
3  dark  8740  1.0  1.0  1.0  1.0  1.0
4  late  4087  1.0  0.0  1.0  0.0  1.0
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • 1
    I had `df.groupby('B').agg( {**{ 'A': lambda x: x.head(1)}, **{i:'sum' for i in df.columns if i!='A'}}) ` . I was too slow :-( – Vivek Kalyanarangan Nov 05 '18 at 12:22
  • 4
    @RavinderSingh13- There is no easy way, in my opinion, I like [tutorials](http://pandas.pydata.org/pandas-docs/stable/tutorials.html) and especially [modern pandas](http://pandas.pydata.org/pandas-docs/stable/tutorials.html#modern-pandas) – jezrael Nov 05 '18 at 12:26
  • With Pandas he is the greatest expertise, kudos to expertise and experience @jezrael – Karn Kumar Nov 05 '18 at 12:35
  • @pygo - I think not, pandas developers like jeff, unutbu and many others are better, but they have no time - obviously create new versions of pandas or working wiht pandas github issuues ;) – jezrael Nov 05 '18 at 12:38
  • @jezrael- your solution works perfectly for the example dataframe neverthless I got the typeerror: cant convert 'int' object to str implicitly for my source code. Can you please suggest something? – ali Nov 05 '18 at 13:28
  • 1
    @Vishali - there is necessary dtypes of columns defined in `L = ['C','D','E','F','G']` have to be numeric, floats or integers. You can check it by `print (df[L].dtypes)` – jezrael Nov 05 '18 at 13:31