In Python pandas I have a large data frame that looks like this:

import pandas as pd

df = pd.DataFrame({'a': ['foo', 'bar'] * 3,
                   'b': ['foo2', 'bar2'] * 3,
                   'c': ['foo3', 'bar3'] * 3,
                   'd': ['q', 'w', 'e', 'r', 't', 'y'],
                   'e': ['q2', 'w2', 'e2', 'r2', 't2', 'y2']})


     a     b     c  d   e
1  bar  bar2  bar3  w  w2
3  bar  bar2  bar3  r  r2
5  bar  bar2  bar3  y  y2
4  foo  foo2  foo3  t  t2
2  foo  foo2  foo3  e  e2
0  foo  foo2  foo3  q  q2

It contains a dozen columns with duplicated values (a, b, c...) and a few with unique values (d, e). I would like to collapse the duplicated rows and collect the unique values, i.e.:

     a     b     c  d   e
1  bar  bar2  bar3  w,r,y  w2,r2,y2
4  foo  foo2  foo3  t,e,q  t2,e2,q2

We can safely assume that unique values appear only in 'd' and 'e', while the rest are always duplicated.

One way I can conceive of a solution is to group by all the duplicated columns and then apply a concatenation operation on the unique values:

df.groupby([df.a, df.b, df.c]).apply(lambda x: "{%s}" % ', '.join(x.d))

One inconvenience is that I have to list all the duplicated columns if I want to have them in my output. More of a problem is the fact that I am concatenating only the strings in 'd', while 'e' is also needed.

Any suggestions?

Lukasz Tracewski
  • Just an aside: you don't need to pass the columns as df columns; the following would suffice: `df.groupby(['a', 'b', 'c']).apply(lambda x: "{%s}" % ', '.join(x.d))` – EdChum Nov 27 '14 at 15:24
  • Thanks! The reason I used df-notation was because originally these names are really long, so I take advantage of autocompletion mechanism. To spare others the details I replaced them with letters. – Lukasz Tracewski Nov 27 '14 at 21:28

2 Answers

I think you can do something like this:

>>> df.groupby(['a', 'b', 'c']).agg(lambda col: ','.join(col))
                   d         e
a   b    c                    
bar bar2 bar3  w,r,y  w2,r2,y2
foo foo2 foo3  q,e,t  q2,e2,t2

Another way to do this, without listing all the grouping columns, is to name only the ones with unique values:

>>> gr_columns = [x for x in df.columns if x not in ['d','e']]
>>> df.groupby(gr_columns).agg(lambda col: ','.join(col))
                   d         e
a   b    c                    
bar bar2 bar3  w,r,y  w2,r2,y2
foo foo2 foo3  q,e,t  q2,e2,t2
Roman Pekar
    Note that `lambda col: ','.join(col)` can be written as `','.join`.. but on second thought, I'm not sure that will work, because I'm not sure how strict the OP wants to be with "unique". (I.e. we might have to get rid of the two `w`s if it were `'w,w,y'` instead.) – DSM Nov 27 '14 at 15:40
  • Thanks, worked great! All values within a group are unique, so the ','.join can work as well. Thanks for suggestion! – Lukasz Tracewski Nov 27 '14 at 21:30
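
Per DSM's caveat above, if duplicate values could ever appear within a group, the aggregator can deduplicate while joining. A minimal sketch (the `join_unique` helper is hypothetical, not part of the answer):

```python
import pandas as pd

# Sample frame where 'd' repeats within a group
df = pd.DataFrame({'a': ['foo', 'bar'] * 3,
                   'd': ['q', 'w', 'q', 'w', 't', 'y']})

def join_unique(col):
    # dict.fromkeys keeps the first occurrence of each value, preserving order
    return ','.join(dict.fromkeys(col))

print(df.groupby('a').agg(join_unique))
```

Here 'bar' collapses to `w,y` and 'foo' to `q,t`, with the original in-group order preserved.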

You could use `df.pivot_table()`, although it appears to be slightly slower than `df.groupby()` (as suggested in Roman's answer):

>>> %timeit df.pivot_table(index=['a','b','c'], values=['d','e'], aggfunc=lambda x: ','.join(x)).reset_index()
6.17 ms ± 131 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

vs

>>> %timeit df.groupby(['a', 'b', 'c']).agg(lambda col: ','.join(col)).reset_index()
4.09 ms ± 95.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Also, if you want the new columns to contain an actual list (rather than a comma-separated list-as-string), you can replace the lambda function `','.join(x)` with `list(x)`. And if you want the list to contain only unique elements, change the lambda to `list(set(x))`.
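
As a minimal sketch of the list variant, here is the same idea via `groupby().agg(list)`, assuming the sample frame from the question:

```python
import pandas as pd

df = pd.DataFrame({'a': ['foo', 'bar'] * 3,
                   'b': ['foo2', 'bar2'] * 3,
                   'c': ['foo3', 'bar3'] * 3,
                   'd': ['q', 'w', 'e', 'r', 't', 'y'],
                   'e': ['q2', 'w2', 'e2', 'r2', 't2', 'y2']})

# Aggregate each group's values into real Python lists
out = df.groupby(['a', 'b', 'c']).agg(list).reset_index()
print(out)

# To keep only unique elements instead:
# df.groupby(['a', 'b', 'c']).agg(lambda x: list(set(x)))
```

The 'bar' row ends up with `['w', 'r', 'y']` in 'd', since groupby preserves each group's original row order.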

MMelnicki