In Python pandas I have a large data frame that looks like this:
import pandas as pd

df = pd.DataFrame({'a': ['foo', 'bar'] * 3,
                   'b': ['foo2', 'bar2'] * 3,
                   'c': ['foo3', 'bar3'] * 3,
                   'd': ['q', 'w', 'e', 'r', 't', 'y'],
                   'e': ['q2', 'w2', 'e2', 'r2', 't2', 'y2']})
     a     b     c  d   e
1  bar  bar2  bar3  w  w2
3  bar  bar2  bar3  r  r2
5  bar  bar2  bar3  y  y2
4  foo  foo2  foo3  t  t2
2  foo  foo2  foo3  e  e2
0  foo  foo2  foo3  q  q2
It contains a dozen columns with duplicated values (a, b, c, ...) and a few with unique values (d, e). I would like to collapse the duplicated rows and collect the unique values, i.e.:
     a     b     c      d         e
1  bar  bar2  bar3  w,r,y  w2,r2,y2
4  foo  foo2  foo3  t,e,q  t2,e2,q2
We can safely assume that the unique values occur only in 'd' and 'e', while the rest are always duplicated.
One way I can conceive of a solution is to group by all the duplicated columns and then apply a concatenation operation to the unique values:
df.groupby([df.a, df.b, df.c]).apply(lambda x: "{%s}" % ', '.join(x.d))
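which, if I am reading its output correctly, gives something like:

a    b     c
bar  bar2  bar3    {w, r, y}
foo  foo2  foo3    {q, e, t}
dtype: object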
One inconvenience is that I have to list all the duplicated columns if I want to have them in my output. A bigger problem is the fact that I am concatenating only the strings in 'd', while 'e' is also needed.
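For reference, here is a rough sketch of the direction I have been exploring; unique_cols, group_cols and result are just my own helper names, and I am not sure this is idiomatic, but it derives the grouping columns automatically and aggregates both unique columns:

# Sketch: treat every column except the unique ones as a grouping key,
# then join the strings of each unique column within each group.
unique_cols = ['d', 'e']
group_cols = [c for c in df.columns if c not in unique_cols]
result = df.groupby(group_cols, as_index=False).agg(
    {col: ','.join for col in unique_cols})

On the frame above this seems to give the two desired rows, although each group's strings are joined in original index order (e.g. 'q,e,t' for the foo group) rather than in the order shown in my example. I am still unsure whether this is the right approach.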
Any suggestions?