Assign groupby-apply result to parent dataframe

Question

I have the following data frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})

     A      B         C         D
0  foo    one  0.478183 -1.267588
1  bar    one  0.555985 -2.143590
2  foo    two -1.592865  1.251546
3  bar  three  0.174138 -0.708198
4  foo    two  0.302215 -0.219041
5  bar    two -0.034550 -0.965414
6  foo    one  1.310828 -0.388601
7  foo  three  0.357659 -1.610443

I'm trying to add another column holding column C min-max normalized within each group of A:

normed = df.groupby('A').apply(lambda x: (x['C']-min(x['C']))/(max(x['C'])-min(x['C'])))

A     
bar  1    0.000000
     3    0.033396
     5    1.000000
foo  0    1.000000
     2    0.413716
     4    0.000000
     6    0.441061
     7    0.357787

Finally, I want to join this result back to df (using advice from a similar question):

df.join(normed, on='A', rsuffix='_normed')

However, I get an error:

ValueError: len(left_on) must equal the number of levels in the index of "right"

How can I add the normed result back to the data frame df?

Note that the problem basically goes away if you use `transform` instead of `apply`. Also you can do `groupby('A')['C']` instead of `groupby('A')` for much cleaner code. See my answer below for the full syntax. — JohnE, Nov 14 '16 at 21:07

score 3 · Accepted Answer · answered Nov 14 '16 at 16:10


You get this error because normed has a MultiIndex with two levels, while on='A' supplies only one key. The first level is the group key A (with the two values 'bar' and 'foo'); the second level is the original index.

normed.index

Out[35]:

MultiIndex(levels=[['bar', 'foo'], [0, 1, 2, 3, 4, 5, 6, 7]],
           labels=[[0, 0, 0, 1, 1, 1, 1, 1], [1, 3, 5, 0, 2, 4, 6, 7]],
           names=['A', None])

You probably want to join on the original index, so you must drop the first level of the new index

normed.index = normed.index.droplevel()

before joining:

df.join(normed, rsuffix='_normed')
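The two steps above can be put together in a small runnable sketch. The fixed values in column C and the helper name min_max are made up for illustration (the question uses random data), and the MultiIndex check is there because, as far as I know, whether groupby-apply prepends the group key to a same-shaped result depends on the pandas version.

```python
import pandas as pd

# Small fixed frame standing in for the question's random data (values are made up)
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "C": [0.5, 0.6, -1.6, 0.2, 0.3, 0.0, 1.3, 0.4],
})


def min_max(s):
    """Scale a Series to the [0, 1] range (hypothetical helper)."""
    return (s - s.min()) / (s.max() - s.min())


normed = df.groupby("A")["C"].apply(min_max)

# Drop the group-key level only if this pandas version added it
if isinstance(normed.index, pd.MultiIndex):
    normed.index = normed.index.droplevel(0)

# Without `on=`, join matches rows by index label.
# Renaming the Series avoids the column-name clash instead of using rsuffix.
joined = df.join(normed.rename("C_normed"))
print(joined)
```

Because the join is on index labels, it does not matter that the grouped result lists the bar rows before the foo rows.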

answered Nov 14 '16 at 16:10

Maarten Fabré


Nice one, I did not know the `.droplevel()` method – MMF Nov 14 '16 at 16:13
This is a neat solution. I had thought you'd need to `.reset_index()` on `normed` and then do some fancy layout changes. This is a nice simple approach to reuse the original index. – Phil Sheard Nov 14 '16 at 16:14

score 2 · Answer 2 · answered Nov 14 '16 at 16:14

The simplest way is to call reset_index on normed, dropping the group level:

normed = df.groupby('A').apply(lambda x: (x['C']-min(x['C']))/(max(x['C'])-min(x['C'])))
normed = normed.reset_index(level=0, drop=True)

Now simply add normed as a column to df; the assignment aligns on the index:

df['normed'] = normed
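A minimal sketch of why this assignment is safe, assuming made-up values for column C: assigning a Series to a column aligns on index labels, so the row order of normed is irrelevant. The scrambling step below is only there to demonstrate that and is not part of the answer.

```python
import pandas as pd

# Fixed stand-in data (values are made up for illustration)
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "C": [0.5, 0.6, -1.6, 0.2, 0.3, 0.0, 1.3, 0.4],
})

normed = df.groupby("A")["C"].apply(
    lambda s: (s - s.min()) / (s.max() - s.min())
)

# Remove the group-key level if this pandas version added one
if isinstance(normed.index, pd.MultiIndex):
    normed = normed.reset_index(level=0, drop=True)

# Deliberately scramble the row order; assignment still aligns on index labels
scrambled = normed.sort_values()
df["normed"] = scrambled
print(df)
```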

JohnE · Answer 3 · 2016-11-14T21:19:24.363

2

Actually, there is a very easy solution. When groupby is doing a one-for-one operation (rather than a reduction), you can use transform and the indexing is already taken care of for you:

df['c_normed'] = df.groupby('A')['C'].transform(lambda x: (x - min(x)) / (max(x) - min(x)))

Also note that the code is a bit cleaner if you use df.groupby('A')['C'] because then you can just use x instead of x['C'] inside the lambda. And also in this case using x['C'] works with apply but not transform (I am not sure why...).
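As a rough illustration with made-up values for column C, the sketch below runs the transform version and also a variant that passes the reduction names "min" and "max" to transform, which broadcasts each group's result back to its rows. The variant is my own addition, not part of the answer above.

```python
import numpy as np
import pandas as pd

# Fixed stand-in data (values are made up for illustration)
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "C": [0.5, 0.6, -1.6, 0.2, 0.3, 0.0, 1.3, 0.4],
})

g = df.groupby("A")["C"]

# transform returns a Series indexed like df, so no index fix-up is needed
via_lambda = g.transform(lambda x: (x - x.min()) / (x.max() - x.min()))

# Variant: broadcast each group's min and max back to its rows, then
# do the arithmetic on whole columns
lo = g.transform("min")
hi = g.transform("max")
via_builtin = (df["C"] - lo) / (hi - lo)

df["c_normed"] = via_lambda
print(df)
print(np.allclose(via_lambda, via_builtin))
```

Since x is a Series here, x['C'] inside the lambda would be a label lookup on that Series rather than a column selection, which may be part of why it does not work with this form.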

edited Nov 14 '16 at 21:19

answered Nov 14 '16 at 16:46

JohnE


Doesn't the groupby distort the order? To be safe you could add a `.sort_index(level=1)` – Maarten Fabré Nov 14 '16 at 20:38
@MaartenFabré Thanks, you are correct. I have updated my answer so that is no longer a problem. – JohnE Dec 07 '16 at 15:41

score 1 · Answer 4 · answered Nov 14 '16 at 16:09

You can do the following:

# Get (index, value) tuples for each group
# (list() is needed on Python 3, where zip returns an iterator)
foo = list(zip(normed['foo'].index, normed['foo'].values))
bar = list(zip(normed['bar'].index, normed['bar'].values))

# Merge the two lists
foo.extend(bar) # merged list is now in foo

# Sort the list by original index
new_list = sorted(foo, key=lambda x: x[0])

# Create new column in dataframe
index, values = zip(*new_list) # unzip
df['New_column'] = values

Output

Out[85]: 
     A      B         C         D  New_column
0  foo    one  0.039683 -0.041559    0.638594
1  bar    one -0.090650 -2.316097    0.000000
2  foo    two  0.024210  0.616764    0.629815
3  bar  three  0.142740  0.156198    0.450339
4  foo    two -1.085916 -0.432832    0.000000
5  bar    two  0.427604 -1.154850    1.000000
6  foo    one -0.156424  0.037188    0.527335
7  foo  three  0.676706 -1.336921    1.000000

NB : Maybe there is a cleverer way to do this.

score 1 · Answer 5 · answered Nov 14 '16 at 16:12

You first have to get rid of the first level of the MultiIndex created by groupby (i.e. 'foo' and 'bar').

Adding the following code should work:

normed = normed.reset_index(level=0)
del normed['A']
normed.rename(columns={'C':'C_normed'}, inplace=True)
pd.concat([df, normed], axis=1)

Result:

     A      B         C         D  C_normed
0  foo    one  1.697923  0.656727  1.000000
1  bar    one -0.626052 -0.466088  0.000000
2  foo    two -0.501440  1.080408  0.000000
3  bar  three  0.731791 -1.531915  1.000000
4  foo    two -0.202666  0.275042  0.135846
5  bar    two -0.340455 -0.737039  0.210332
6  foo    one  0.506664  1.049853  0.458362
7  foo  three -0.358317 -0.598262  0.065075
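One detail worth spelling out about the concat approach: pd.concat returns a new frame and leaves df untouched, so the result has to be assigned to keep it. A minimal sketch, assuming made-up values for column C and a slightly shorter route to the renamed Series than the code above:

```python
import pandas as pd

# Fixed stand-in data (values are made up for illustration)
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "C": [0.5, 0.6, -1.6, 0.2, 0.3, 0.0, 1.3, 0.4],
})

normed = df.groupby("A")["C"].apply(
    lambda s: (s - s.min()) / (s.max() - s.min())
)

# Remove the group-key level if this pandas version added one
if isinstance(normed.index, pd.MultiIndex):
    normed = normed.reset_index(level=0, drop=True)

# axis=1 places the objects side by side, matching rows by index label;
# the result is a new frame, so keep a reference to it
result = pd.concat([df, normed.rename("C_normed")], axis=1)
print(result)
```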
