Replacing values with groupby means

Question

I have a DataFrame with a column that has some bad data with various negative values. I would like to replace values < 0 with the mean of the group that they are in.

For missing values as NAs, I would do:

data = df.groupby(['GroupID']).column
data.transform(lambda x: x.fillna(x.mean()))

But how to do this operation on a condition like x < 0?

Thanks!

unutbu · Accepted Answer · 2013-02-07T22:05:41.650

13

Using @AndyHayden's example, you could use groupby/transform with replace:

df = pd.DataFrame([[1,1],[1,-1],[2,1],[2,2]], columns=list('ab'))
print(df)
#    a  b
# 0  1  1
# 1  1 -1
# 2  2  1
# 3  2  2

data = df.groupby(['a'])
def replace(group):
    mask = group<0
    # Select those values where it is < 0, and replace
    # them with the mean of the values which are not < 0.
    group[mask] = group[~mask].mean()
    return group
print(data.transform(replace))
#    b
# 0  1
# 1  1
# 2  1
# 3  2

edited Feb 07 '13 at 22:05

answered Feb 07 '13 at 21:58

unutbu

842,883
184
1,785
1,677

I was thinking something similar -- `.transform(lambda x: x.where(x>=0).fillna(x[x>=0].mean()))` but didn't like the repetition of the condition. Your approach bypasses that nicely. The pattern seems common enough that I wonder if `pandas` should grow a built-in way to support it. – DSM Feb 07 '13 at 22:03
@DSM I agree it would be nice to see some way to do this (this numpy-foo is impressive!) :) – Andy Hayden Feb 07 '13 at 22:10
very nice. I haven't seen the use of [~mask] - does the tilda just mean `not mask`? – zach Feb 07 '13 at 23:22
@zach: `mask` is a `Series`, which is a subclass of numpy's `ndarray` class. The tilde is the [invert operator](http://docs.scipy.org/doc/numpy/reference/generated/numpy.invert.html#numpy.invert) when applied to a numpy ndarray. Since `mask` is of dtype `bool` (i.e. a boolean array) the inversion is done bit-wise for each element in the array. `not mask` has a different meaning. This asks Python to reduce `mask` to its boolean value as a whole object and then take the negation. Numpy arrays refuse to be characterized as True or False. `not mask` raises a `ValueError`. – unutbu Feb 08 '13 at 01:37
Word of warning: Your custom function should not modify the group in-place. It should make a copy, and modify that instead. See https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.transform.html – Nick Crews Feb 14 '22 at 23:16
What does the ~ sign mean in front of the mask parameter? – Mark Molina Jun 14 '23 at 11:21

score 3 · Answer 2 · answered Feb 07 '13 at 21:51

Here's one way to do it (for the 'b' column, in this boring example):

In [1]: df = pd.DataFrame([[1,1],[1,-1],[2,1],[2,2]], columns=list('ab'))
In [2]: df
Out[2]: 
   a  b
0  1  1
1  1 -1
2  2  1
3  2  2

Replace those negative values with NaN, and then calculate the mean (b) in each group:

In [3]: df['b'] = df.b.apply(lambda x: x if x>=0 else pd.np.nan)
In [4]: m = df.groupby('a').mean().b

Then use apply across each row, to replace each NaN with its groups mean:

In [5]: df['b'] = df.apply(lambda row: m[row['a']]
                                       if pd.isnull(row['b'])
                                       else row['b'],
                           axis=1) 
In [6]: df
Out[6]: 
   a  b
0  1  1
1  1  1
2  2  1
3  2  2

I see, by applying that lambda function I can make them nans! Then couldn't I just use the `.fillna` line I wrote in my question? Your second `.apply` seems unnecessary. — Def_Os, Feb 07 '13 at 22:01
@Def_Os I couldn't get that line to work for me, but perhaps that is a better way :) — Andy Hayden, Feb 07 '13 at 22:04

solub · Answer 3 · 2017-11-14T15:58:44.883

2

I had the same issue and came up with a rather simple solution

func = lambda x : np.where(x < 0, x.mean(), x)

df['Bad_Column'].transform(func)

Note that if you want to return the mean of the correct values (mean based on positive values only) you'd have to specify:

func = lambda x : np.where(x < 0, x.mask(x < 0).mean(), x)

edited Nov 14 '17 at 15:58

answered Nov 14 '17 at 11:42

solub

1,291
17
40

score 1 · Answer 4 · answered May 25 '17 at 21:48

1

There is a great Example, for your additional question.

df = pd.DataFrame({'A' : [1, 1, 2, 2], 'B' : [1, -1, 1, 2]})
gb = df.groupby('A')
def replace(g):
   mask = g < 0
   g.loc[mask] = g[~mask].mean()
   return g
gb.transform(replace)

Link: http://pandas.pydata.org/pandas-docs/stable/cookbook.html

answered May 25 '17 at 21:48

BENY

317,841
20
164
234

Ahh, I understand. Your answer doesn't generalize to to more complicated dataframes. Doesn't matter, very smart! I'm going to continue playing around. – piRSquared Sep 25 '18 at 16:10

Replacing values with groupby means

4 Answers4