Pandas: How to fill null values with mean of a groupby?

Question

I have a dataset with some missing data that looks like this:

id    category     value
1     A            NaN
2     B            NaN
3     A            10.5
4     C            NaN
5     A            2.0
6     B            1.0

I need to fill in the nulls to use the data in a model. The first time a category occurs, its value is NULL. For categories like A and B that have more than one row, I want to replace the nulls with the average of that category. For a category like C that occurs only once, I want to fill in the average of the rest of the data.

I know I can handle cases like C by filling with the average of all rows, as in the line below, but I'm stuck on computing the category-wise means for A and B and using them to replace the nulls.

df['value'] = df['value'].fillna(df['value'].mean())

I need the final df to look like this:

id    category     value
1     A            6.25
2     B            1.0
3     A            10.5
4     C            4.15
5     A            2.0
6     B            1.0

score 13 · Accepted Answer · answered Oct 28 '16 at 06:15

You can use groupby and apply fillna with each group's mean. A category that contains only NaN values will still be NaN afterwards, so fill those remaining values with the mean of the whole column:

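# group_keys=False keeps the original index so the result aligns with df
df['value'] = df.groupby('category', group_keys=False)['value'].apply(lambda x: x.fillna(x.mean()))
# categories with only NaN are still NaN here; fill them with the overall column mean
df['value'] = df['value'].fillna(df['value'].mean())
print(df)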
   id category  value
0   1        A   6.25
1   2        B   1.00
2   3        A  10.50
3   4        C   4.15
4   5        A   2.00
5   6        B   1.00
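For anyone who wants to reproduce the output above, here is a minimal, self-contained sketch. It rebuilds the question's table by hand (the DataFrame construction is my own assumption about how the data is loaded) and passes group_keys=False, which, as far as I know, is what keeps the result on the original row index in recent pandas versions.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; NaN marks the missing values.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "category": ["A", "B", "A", "C", "A", "B"],
    "value": [np.nan, np.nan, 10.5, np.nan, 2.0, 1.0],
})

# Step 1: fill each NaN with the mean of its own category.
# group_keys=False keeps the original row index so the result aligns with df.
df["value"] = df.groupby("category", group_keys=False)["value"].apply(
    lambda s: s.fillna(s.mean())
)

# Step 2: category C has no observed values, so its group mean is NaN.
# Fall back to the mean of the column as it stands after step 1.
df["value"] = df["value"].fillna(df["value"].mean())

print(df)
```

Note that the order of the two steps matters. C receives 4.15, the column mean after the group fill, which matches the question's expected output. Filling with the mean of the original observed values (10.5, 2.0, 1.0) would give 4.5 instead.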

answered Oct 28 '16 at 06:15

jezrael

Great help. Is there any way to do this for many columns in pandas instead of the single column 'value'? – mari Oct 26 '18 at 11:58

@Mari - Use `df = df.groupby('category').apply(lambda x: x.fillna(x.mean())).reset_index(drop=True)` – jezrael Oct 26 '18 at 12:02
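Depending on the pandas version, the apply-based one-liner above may reorder rows or trip over the non-numeric category column when taking the mean. A possible alternative for several columns is sketched below, using transform on an explicit list of numeric columns. The second column "value2" and the list name cols are hypothetical, added only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column example; "value2" is not part of the original question.
df = pd.DataFrame({
    "category": ["A", "B", "A", "C", "A", "B"],
    "value": [np.nan, np.nan, 10.5, np.nan, 2.0, 1.0],
    "value2": [1.0, np.nan, 3.0, np.nan, np.nan, 5.0],
})

cols = ["value", "value2"]  # numeric columns to fill

# Per-category means, broadcast back to the original row order for each column.
group_means = df.groupby("category")[cols].transform("mean")
df[cols] = df[cols].fillna(group_means)

# A category that was entirely NaN in a column is still NaN there;
# fall back to that column's overall mean.
df[cols] = df[cols].fillna(df[cols].mean())

print(df)
```

Listing the columns explicitly keeps the string column out of the mean calculation and leaves the row order untouched, so no reset_index is needed.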

score 9 · Answer 2 · answered Aug 10 '18 at 00:15

You can also use GroupBy + transform to fill NaN values with groupwise means. This method avoids inefficient apply + lambda. For example:

df['value'] = df['value'].fillna(df.groupby('category')['value'].transform('mean'))
df['value'] = df['value'].fillna(df['value'].mean())
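To see why this works, it may help to look at the intermediate result. The sketch below rebuilds the question's data by hand (an assumption about how it is loaded) and prints what transform('mean') returns; the names group_mean and filled are illustrative only.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "category": ["A", "B", "A", "C", "A", "B"],
    "value": [np.nan, np.nan, 10.5, np.nan, 2.0, 1.0],
})

# transform('mean') returns one value per row (not one per group),
# indexed like df, so it can be passed straight to fillna.
group_mean = df.groupby("category")["value"].transform("mean")
print(group_mean.tolist())  # [6.25, 1.0, 6.25, nan, 6.25, 1.0]

# Fill from the row-aligned group means, then fall back to the column mean
# for category C, whose group mean is NaN.
filled = df["value"].fillna(group_mean)
filled = filled.fillna(filled.mean())
print(filled.tolist())
```

Because the group means are already aligned to the original index, fillna only touches the rows that were NaN and leaves the observed values as they are.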

answered Aug 10 '18 at 00:15

jpp

thanks for this, was trying to speed up some of my ETL workflows and this worked a treat. – Umar.H Jun 21 '19 at 10:53
