2

I have a dataFrame where 'value'column has missing values. I'd like to filling missing values by weighted average within each 'name' group. There was post on how to fill the missing values by simple average in each group but not weighted average. Thanks a lot!

df = pd.DataFrame({'value': [1, np.nan, 3, 2, 3, 1, 3, np.nan, np.nan],'weight':[3,1,1,2,1,2,2,1,1], 'name': ['A','A', 'A','B','B','B', 'C','C','C']})


   name  value  weight
0    A    1.0       3
1    A    NaN       1
2    A    3.0       1
3    B    2.0       2
4    B    3.0       1
5    B    1.0       2
6    C    3.0       2
7    C    NaN       1
8    C    NaN       1

I'd like to fill in "NaN" with weighted value in each "name" group, i.e.

   name  value  weight
0    A    1.0       3
1    A    1.5       1
2    A    3.0       1
3    B    2.0       2
4    B    3.0       1
5    B    1.0       2
6    C    3.0       2
7    C    3.0       1
8    C    3.0       1
Chao Chen
  • 23
  • 4

1 Answers1

4

You can group data frame by name, and use fillna method to fill the missing values with weighted average which can calculated with np.average with weights parameter:

df['value'] = (df.groupby('name', group_keys=False)
                 .apply(lambda g: g.value.fillna(np.average(g.dropna().value, weights=g.dropna().weight))))

df
#name   value   weight
#0  A    1.0    3
#1  A    1.5    1
#2  A    3.0    1
#3  B    2.0    2
#4  B    3.0    1
#5  B    1.0    2
#6  C    3.0    2
#7  C    3.0    1
#8  C    3.0    1

To make this less convoluted, define a fillValue function:

import numpy as np
import pandas as pd

def fillValue(g):
    gNotNull = g.dropna()
    wtAvg = np.average(gNotNull.value, weights=gNotNull.weight)
    return g.value.fillna(wtAvg)

df['value'] = df.groupby('name', group_keys=False).apply(fillValue)
Psidom
  • 209,562
  • 33
  • 339
  • 356
  • Thanks for the quick reply! I was trying to use transform and get nowhere. – Chao Chen Jan 21 '17 at 18:56
  • Normally you can use transform if the calculation only involves one single column, but you are calculating weighted averages here. So you need apply to access more than one column. – Psidom Jan 21 '17 at 18:58
  • Yes, exactly. Pretty new to python, so lots to pick up. Thanks! – Chao Chen Jan 21 '17 at 19:00