8

I have a DataFrame with four columns (A, B, C, D). Column D has some NaN entries. I want to fill each NaN with the average of D over the rows that have the same values of A, B and C.

For example, if a row has A, B, C, D equal to x, y, z and NaN respectively, I want the NaN to be replaced by the average of D over all rows where A, B, C equal x, y, z.

Abhisek Dash

3 Answers

19

Using df['D'].fillna(df.groupby(['A','B','C'])['D'].transform('mean')) should be faster than the apply-based approach:

In [2400]: df
Out[2400]:
   A  B  C    D
0  1  1  1  1.0
1  1  1  1  NaN
2  1  1  1  3.0
3  3  3  3  5.0

In [2401]: df['D'].fillna(df.groupby(['A','B','C'])['D'].transform('mean'))
Out[2401]:
0    1.0
1    2.0
2    3.0
3    5.0
Name: D, dtype: float64

In [2402]: df['D'] = df['D'].fillna(df.groupby(['A','B','C'])['D'].transform('mean'))

In [2403]: df
Out[2403]:
   A  B  C    D
0  1  1  1  1.0
1  1  1  1  2.0
2  1  1  1  3.0
3  3  3  3  5.0

Details

In [2396]: df.shape
Out[2396]: (10000, 4)

In [2398]: %timeit df['D'].fillna(df.groupby(['A','B','C'])['D'].transform('mean'))
100 loops, best of 3: 3.44 ms per loop


In [2397]: %timeit df.groupby(['A','B','C'])['D'].apply(lambda x: x.fillna(x.mean()))
100 loops, best of 3: 5.34 ms per loop
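The session above does not show how the 10,000-row frame was built, so here is a minimal sketch for reproducing the comparison. The synthetic data, the seed, and the helper names fill_with_transform and fill_with_apply are assumptions made for illustration, not part of the original answer. Absolute timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical synthetic data: 10,000 rows, small integer keys, roughly 10% NaN in D.
rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "A": rng.integers(0, 5, n),
    "B": rng.integers(0, 5, n),
    "C": rng.integers(0, 5, n),
    "D": rng.normal(size=n),
})
df.loc[rng.random(n) < 0.1, "D"] = np.nan

keys = ["A", "B", "C"]


def fill_with_transform(frame):
    # Broadcast each group's mean back to the original rows, then fill the NaNs.
    return frame["D"].fillna(frame.groupby(keys)["D"].transform("mean"))


def fill_with_apply(frame):
    # group_keys=False asks pandas not to prepend the group keys to the result index.
    grouped = frame.groupby(keys, group_keys=False)["D"]
    return grouped.apply(lambda s: s.fillna(s.mean()))


filled_transform = fill_with_transform(df)
filled_apply = fill_with_apply(df).sort_index()

# Both approaches should produce the same values.
same = np.allclose(filled_transform.to_numpy(), filled_apply.to_numpy(), equal_nan=True)
print("results match:", same)

# Total seconds for 20 runs of each approach.
print("transform:", timeit.timeit(lambda: fill_with_transform(df), number=20))
print("apply:    ", timeit.timeit(lambda: fill_with_apply(df), number=20))
```

The equality check matters more than the raw numbers: it confirms the two approaches are interchangeable on this data before their speed is compared.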
Zero
  • Hi @zero, I am trying to use your method on the planets dataset, but it is not imputing all the values. I am not sure why: https://stackoverflow.com/questions/73449902/fill-in-missing-values-with-groupby/73450241 – Roy Aug 26 '22 at 18:58
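One possible cause of leftover NaN values (not confirmed for the dataset in the comment) is a group in which every D value is missing: its mean is itself NaN, so there is nothing to fill with. The sketch below uses made-up data to show this, plus an optional fallback to the overall mean, which is an assumption and not part of the answer above.

```python
import numpy as np
import pandas as pd

# Made-up data: the group (2, 2, 2) has no observed D values at all.
df = pd.DataFrame({
    "A": [1, 1, 2, 2],
    "B": [1, 1, 2, 2],
    "C": [1, 1, 2, 2],
    "D": [1.0, np.nan, np.nan, np.nan],
})

group_mean = df.groupby(["A", "B", "C"])["D"].transform("mean")
filled = df["D"].fillna(group_mean)

# Rows 2 and 3 stay NaN because their group mean is NaN.
remaining = int(filled.isna().sum())
print(remaining)  # 2

# Optional fallback (an assumption): use the overall mean of D for such groups.
filled_with_fallback = filled.fillna(df["D"].mean())
print(filled_with_fallback.tolist())  # [1.0, 1.0, 1.0, 1.0]
```

Rows whose grouping keys are themselves NaN are another thing worth checking, since groupby drops such keys by default.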
9

I think you need:

df.D = df.groupby(['A','B','C'], group_keys=False)['D'].apply(lambda x: x.fillna(x.mean()))

Sample:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A':[1,1,1,3],
                   'B':[1,1,1,3],
                   'C':[1,1,1,3],
                   'D':[1,np.nan,3,5]})

print (df)
   A  B  C    D
0  1  1  1  1.0
1  1  1  1  NaN
2  1  1  1  3.0
3  3  3  3  5.0

# group_keys=False keeps the original row index on recent pandas versions
df.D = df.groupby(['A','B','C'], group_keys=False)['D'].apply(lambda x: x.fillna(x.mean()))
print (df)
   A  B  C    D
0  1  1  1  1.0
1  1  1  1  2.0
2  1  1  1  3.0
3  3  3  3  5.0
jezrael
2

Link to duplicate of this question for further information: Pandas Dataframe: Replacing NaN with row average

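Another approach mentioned in the link is a simple fillna on the transpose: df.T.fillna(df.mean(axis=1)).T. Note that this fills each NaN with the mean of its own row (across columns), which is not the same as the mean of D within each (A, B, C) group asked for here.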
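To make the difference concrete, here is a small sketch on the sample frame used in the other answers. The variable names row_filled and group_filled are chosen only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 1, 3],
    "B": [1, 1, 1, 3],
    "C": [1, 1, 1, 3],
    "D": [1, np.nan, 3, 5],
})

# Row-wise fill: the NaN in row 1 becomes the mean of that row's other values (1, 1, 1).
row_filled = df.T.fillna(df.mean(axis=1)).T
print(row_filled.loc[1, "D"])  # 1.0

# Group-wise fill: the NaN becomes the mean of D over rows sharing the same A, B, C.
group_filled = df["D"].fillna(df.groupby(["A", "B", "C"])["D"].transform("mean"))
print(group_filled.loc[1])  # 2.0
```

The row-wise version averages the key columns A, B and C into the fill value, so it only suits cases where all columns measure the same kind of quantity.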

Community