12

I have the following dataframe:

import numpy as np 
import pandas as pd
df = pd.DataFrame(data={'Cat' : ['A', 'A', 'A','B', 'B', 'A', 'B'],
                        'Vals' : [1, 2, 3, 4, 5, np.nan, np.nan]})

    Cat   Vals
0   A     1
1   A     2
2   A     3
3   B     4
4   B     5
5   A   NaN
6   B   NaN

And I want indexes 5 and 6 to be filled with the conditional mean of 'Vals' based on the 'Cat' column, namely 2 and 4.5

The following code works fine:

means = df.groupby('Cat').Vals.mean()
for i in df[df.Vals.isnull()].index:
    df.loc[i, 'Vals'] = means[df.loc[i].Cat]

    Cat   Vals
0   A     1
1   A     2
2   A     3
3   B     4
4   B     5
5   A     2
6   B   4.5

But I'm looking for something nicer, like

df.Vals.fillna(df.Vals.mean(Conditionally to column 'Cat'))

Edit: I found this, which is one line shorter, but I'm still not happy with it:

means = df.groupby('Cat').Vals.mean()
df.Vals = df.apply(lambda x: means[x.Cat] if pd.isnull(x.Vals) else x.Vals, axis=1)
Kartik
  • 8,347
  • 39
  • 73
Niourf
  • 450
  • 1
  • 4
  • 15

1 Answers1

7

We wish to "associate" the Cat values with the missing NaN locations. In Pandas such associations are always done via the index. So it is natural to set Cat as the index:

df = df.set_index(['Cat'])

Once this is done, then fillna works as desired:

df['Vals'] = df['Vals'].fillna(means)

To return Cat to a column, you could then of course use reset_index:

df = df.reset_index()

import pandas as pd
import numpy as np
df = pd.DataFrame(
    {'Cat' : ['A', 'A', 'A','B', 'B', 'A', 'B'], 
     'Vals' : [1, 2, 3, 4, 5, np.nan, np.nan]})

means = df.groupby(['Cat'])['Vals'].mean()
df = df.set_index(['Cat'])
df['Vals'] = df['Vals'].fillna(means)
df = df.reset_index()
print(df)

yields

  Cat  Vals
0   A   1.0
1   A   2.0
2   A   3.0
3   B   4.0
4   B   5.0
5   A   2.0
6   B   4.5
unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
  • 12
    Not worth a separate answer, but `df["Vals"].fillna(df.groupby("Cat")["Vals"].transform("mean"))` would avoid the need to change index. – DSM Oct 31 '15 at 23:07
  • 1
    I am wondering why you set `Cat` as the index? It seems df['Vals'] = df['Vals'].fillna(means) will produce the same result without setting and resetting the index. – Joe T. Boka Oct 31 '15 at 23:18
  • 1
    @JoeR: It won't because Cat takes values 'A' and 'B'. The asker wants to fill `nan` against A (or B) with the mean obtained from the values against A (or B) – Kartik Nov 01 '15 at 00:44
  • @Kartik Just run the code and you will see, the results are the same. – Joe T. Boka Nov 01 '15 at 00:57
  • @DSM That's exactly what I was looking for (and it's faster), I didn't know the use of `transform`. Thanks! – Niourf Nov 01 '15 at 09:47
  • @DSM: Great answer! Sorry by mistake i removed upvote, can't upvote again. – r_hudson May 28 '19 at 22:20
  • @DSM your comment is worth a separate answer, very helpful! – charelf May 01 '21 at 14:37