2

Apologies if this is a repeat, I didn't find a similar answer.

Big picture: I have a df with NaN values which I would like to replace with an imputed median value for that column. However, the built-in imputers in sklearn that I found use the median (or whatever metric) from the entire column. My data has labels and I would like to replace each NaN value with the median value for that column from other samples belonging to that label only.

I can do this by splitting the df into one df for each label, imputing over each of those dfs, and combining, but this logic doesn't scale well. I could have up to 20 classes, and I fundamentally don't believe this is the 'right' way to do it.

I would like to do this without copying my df, by using a groupby object in a split-apply-combine technique (or another technique you think would work). I appreciate your help.

Example df:

   r1  r2   r3   label
0  12  NaN  58       0
1  34  52   24       1
2  32   4  NaN       1
3   7  89    2       0
4  22  19   12       1

Here, I would like the NaN value at (0, r2) to equal the median of that column for label 0, which is the value 89 (from 3, r2).

I would like the NaN value at (2,r3) to equal the median of that column for label 1, which is median(24, 12), or 18.

Example successful result:

   r1  r2   r3   label
0  12   89  58       0
1  34  52   24       1
2  32   4   18       1
3   7  89    2       0
4  22  19   12       1
user7748226
  • 25
  • 1
  • 5

1 Answers1

0
In [158]: df.groupby('label', group_keys=False) \
            .apply(lambda x: x.fillna(x.median()).astype(int))
Out[158]:
   r1  r2  r3  label
0  12  89  58      0
3   7  89   2      0
1  34  52  24      1
2  32   4  18      1
4  22  19  12      1

or using transform:

In [149]: df[['label']].join(df.groupby('label')
                               .transform(lambda x: x.fillna(x.median())).astype(int))
Out[149]:
   label  r1  r2  r3
0      0  12  89  58
1      1  34  52  24
2      1  32   4  18
3      0   7  89   2
4      1  22  19  12
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
  • this is great, works perfectly, and very elegant. as a note to others, I needed to specify astype(str) since my labels are strings. This converted my numbers to strings, which I converted back to numeric type using df.convert_objects(convert_numeric=True) – user7748226 Jul 26 '17 at 23:23