4

I've seen this and this thread here, but something else is wrong.

I have a very large pandas DataFrame, with many Na/NaN values. I want to replace them with the median value for that feature.

So, I first make a table that displays the Na values per feature, sorted by most Na values, then use fillna(), and then display that table again. Ideally, the second time, that table should have all 0's, because all the Na's have been filled.

nullCount = pd.DataFrame(TT_df.isnull().sum(),columns=["nullcount"]).sort_values(by="nullcount",ascending=False)
display(nullCount.head(10))

TT_df = TT_df.fillna(TT_df.median())

nullCount = pd.DataFrame(TT_df.isnull().sum(),columns=["nullcount"]).sort_values(by="nullcount",ascending=False)
display(nullCount.head(10))

However, I get these two tables:

null count tables, before and after

and if I take a look at the DataFrame, you can see NaN's in it:

display(TT_df[nullCount.index.tolist()[0:5]].head(50))

NaN examples

It seems like a common problem with fillna() is that it returns a copy, unless you use inplace=True (like in the linked threads above), but I'm not doing that: I'm overwriting TT_df, unless I'm misunderstanding something. You can see that the LotFrontage feature actually does disappear from the second table, implying that the fillna() did work for it. So why isn't it working for the others?

What I suspect is the culprit, though I don't know why, is that Na doesn't actually mean Na for these features: if I look at the data description file, it says:

GarageFinish: Interior finish of the garage

   Fin    Finished
   RFn    Rough Finished  
   Unf    Unfinished
   NA No Garage

Okay, that's fine. But it feels like those NA values should either count as Na for both isnull() and fillna(), or not count for either. Why does it appear to be counted by isnull() but not fillna()?

jpp
  • 159,742
  • 34
  • 281
  • 339
GrundleMoof
  • 289
  • 3
  • 11

1 Answers1

4

The problem is with this line:

TT_df = TT_df.fillna(TT_df.median())

Your dataframe has strings and you are attempting to calculate medians on strings. This doesn't work.

Here's a minimal example:

import pandas as pd, numpy as np

df = pd.DataFrame({'A': ['A', 'B', np.nan, 'B']})

df = df.fillna(df.median())

print(df)

     A
0    A
1    B
2  NaN
3    B

What you should do is fillna with median only for numeric columns:

for col in df.select_dtypes(include=np.number):
    df[col] = df[col].fillna(df[col].median())
jpp
  • 159,742
  • 34
  • 281
  • 339
  • d'oh! Thanks, that makes sense. For some reason I stupidly thought it was replacing it with the *mode*, but that's obviously not the case. Would that be a reasonable "filler" for non numeric features? Also, is there a good reason people usually fill numeric Na's with the median, rather than the mean? thanks! – GrundleMoof May 09 '18 at 19:10
  • 1
    @GrundleMoof, No, `mode` only works for numeric columns too. `median` vs `mean` should be chosen on a case-by-case basis, depending on how you use the data. – jpp May 09 '18 at 19:40
  • 1
    @jpp that's a very smart way of picking `mean` vs `median`. Even for `mode` – look Feb 12 '19 at 23:07