I am trying to get rid of NaN values in a dataframe. Instead of filling NaN with averages or using ffill, I want to fill missing values according to the distribution of values inside a column. In other words, if a column has 120 rows, of which 20 are NaN, 80 contain 1.0 and 20 contain 0.0, I want to fill 80% of the NaN values with 1.0 and the remaining 20% with 0.0. Note that the column contains floats.
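To make the target proportions concrete, here is a minimal sketch that measures the distribution of the non-missing values. The Series `s` is a hypothetical stand-in built to match the numbers above, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical column matching the example: 80 ones, 20 zeros, 20 NaN.
s = pd.Series([1.0] * 80 + [0.0] * 20 + [np.nan] * 20, name="Credit_History")

# value_counts() skips NaN by default; normalize=True returns proportions
# of the non-missing values instead of raw counts.
dist = s.value_counts(normalize=True)
print(dist)
```

The proportions are computed over the 100 non-missing rows only, which is where the 80% figure comes from.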
I made a function to do so:
def fill_cr_hist(x):
    if x is pd.np.nan:
        r = random.random()
        if r > 0.80:
            return 0.0
        else:
            return 1.0
    else:
        return x
However, when I call the function, it does not change the NaN values.
df['Credit_History'] = df['Credit_History'].apply(fill_cr_hist)
I tried filling the NaN values with pd.np.nan first, but it didn't change anything.
df['Credit_History'].fillna(value=pd.np.nan, inplace=True)
df['Credit_History'] = df['Credit_History'].apply(fill_cr_hist)
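One possible explanation, offered as an assumption rather than a confirmed diagnosis: `is` tests object identity, and the values that `apply` hands to the function from a float column are generally not the same object as `np.nan`, whereas an object (string) column can hold the `np.nan` object itself. The sketch below uses small hypothetical Series and imports NumPy directly, since the `pd.np` alias is not available in recent pandas versions.

```python
import numpy as np
import pandas as pd

# A NaN produced elsewhere is a different object from np.nan,
# so an identity check fails even though both are NaN.
other_nan = float("nan")
print(other_nan is np.nan)   # identity comparison
print(pd.isna(other_nan))    # value-based missing check

# Hypothetical float column: compare the two checks element by element.
floats = pd.Series([1.0, np.nan, 0.0])
by_identity = floats.apply(lambda x: x is np.nan)
by_isna = floats.apply(pd.isna)
print(by_identity.tolist(), by_isna.tolist())

# Hypothetical object column: it stores a reference to np.nan itself,
# which is why the identity check can appear to work for strings.
strings = pd.Series(["Yes", np.nan, "No"], dtype=object)
print(strings.apply(lambda x: x is np.nan).tolist())
```

Under that assumption, replacing the `x is pd.np.nan` test with `pd.isna(x)` should make the check behave the same way for both column types.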
Another function I wrote is almost identical and works fine. In that case the column contains strings.
def fill_self_emp(x):
    if x is pd.np.nan:
        r = random.random()
        if r > 0.892442:
            return 'Yes'
        else:
            return 'No'
    else:
        return x
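Both functions hard-code a threshold for one specific column. As a rough sketch of a more general alternative, the hypothetical helper `fill_from_distribution` below (its name and `seed` parameter are made up for illustration) draws replacements from the observed value frequencies, so the same code covers float and string columns.

```python
import numpy as np
import pandas as pd

def fill_from_distribution(s, seed=None):
    """Fill NaN in a Series by sampling from its observed value frequencies."""
    rng = np.random.default_rng(seed)
    missing = s.isna()
    # Proportions of the non-missing values (NaN is excluded by default).
    dist = s.value_counts(normalize=True)
    out = s.copy()
    if missing.any() and not dist.empty:
        # Draw one replacement per missing row, weighted by frequency.
        out[missing] = rng.choice(
            dist.index.to_numpy(), size=int(missing.sum()), p=dist.to_numpy()
        )
    return out

# Hypothetical data mirroring the float column described above.
credit = pd.Series([1.0] * 80 + [0.0] * 20 + [np.nan] * 20)
filled = fill_from_distribution(credit, seed=0)
print(filled.isna().sum())
```

It could be applied as `df['Credit_History'] = fill_from_distribution(df['Credit_History'])`. Because the values are sampled, the filled proportions only approximate the observed ones, especially when few rows are missing.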