
I am trying to get rid of NaN values in a dataframe. Instead of filling NaN with averages or doing ffill, I want to fill missing values according to the distribution of values inside a column. In other words, if a column has 120 rows, of which 20 are NaN, 80 contain 1.0, and 20 contain 0.0, I want to fill 80% of the NaN values with 1.0 and the rest with 0.0. Note that the column contains floats.

I made a function to do so:

def fill_cr_hist(x):
    if x is pd.np.nan:
        r = random.random()
        if r > 0.80:
            return 0.0
        else:
            return 1.0
    else:
        return x

However, when I call the function, it does not change the NaN values.

df['Credit_History'] = df['Credit_History'].apply(fill_cr_hist)

I tried filling the NaN values with pd.np.nan first, but it didn't change anything.

df['Credit_History'].fillna(value=pd.np.nan, inplace=True)
df['Credit_History'] = df['Credit_History'].apply(fill_cr_hist)

Another function I wrote is almost identical and works fine. In that case the column contains strings.

def fill_self_emp(x):
    if x is pd.np.nan:
        r = random.random()
        if r > 0.892442:
            return 'Yes'
        else:
            return 'No'
    else:
        return x
ctrl-alt-delor

1 Answer

import numpy as np
import pandas as pd

ser = pd.Series([1, 1, np.nan, 0, 0, 1, np.nan, 1, 1, np.nan, 0, 0, np.nan])

Use value_counts with normalize=True to get a list of probabilities corresponding to your values. Then generate values randomly according to the given probability distribution and use fillna to fill NaNs.

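# Probability of each observed value, ordered by value
p = ser.value_counts(normalize=True).sort_index().tolist()
# Unique observed values, sorted the same way so they line up with p
u = np.sort(ser.dropna().unique())
# Draw one candidate per row; fillna only uses the draws at NaN positions
ser = ser.fillna(pd.Series(np.random.choice(u, len(ser), p=p), index=ser.index))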

This solution should work for any number of numeric/categorical values, not just 0s and 1s. If the data is of string type, use pd.factorize to convert it to numeric codes first.
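For a string column, the pd.factorize route could look roughly like the sketch below. The function name fill_categorical_from_distribution and its rng parameter are made up for illustration, and the sketch assumes the column has at least one non-missing value. It relies on pd.factorize giving missing entries the code -1 by default.

```python
import numpy as np
import pandas as pd


def fill_categorical_from_distribution(ser, rng=None):
    """Fill missing entries by sampling from the observed value frequencies.

    Hypothetical helper, not part of pandas.
    """
    rng = np.random.default_rng() if rng is None else rng

    # codes: integer label per row (-1 for missing); uniques: the labels' values
    codes, uniques = pd.factorize(ser)
    observed = codes >= 0

    # Relative frequency of each code among the non-missing rows
    probs = np.bincount(codes[observed], minlength=len(uniques)) / observed.sum()

    # Sample one code per missing row, then map the codes back to values
    n_missing = int((~observed).sum())
    draws = rng.choice(len(uniques), size=n_missing, p=probs)

    out = ser.copy()
    out.loc[~observed] = uniques.to_numpy()[draws]
    return out


# Example with a small string column (values chosen for illustration only)
self_emp = pd.Series(["No", "Yes", np.nan, "No", np.nan, "No"])
filled_self_emp = fill_categorical_from_distribution(self_emp)
```

Only the missing rows are sampled here, so no draws are discarded, and the original index is kept because the values are written back through a boolean mask.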


Details

First, compute the probability distribution:

ser.value_counts(normalize=True).sort_index()

0.0    0.444444
1.0    0.555556
dtype: float64

Get a list of unique values, sorted in the same way:

np.sort(ser.dropna().unique())
array([0., 1.])

Finally, generate random values with the specified probability distribution (the output varies from run to run):

pd.Series(np.random.choice(u, len(ser), p=p))

0     0.0
1     0.0
2     1.0
3     0.0
4     0.0
5     0.0
6     1.0
7     1.0
8     0.0
9     0.0
10    1.0
11    0.0
12    1.0
dtype: float64
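As for why fill_cr_hist from the question leaves the NaNs untouched: the likely culprit is the test `x is pd.np.nan`. The `is` operator checks object identity, not value. Values handed to the function from a float64 column are typically fresh float objects rather than the np.nan object itself, so the test is False. In an object (string) column the stored missing value is often that very object, which would explain why fill_self_emp works. Testing with pd.isna avoids the problem. The sketch below keeps the question's logic; the cutoff parameter is added for illustration. Also note that pd.np was deprecated in pandas 1.0 and removed in 2.0, so importing numpy directly is safer.

```python
import random

import numpy as np
import pandas as pd


def fill_cr_hist(x, cutoff=0.80):
    # pd.isna compares by value, so it catches every NaN,
    # not only the one specific np.nan object.
    if pd.isna(x):
        r = random.random()
        # Same rule as in the question: 0.0 when r > cutoff, else 1.0
        return 0.0 if r > cutoff else 1.0
    return x


# A NaN pulled out of a float array is a different object from np.nan
arr = np.array([1.0, np.nan])
same_object = arr[1] is np.nan   # identity check
is_missing = pd.isna(arr[1])     # value-based check

# Small stand-in for df['Credit_History'] (values are illustrative)
col = pd.Series([1.0, np.nan, 0.0, np.nan, 1.0])
filled = col.apply(fill_cr_hist)
```

Comparing same_object with is_missing shows the difference between the two checks on the same value.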
cs95