I have a DataFrame, df, containing several columns. Some of the values in df are NaN. I want to replace each NaN with a valid value, chosen by randomly sampling from other values in the given column.

For instance, if:

df['work'] = [4, 7, np.nan, 4]

I'd like to replace df['work'][2] with 4 two-thirds of the time and with 7 one-third of the time.

Here's my attempt:

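import numpy as np
import pandas as pd

def resample_fillna(df):
    for col in df.columns:
        # get series consisting of non-NaN values (the sampling pool)
        valid_series = df[col].dropna()
        # index labels of the rows where this column is NaN
        nan_indices = df.index[df[col].isna()]
        for nan_index in nan_indices:
            # draw one value and write it back without chained indexing
            df.loc[nan_index, col] = valid_series.sample(n=1).iloc[0]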

I'm thinking there's a much better, more Pythonic way. Any thoughts?

Thanks!

anon_swe
  • Do you want to replace all the missing values with the same random value or a different random value for each? – Ted Petrou Sep 24 '17 at 13:09
  • This transformer does that automatically: https://feature-engine.readthedocs.io/en/latest/api_doc/imputation/RandomSampleImputer.html – Sole Galli Jan 15 '23 at 17:39

1 Answer

Let's create some fake data and then fill the missing values with values sampled at random from the same column.

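import numpy as np
import pandas as pd

np.random.seed(123)  # fixed seed so the example is reproducible
data = np.random.randint(0, 10, (10, 5))
df = pd.DataFrame(data, columns=list('abcde'))
# keep values greater than 2; everything else becomes NaN
df = df.where(df > 2)
df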

     a    b    c    d    e
0  NaN  NaN  6.0  NaN  3.0
1  9.0  6.0  NaN  NaN  NaN
2  9.0  NaN  NaN  9.0  3.0
3  4.0  NaN  NaN  4.0  NaN
4  7.0  3.0  NaN  4.0  7.0
5  NaN  4.0  8.0  NaN  7.0
6  9.0  3.0  4.0  6.0  NaN
7  5.0  6.0  NaN  NaN  8.0
8  3.0  5.0  NaN  NaN  6.0
9  NaN  4.0  4.0  6.0  3.0

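Now we can process each column with apply: draw len(x) values with replacement from the non-missing entries, and let np.where keep the original value wherever it is not null. Note that apply returns a new DataFrame and leaves df unchanged.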

df.apply(lambda x: np.where(x.isnull(), x.dropna().sample(len(x), replace=True), x))

     a    b    c    d    e
0  5.0  3.0  6.0  6.0  3.0
1  9.0  6.0  4.0  9.0  7.0
2  9.0  5.0  8.0  9.0  3.0
3  4.0  3.0  8.0  4.0  6.0
4  7.0  3.0  4.0  4.0  7.0
5  9.0  4.0  8.0  6.0  7.0
6  9.0  3.0  4.0  6.0  3.0
7  5.0  6.0  4.0  4.0  8.0
8  3.0  5.0  4.0  4.0  6.0
9  9.0  4.0  4.0  6.0  3.0
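
A variant of the same idea, offered as a sketch rather than a drop-in replacement: draw only as many values as there are gaps in each column and assign them through a boolean mask. The helper name `fill_with_samples` and its `seed` parameter are made up for this illustration, and I'm assuming that a column with nothing to sample from should simply be left alone.

```python
import numpy as np
import pandas as pd


def fill_with_samples(df, seed=None):
    """Return a copy of df with each NaN replaced by a random non-NaN
    value drawn (with replacement) from the same column.

    Hypothetical helper for illustration; `seed` only makes runs repeatable.
    """
    rng = np.random.default_rng(seed)
    out = df.copy()
    for col in out.columns:
        mask = out[col].isna()
        pool = out.loc[~mask, col].to_numpy()
        # Nothing to fill, or nothing to sample from: leave the column alone.
        if not mask.any() or pool.size == 0:
            continue
        # One independent draw per missing cell.
        out.loc[mask, col] = rng.choice(pool, size=int(mask.sum()), replace=True)
    return out


example = pd.DataFrame(
    {"work": [4, 7, np.nan, 4], "rest": [np.nan, 1.0, np.nan, 2.0]}
)
filled = fill_with_samples(example, seed=0)
```

Each draw is uniform over the observed entries, so a value that appears twice in the pool (like 4 in `work`) is picked twice as often as one that appears once, which is the 2/3 versus 1/3 split the question asks for.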
Ted Petrou
  • Why do you use `len(x)` within `sample` instead of just doing `n=1`? – anon_swe Sep 24 '17 at 01:45
  • @bclayman If you sample with n=1, a single value is pulled from your set and placed in every NaN, instead of the set being sampled once for each NaN. For example, in the setup given by Ted, the first column would get the same value for all three NaNs if you used n=1 instead of n=len(x). – Scott Boston Sep 24 '17 at 04:18
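
To see the difference described in the comment above, here is a small check. It is only a sketch: the column values and the `random_state` are arbitrary choices for illustration. With `n=1`, `np.where` broadcasts the single sampled value to every NaN position; with `n=len(x)` there is one candidate per row, so each NaN position gets its own draw.

```python
import numpy as np
import pandas as pd

# Arbitrary example column with three missing entries.
x = pd.Series([1.0, np.nan, 2.0, np.nan, 3.0, np.nan, 4.0, 5.0])
mask = x.isnull().to_numpy()

# n=1: a single value is drawn, and np.where broadcasts it to every NaN slot.
one_draw = np.where(mask, x.dropna().sample(1, random_state=0), x)

# n=len(x): one candidate per row, so each NaN slot gets its own draw.
many_draws = np.where(
    mask, x.dropna().sample(len(x), replace=True, random_state=0), x
)

# With n=1 the filled slots always share one value.
print(np.unique(one_draw[mask]))
# With n=len(x) the filled slots are drawn independently and may differ.
print(many_draws[mask])
```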