I'm working with the following DataFrame, whose column contains string values:

maturity_rating
0   NaN
1   Rated: 18+ (R)
2   Rated: 7+ (PG)
3   NaN
4   Rated: 18+ (R)

I'm trying to fill the NaN values randomly with other non-null values present in the same column.

My expected output is:

maturity_rating
0   Rated: 7+ (PG)
1   Rated: 18+ (R)
2   Rated: 7+ (PG)
3   Rated: 18+ (R)
4   Rated: 18+ (R)

I tried the following snippet:

df["maturity_rating"].fillna(lambda x: random.choice(df[df['maturity_rating'] != np.nan]["maturity_rating"]), inplace =True)

However, when I check the unique values, the NaN entries have been replaced with a lambda object:

df["maturity_rating"].unique()

Out[117]:
array([<function <lambda> at 0x7fe8d0431a60>, 'Rated: 18+ (R)',
       'Rated: 7+ (PG)', 'Rated: 13+ (PG-13)', 'Rated: All (G)',
       'Rated: 16+'], dtype=object)

Please advise.

The Singularity
  • [This comment has the working answer](https://stackoverflow.com/questions/36413314/filling-missing-data-by-random-choosing-from-non-missing-values-in-pandas-datafr#comment60443463_36413698) – mck Feb 28 '21 at 07:14
  • Is there some rule as to how we fill NaN? You can look into pd.fillna and check the method argument for ffill, bfill, etc. – Aditya Feb 28 '21 at 07:15
  • @mck I've tried that method, but it has its own issues with string values – The Singularity Feb 28 '21 at 07:18

1 Answer

Let us try np.random.choice:

m = df['maturity_rating'].isna()
df.loc[m, 'maturity_rating'] = np.random.choice(df.loc[~m, 'maturity_rating'], m.sum())
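
For anyone who wants to run the two lines above on their own, here is a minimal self-contained sketch. The DataFrame is rebuilt from the five rows shown in the question, and the fixed seed is an assumption added only to make the run repeatable; it is not part of the original snippet.

```python
import numpy as np
import pandas as pd

# Rebuild the five-row example from the question.
df = pd.DataFrame({
    "maturity_rating": [
        np.nan,
        "Rated: 18+ (R)",
        "Rated: 7+ (PG)",
        np.nan,
        "Rated: 18+ (R)",
    ]
})

# Assumption: a fixed seed, used here only so repeated runs give the same fill.
np.random.seed(0)

# Boolean mask marking the rows that need a value.
m = df["maturity_rating"].isna()

# Draw one replacement per missing row from the non-missing values.
pool = df.loc[~m, "maturity_rating"].to_numpy()
df.loc[m, "maturity_rating"] = np.random.choice(pool, size=m.sum())

print(df)
```

Only the rows selected by the mask are overwritten, so the ratings that were already present stay untouched.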

Details:

Create a boolean mask using Series.isna that marks the rows where the maturity_rating column contains NaN values:

>>> m

0     True
1    False
2    False
3     True
4    False
Name: maturity_rating, dtype: bool

Use boolean indexing with the inverted mask ~m to select the non-NaN elements of the maturity_rating column, then use np.random.choice to randomly sample from these elements:

>>> df.loc[~m, 'maturity_rating']

1    Rated: 18+ (R)
2    Rated: 7+ (PG)
4    Rated: 18+ (R)
Name: maturity_rating, dtype: object

>>> np.random.choice(df.loc[~m, 'maturity_rating'], m.sum())

array(['Rated: 18+ (R)', 'Rated: 7+ (PG)'], dtype=object)
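
Two properties of this sampling step are worth keeping in mind: np.random.choice draws with replacement by default, and the pool still contains duplicates, so a rating that appears more often in the column is more likely to be picked. The sketch below illustrates this with NumPy's Generator API (np.random.default_rng), which is a swapped-in alternative to the legacy np.random.choice call used in this answer; the seed value 42 and the variable names are assumptions chosen for illustration.

```python
import numpy as np

# The non-missing values from the example, duplicates included.
pool = np.array(
    ["Rated: 18+ (R)", "Rated: 7+ (PG)", "Rated: 18+ (R)"], dtype=object
)

# Assumption: seed 42 is arbitrary; any fixed seed makes the draw repeatable.
rng = np.random.default_rng(42)
sample = rng.choice(pool, size=2)  # replace=True is the default

# A fresh generator with the same seed reproduces the same draw.
again = np.random.default_rng(42).choice(pool, size=2)

# Each draw is weighted by how often a value occurs in the pool.
values, counts = np.unique(pool.astype(str), return_counts=True)
probs = counts / counts.sum()

print(sample, again)
print(dict(zip(values, probs)))
```

If a uniform pick over the distinct ratings is preferred instead, the pool could be deduplicated first (for example with Series.unique) before sampling.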

Finally, use boolean indexing with the mask m to fill the NaN values in the maturity_rating column with the sampled choices above:

>>> df

  maturity_rating
0  Rated: 18+ (R)
1  Rated: 18+ (R)
2  Rated: 7+ (PG)
3  Rated: 7+ (PG)
4  Rated: 18+ (R)
Shubham Sharma