Randomly insert NA values in a pandas DataFrame

Question

How can I randomly insert np.nan values into a DataFrame? Let's say I want 10% of the values in my DataFrame to be null.

My data looks like this:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 3),
                  index=['a', 'b', 'c', 'd', 'e'],
                  columns=['one', 'two', 'three'])

        one       two     three
a  0.695132  1.044791 -1.059536
b -1.075105  0.825776  1.899795
c -0.678980  0.051959 -0.691405
d -0.182928  1.455268 -1.032353
e  0.205094  0.714192 -0.938242

Is there an easy way to insert the null values?

Kodiologist · Accepted Answer · 2018-02-03T19:01:49.797

Here's a way to clear exactly 10% of cells (or rather, as close to 10% as can be achieved with the existing data frame's size).

import random
ix = [(row, col) for row in range(df.shape[0]) for col in range(df.shape[1])]
for row, col in random.sample(ix, int(round(.1*len(ix)))):
    df.iat[row, col] = np.nan

Here's a way to clear cells independently with a per-cell probability of 10%.

df = df.mask(np.random.random(df.shape) < .1)
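If you need reproducible results, the exact-count variant can be wrapped in a small helper that uses a seeded generator. This is only a sketch: the function name `insert_nans_exact` and its `seed` parameter are my own choices, not part of pandas. It picks cell positions with NumPy rather than `random.sample`, and it returns a copy instead of modifying `df` in place.

```python
import numpy as np
import pandas as pd


def insert_nans_exact(df, frac=0.1, seed=None):
    """Return a copy of df with round(frac * number_of_cells) cells set to NaN.

    Hypothetical helper for illustration; not a pandas function.
    """
    rng = np.random.default_rng(seed)
    n_cells = df.size
    n_nan = int(round(frac * n_cells))
    # Draw distinct flat cell positions, then turn them into a boolean mask.
    flat_idx = rng.choice(n_cells, size=n_nan, replace=False)
    mask = np.zeros(n_cells, dtype=bool)
    mask[flat_idx] = True
    # DataFrame.mask replaces cells where the condition is True with NaN.
    return df.mask(mask.reshape(df.shape))


df = pd.DataFrame(np.random.default_rng(0).standard_normal((5, 3)),
                  index=['a', 'b', 'c', 'd', 'e'],
                  columns=['one', 'two', 'three'])
out = insert_nans_exact(df, frac=0.1, seed=42)
print(out)
print("NaN cells:", int(out.isna().sum().sum()))
```

With 15 cells and `frac=0.1`, the target of 1.5 cells is rounded, so the count is only as close to 10% as the frame size allows.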

Jaroslav Bezděk · Answer 2 · 2021-11-19T11:16:26.280


You can iterate over the data frame's columns and, for each one, assign NaN to the rows returned by the pandas.DataFrame.sample() method.

The code is as follows.

for col in df.columns:
    # sample() is called once per column, so each column loses a different set of rows
    df.loc[df.sample(frac=0.1).index, col] = np.nan
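Here is a small check of what this loop does, as a sketch on a larger frame. The name `big` and the per-column `random_state` values are my own additions for reproducibility. Note that `sample(frac=...)` rounds the number of rows per column, so on a very small frame such as the 5-row example in the question, `frac=0.1` may, as far as I can tell, select no rows at all.

```python
import numpy as np
import pandas as pd

# Hypothetical 100-row frame so that 10% of the rows is a whole number.
rng = np.random.default_rng(0)
big = pd.DataFrame(rng.standard_normal((100, 3)),
                   columns=['one', 'two', 'three'])

for i, col in enumerate(big.columns):
    # A different random_state per column gives each column its own set of rows.
    rows = big.sample(frac=0.1, random_state=i).index
    big.loc[rows, col] = np.nan

# Number of missing values in each column.
nan_per_col = big.isna().sum()
print(nan_per_col)
```

Unlike a cell-level mask, this gives every column the same number of missing values.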

edited Nov 19 '21 at 11:16

answered Apr 03 '20 at 18:30


score 0 · Answer 3 · answered May 08 '21 at 01:27

To build on @Jaroslav Bezděk's code with a small modification, here is my version. I am assuming that you only want to insert NaNs into numeric columns.

# select only numeric columns to apply the missingness to
cols_list = df.select_dtypes('number').columns.tolist()

# randomly set 5% of the rows in each numeric column to NaN
for col in cols_list:
    df.loc[df.sample(frac=0.05).index, col] = np.nan

Note: if you use pd.np.nan, you get the following warning: <ipython-input-5-e9827aa92133>:9: FutureWarning: The pandas.np module is deprecated and will be removed from pandas in a future version. Import numpy directly instead.
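To show the effect of the numeric-only selection, here is a minimal sketch on a made-up frame with one text column. The frame `mixed` and its column names are assumptions for illustration only. The numeric columns are floats on purpose, since writing NaN into an integer column would force a dtype change.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one text column and two float columns, 20 rows.
mixed = pd.DataFrame({
    'label': list('abcdefghij') * 2,
    'x': np.arange(20, dtype=float),
    'y': np.arange(20, dtype=float) * 2,
})

# Only numeric columns are candidates for missing values.
numeric_cols = mixed.select_dtypes('number').columns.tolist()

for i, col in enumerate(numeric_cols):
    # 5% of 20 rows is one row per numeric column.
    rows = mixed.sample(frac=0.05, random_state=i).index
    mixed.loc[rows, col] = np.nan

print(mixed.isna().sum())
```

The `label` column is never touched, because it is not in `numeric_cols`.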

I totally agree. It would be better to use `np.nan` instead of `pd.np.nan`. — Jaroslav Bezděk, Nov 19 '21 at 11:19
