Fill NaN values of DataFrame with random values from the column, depending on frequency

Question

I am trying to fill the NaN values of a pandas DataFrame with random values taken from each column, so that each value is picked according to its frequency in that column. I have this:

def MissingRandom(dataframe):
    import random
    dataframe = dataframe.apply(lambda x: x.fillna(
        random.choices(x.value_counts().keys(),
                       weights=list(x.value_counts()))[0]))
    return dataframe

I get the DataFrame filled in with random data, but it's the same value for all the missing entries of a column. I would like the value to be different for every missing entry of the column, but I am not able to do it. Could anybody help me?

Thank you very much

It seems (without investigation) that if you're filling `NaN` values with a random value from the column, wouldn't it be much simpler (and as effective) to use `df.fillna(method='ffill')` to forward-fill any empty values? This would also address the frequency, as more common values are likely to be followed by a missing value, than a non-common value. — S3DEV, Dec 01 '20 at 19:14

score 2 · Answer 1 · answered Dec 01 '20 at 19:00

Please see my solution below. First, I created a function that fills a Series based on your criteria (frequencies as weights in the random function), and then we apply this function to all columns of the DataFrame:

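import random
from collections import Counter

def fillcolumn(ser):
    cna = len(ser[ser.isna()])   # number of missing values
    l = ser[ser.notna()]         # observed (non-missing) values
    d = Counter(l)               # value -> frequency
    # draw one value per missing entry, weighted by frequency
    m = random.choices(list(d.keys()), weights=list(d.values()), k=cna)
    ser[ser.isna()] = m
    return ser

for i in df.columns:
    df[i] = fillcolumn(df[i])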

Your full code:

def MissingRandom(dataframe):
    import random
    from collections import Counter
    def fillcolumn(ser):
        cna=len(ser[ser.isna()])
        l=ser[ser.notna()]
        d=Counter(l)    
        m=random.choices(list(d.keys()), weights = list(d.values()), k=cna)
        ser[ser.isna()]=m
        return ser
        
    for i in dataframe.columns:
        dataframe[i]=fillcolumn(dataframe[i])
    return dataframe
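
As a possible alternative (a minimal sketch, not part of the original answer): drawing uniformly, with replacement, from the observed values of a column already picks each distinct value in proportion to its frequency, so the explicit Counter and weights are not strictly needed. The function name `fill_na_by_frequency` and the `seed` parameter are hypothetical, chosen only for this illustration. Unlike the code above, this version works on a copy and leaves the input untouched.

```python
import numpy as np
import pandas as pd


def fill_na_by_frequency(df, seed=None):
    """Return a copy of df with NaNs replaced by values drawn from the same column."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for col in out.columns:
        missing = out[col].isna()
        observed = out.loc[~missing, col].to_numpy()
        # Skip columns with nothing to fill or nothing to draw from.
        if missing.any() and len(observed) > 0:
            # A uniform draw over the observed values (with replacement)
            # selects each distinct value in proportion to its frequency.
            out.loc[missing, col] = rng.choice(observed, size=int(missing.sum()))
    return out


df = pd.DataFrame({"a": [1, np.nan, 1, 2, np.nan],
                   "b": ["x", None, "y", "x", None]})
filled = fill_na_by_frequency(df, seed=0)
print(filled)
```

Each missing entry gets its own independent draw, so different gaps in the same column can receive different values.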

S3DEV · Answer 2 · 2020-12-01T19:44:10.563

Here are two thoughts on the (interesting!) subject.

1. Create a replace function and call apply
2. Use fillna(method='ffill')

Replace function:

Setup:

import random
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 3, 4, np.nan],
                   'b': [np.nan, 12, np.nan, np.nan, 15],
                   'c': [21, np.nan, np.nan, 24, 25],
                   'd': [31, np.nan, np.nan, 34, 34]})

Example function:

def replace_na(x):
    """Replace NaN values with values randomly selected from the Series."""
    vc = x.value_counts()
    r = random.choices(vc.keys(), weights=vc.values, k=x.isnull().sum())
    x[x.isnull()] = r
    return x

Apply:

df.apply(lambda x: replace_na(x))

Output:

     a     b     c     d
0  1.0  12.0  21.0  31.0
1  4.0  12.0  25.0  34.0
2  3.0  15.0  21.0  34.0
3  4.0  15.0  24.0  34.0
4  1.0  15.0  25.0  34.0

A different thought:

A different thought process ... as problem solving is about looking at different angles.

I acknowledge that this approach does not meet the OP's specific request - but perhaps meets the underlying intent.

If filling NaN values with a random value from the column, it might be simpler (and just as effective) to forward-fill the empty values. This would also roughly reflect the frequencies, since a missing value is more likely to follow a common value than an uncommon one.

df.fillna(method='ffill')  # deprecated since pandas 2.1; df.ffill() is the equivalent call
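
To make the trade-off concrete, here is a small sketch on two columns of the setup frame (it uses `df.ffill()`, which gives the same result as the `fillna` call above). It illustrates two properties of forward-filling that differ from random sampling: a NaN in the first row has nothing before it and stays missing, and consecutive gaps all receive the same preceding value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 3, 4, np.nan],
                   'b': [np.nan, 12, np.nan, np.nan, 15]})

filled = df.ffill()  # carry the last observed value forward

# Column 'b' starts with NaN, so row 0 has no earlier value to copy.
print(filled['b'].isna().tolist())  # [True, False, False, False, False]

# Rows 2 and 3 of 'b' are consecutive gaps; both get the preceding value 12.
print(filled['b'].tolist())
```

If the leading NaN matters, it would need a separate step (for example a backward fill afterwards), which is outside what this answer proposes.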
