I am trying to fill the NaN values in a pandas DataFrame with random data taken from each column, so that each value is drawn with a probability proportional to its frequency in that column. I have this:

def MissingRandom(dataframe):
    import random
    dataframe = dataframe.apply(lambda x: x.fillna(
        random.choices(x.value_counts().keys(),
                       weights=list(x.value_counts()))[0]))
    return dataframe

The DataFrame gets filled with random data, but every missing value in a column receives the same value. I would like each missing value in a column to be drawn independently, but I have not been able to do it. Could anybody help me?

Thank you very much

Deco1998
    It seems (without investigation) that if you're filling `NaN` values with a random value from the column, wouldn't it be much simpler (and as effective) to use `df.fillna(method='ffill')` to forward-fill any empty values? This would also address the frequency, as more common values are likely to be followed by a missing value, than a non-common value. – S3DEV Dec 01 '20 at 19:14

2 Answers

Please see my solution below. First, I created a function that fills a Series according to your criteria (frequencies as weights in the random function); then we apply this function to every column of the DataFrame:

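import random
from collections import Counter

def fillcolumn(ser):
    # Number of missing values that need a replacement
    cna = len(ser[ser.isna()])
    # Count how often each non-missing value occurs; the counts act as weights
    d = Counter(ser[ser.notna()])
    # Draw one value per missing entry (k=cna), so the fills can differ
    m = random.choices(list(d.keys()), weights=list(d.values()), k=cna)
    ser[ser.isna()] = m
    return ser

for i in df.columns:
    df[i] = fillcolumn(df[i])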

Your full code:

def MissingRandom(dataframe):
    import random
    from collections import Counter

    def fillcolumn(ser):
        cna = len(ser[ser.isna()])
        l = ser[ser.notna()]
        d = Counter(l)
        m = random.choices(list(d.keys()), weights=list(d.values()), k=cna)
        ser[ser.isna()] = m
        return ser

    for i in dataframe.columns:
        dataframe[i] = fillcolumn(dataframe[i])
    return dataframe
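
As a possible alternative (not part of the original answer), sampling with replacement directly from the observed values gives the same frequency weighting without building the counts by hand, because a value that occurs more often is simply more likely to be picked. The sketch below is a minimal illustration under that assumption; the function name `fill_by_sampling` and the `seed` parameter are hypothetical, and it returns a copy rather than modifying the input.

```python
import numpy as np
import pandas as pd

def fill_by_sampling(ser, seed=None):
    """Return a copy of `ser` with NaNs replaced by draws from its observed values."""
    observed = ser.dropna()
    missing = ser.isna()
    n_missing = int(missing.sum())
    # Nothing to fill, or nothing to sample from (all-NaN column): return unchanged
    if n_missing == 0 or observed.empty:
        return ser.copy()
    rng = np.random.default_rng(seed)
    # Sampling with replacement from the raw values weights each value by its frequency
    draws = rng.choice(observed.to_numpy(), size=n_missing, replace=True)
    out = ser.copy()
    out.loc[missing] = draws
    return out

df = pd.DataFrame({"a": [1, np.nan, 3, 4, np.nan],
                   "b": [np.nan, 12, np.nan, np.nan, 15]})
# seed=0 is only here to make the illustration repeatable
filled = df.apply(lambda col: fill_by_sampling(col, seed=0))
print(filled)
```

Because the function works on a copy, the original `df` keeps its NaN values, which can be convenient when comparing several imputation strategies.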
IoaTzimas

Here are two thoughts on the (interesting!) subject.

  • Create a replace function and call apply
  • Use fillna(method='ffill')

Replace function:

Setup:

import random
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 3, 4, np.nan],
                   'b': [np.nan, 12, np.nan, np.nan, 15],
                   'c': [21, np.nan, np.nan, 24, 25],
                   'd': [31, np.nan, np.nan, 34, 34]})

Example function:

def replace_na(x):
    """Replace NaN values with values randomly selected from the Series."""
    vc = x.value_counts()
    r = random.choices(vc.keys(), weights=vc.values, k=x.isnull().sum())
    x[x.isnull()] = r
    return x

Apply:

df.apply(lambda x: replace_na(x))

Output:

     a     b     c     d
0  1.0  12.0  21.0  31.0
1  4.0  12.0  25.0  34.0
2  3.0  15.0  21.0  34.0
3  4.0  15.0  24.0  34.0
4  1.0  15.0  25.0  34.0
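
The example above is too small to show whether the fills follow the column's frequencies. One way to check, sketched here under assumptions of my own (the 80/20 split, the sample size and the name `replace_na_copy` are all made up for illustration), is to fill a large number of NaNs and look at the share of each value among the filled entries:

```python
import random
import numpy as np
import pandas as pd

def replace_na_copy(x):
    """Same idea as replace_na, but works on a copy (hypothetical helper)."""
    vc = x.value_counts()  # NaN is excluded by default
    k = int(x.isna().sum())
    r = random.choices(list(vc.index), weights=list(vc.values), k=k)
    out = x.copy()
    out.loc[out.isna()] = r
    return out

# 100 observed values (80% are 1.0, 20% are 2.0) followed by 10,000 NaNs
s = pd.Series([1.0] * 80 + [2.0] * 20 + [np.nan] * 10_000)
filled = replace_na_copy(s)

# Share of each value among the filled positions only
share = filled.iloc[100:].value_counts(normalize=True)
print(share)
```

The share of `1.0` among the filled positions should come out close to 0.8, which is its share among the observed values.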

A different thought:

A different thought process ... as problem solving is about looking at different angles.

I acknowledge that this approach does not meet the OP's specific request - but perhaps meets the underlying intent.

If the goal is to fill NaN values with a value taken from the same column, it might be simpler (and arguably as effective) to forward-fill the empty values. This would also roughly respect the frequencies, since a missing value is more likely to follow a common value than a less common one. Note that forward-filling copies the preceding value, so consecutive missing values all receive the same value.

df.ffill()  # same result as df.fillna(method='ffill'), which newer pandas versions deprecate
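
To make the behaviour of forward-filling concrete, here is a small sketch on two of the columns from the setup above. It shows two properties worth knowing before choosing this route: a NaN in the first row has nothing before it and therefore stays NaN, and a run of consecutive NaNs is filled with one repeated value. Chaining `bfill()` afterwards is one possible way (an assumption here, not something the answer proposes) to handle the leading NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 3, 4, np.nan],
                   'b': [np.nan, 12, np.nan, np.nan, 15]})

# Forward-fill: each NaN takes the last non-missing value above it
filled = df.ffill()
print(filled)

# Column 'b' starts with NaN, so one missing value remains after ffill
remaining = filled.isna().sum()
print(remaining)

# Optionally back-fill whatever is left at the top of a column
filled_both = df.ffill().bfill()
print(filled_both)
```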
S3DEV