Filling each row of one column of a DataFrame with different values (a random distribution)

Question

I have a DataFrame with approx. 4 columns and 200 rows. I created a 5th column with null values:

df['minutes'] = np.nan

Then, I want to fill each row of this new column with random inverse log normal values. The code to generate 1 inverse log normal:

Note: if the code below is run multiple times, it generates a new result each time because of the value passed to ppf(): random.random()

df['minutes'] = df['minutes'].fillna(stats.lognorm(0.5, scale=np.exp(1.8)).ppf(random.random()).astype(int))

What's happening when I do that is that it's filling all 200 rows of df['minutes'] with the same number, instead of triggering the random.random() for each row as I expected it to.

What do I have to do? I tried using a for loop, but apparently I'm not getting it right (it gives the same results):

for i in range(1,len(df)):
    df['minutes'] = df['minutes'].fillna(stats.lognorm(0.5, scale=np.exp(1.8)).ppf(random.random()).astype(int))

What am I doing wrong?

Also, I'll add that later I'll need to change some parameters of the inverse log normal above depending on whether the value of another column is 0 or 1, as in:

if df['type'] == 0:
     df['minutes'] = df['minutes'].fillna(stats.lognorm(0.5, scale=np.exp(1.8)).ppf(random.random()).astype(int))
elif df['type'] == 1:
     df['minutes'] = df['minutes'].fillna(stats.lognorm(1.2, scale=np.exp(2.7)).ppf(random.random()).astype(int))

Thanks in advance.

score 1 · Answer 1 · answered Jul 26 '18 at 15:17

The problem with your use of fillna here is that this function takes a value as argument and applies it to every element along the specified axis. So your stat value is calculated once and then distributed into every row.

What you need is your function called for every element on the axis, so your argument must be the function itself and not a value. That's a job for apply, which takes a function and applies it on elements along an axis.
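To make the difference concrete, here is a minimal sketch. It uses a made-up five-row column and a bare random.random() instead of your lognormal call, and the seed is only there to make it repeatable:

```python
import random

import numpy as np
import pandas as pd

random.seed(0)  # arbitrary seed, only so the sketch is repeatable

# Hypothetical stand-in for the real DataFrame: five missing values.
df = pd.DataFrame({'minutes': [np.nan] * 5})

# The argument is evaluated before fillna is called, so fillna receives
# a single number and writes that same number into every missing cell.
filled_once = df['minutes'].fillna(random.random())

# apply calls the lambda once per element, so each row gets its own draw.
filled_per_row = df['minutes'].apply(lambda _: random.random())

print(filled_once.nunique())     # number of distinct values after fillna
print(filled_per_row.nunique())  # number of distinct values after apply
```

The first count is 1 because only one draw was ever made; the second is larger because the lambda ran for every row.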

I'll jump straight to your final requirements:

You could use apply on just the minutes column (as a pandas.Series method) with a lambda function, and then assign the results to the rows of minutes selected by the value of the type column:

import numpy as np
import pandas as pd
import scipy.stats as stats
import random

# setup
df = pd.DataFrame(np.random.randint(0, 2, size=(8, 4)),
                  columns=list('ABC') + ['type'])
df['minutes'] = np.nan


# Each apply call draws one value per row; the .loc assignment aligns
# on the index and keeps only the rows whose type matches.
df.loc[df.type == 0, 'minutes'] = \
    df['minutes'].apply(lambda _: stats.lognorm(
        0.5, scale=np.exp(1.8)).ppf(random.random()).astype(int))

df.loc[df.type == 1, 'minutes'] = \
    df['minutes'].apply(lambda _: stats.lognorm(
        1.2, scale=np.exp(2.7)).ppf(random.random()).astype(int))

... or you use apply as a DataFrame method with a function wrapping your logic to distinguish between values of type-column and assign the result back to the minutes-column:

def calc_minutes(row):
    if row['type'] == 0:
        return stats.lognorm(0.5, scale=np.exp(1.8)).ppf(random.random()).astype(int)
    elif row['type'] == 1:
        return stats.lognorm(1.2, scale=np.exp(2.7)).ppf(random.random()).astype(int)

df['minutes'] = df.apply(calc_minutes, axis=1)
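If the row-by-row apply ever gets slow on a larger frame, a vectorized variant is possible. The following is only a sketch under a few assumptions that are not in your code: it swaps random.random() for NumPy's Generator (np.random.default_rng), uses an arbitrary seed, builds a made-up 200-row frame with nothing but a type column, and treats every type other than 0 as type 1:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)  # arbitrary seed, for repeatability only

# Hypothetical stand-in for the real DataFrame: 200 rows, type is 0 or 1.
df = pd.DataFrame({'type': rng.integers(0, 2, size=200)})

# Pick the distribution parameters per row from the type column.
# Assumption: anything that is not type 0 is treated as type 1.
shape = np.where(df['type'] == 0, 0.5, 1.2)
scale = np.where(df['type'] == 0, np.exp(1.8), np.exp(2.7))

# One uniform draw per row, pushed through the inverse CDF (ppf).
# The parameter arrays broadcast element-wise against the draws.
u = rng.random(len(df))
df['minutes'] = stats.lognorm.ppf(u, shape, scale=scale).astype(int)

print(df.head())
```

This calls ppf once on whole arrays instead of once per row, so the uniform draws and the parameters are matched up position by position.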

Thanks! I didn't know about `apply`; I'll definitely keep it in mind for future code. Meanwhile, I managed to do what I needed with a different approach using list `append` (submitted as an answer below). - mrbTT, Jul 26 '18 at 16:11

score 0 · Accepted Answer · answered Jul 26 '18 at 16:07

Managed to do it in a few steps with a different mindset:

Created 2 lists, each with its own parameters

Used list append inside a loop so that each row gets a different random number:

lognormal_tone = []
lognormal_ttwo = []
for i in range(len(s)):
    lognormal_tone.append(stats.lognorm(0.5, scale=np.exp(1.8)).ppf(random.random()).astype(int))
    lognormal_ttwo.append(stats.lognorm(0.4, scale=np.exp(2.7)).ppf(random.random()).astype(int))

Then, included them in the DataFrame with another previously created list:

df = pd.DataFrame({'arrival': arrival, 'minTypeOne': lognormal_tone, 'minTypeTwo': lognormal_ttwo})
