Scalable Python normal distribution from pandas DataFrame

Question

I have a pandas dataframe (code below) that holds the mean and standard deviation by day of week and quarter. What I'd like to do is extract each mean and standard deviation by day of week, draw a random normal sample from those two values, then plot it.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

np.random.seed(42)
day_of_week=['mon', 'tues', 'wed', 'thur', 'fri', 'sat','sun']
year=[2017]
qtr=[1,2,3,4]
mean=np.random.uniform(5,30,len(day_of_week)*len(qtr))
std=np.random.uniform(1,10,len(day_of_week)*len(qtr))

dat=pd.DataFrame({'year':year*(len(day_of_week)*len(qtr)),
             'qtr':qtr*len(day_of_week),
             'day_of_week':day_of_week*len(qtr),
             'mean':mean,
             'std': std})
dowuq=dat.day_of_week.unique()

Right now I have a solution to the above that works, but it isn't very scalable. If I wanted to add more columns, e.g. another year, or break it out by week, this would not be efficient. I'm fairly new to Python, so any help is appreciated.

Code that works but is not scalable:

plt.style.use('fivethirtyeight')
for w in dowuq:
    datsand=dat[dat['day_of_week']==''+str(w)+''][0:4]
    mu=datsand.iloc[0]['mean']
    sigma=datsand.iloc[0]['std']
    mu2=datsand.iloc[1]['mean']
    sigma2=datsand.iloc[1]['std']
    mu3=datsand.iloc[2]['mean']
    sigma3=datsand.iloc[2]['std']
    mu4=datsand.iloc[3]['mean']
    sigma4=datsand.iloc[3]['std']             
    s1=np.random.normal(mu, sigma, 2000)
    s2=np.random.normal(mu2, sigma2, 2000)
    s3=np.random.normal(mu3, sigma3, 2000)
    s4=np.random.normal(mu4, sigma4, 2000)
    sns.kdeplot(s1, bw='scott', label='Q1')
    sns.kdeplot(s2, bw='scott', label='Q2')
    sns.kdeplot(s3, bw='scott', label='Q3')
    sns.kdeplot(s4, bw='scott', label='Q4')
    plt.title(''+str(w)+' in 2017')
    plt.ylabel('Density')
    plt.xlabel('Random')
    plt.xticks(rotation=15)
    plt.show()

score 1 · Accepted Answer · answered Jul 21 '17 at 13:55

You should probably use `groupby`, which splits a dataframe into sub-frames whose rows share a value in one or more columns. For the time being we group on 'day_of_week' only, but you could pass a list of columns to group on more of them in future if required.

We can also switch to `iterrows` to loop over the rows within each group:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

np.random.seed(42)
day_of_week = ['mon', 'tues', 'wed', 'thur', 'fri', 'sat', 'sun']
year = [2017]
qtr = [1, 2, 3, 4]
mean = np.random.uniform(5, 30, len(day_of_week) * len(qtr))
std = np.random.uniform(1, 10, len(day_of_week) * len(qtr))

dat = pd.DataFrame({'year': year * (len(day_of_week) * len(qtr)),
                    'qtr': qtr * len(day_of_week),
                    'day_of_week': day_of_week * len(qtr),
                    'mean': mean,
                    'std': std})

# Group by day of the week
for day, values in dat.groupby('day_of_week'):
    # Loop over rows for each day of the week
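    for _, r in values.iterrows():
        cur_dist = np.random.normal(r['mean'], r['std'], 2000)
        # seaborn >= 0.11 takes bw_method; older releases used bw='scott'
        sns.kdeplot(cur_dist, bw_method='scott', label='{}_Q{}'.format(day, r['qtr']))
    # Newer seaborn releases do not draw the legend automatically
    plt.legend()
    plt.title('{} in 2017'.format(day))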
    plt.ylabel('Density')
    plt.xlabel('Random')
    plt.xticks(rotation=15)
    plt.show()
    plt.clf()
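
The answer mentions that the grouping could be extended later. As a rough sketch of what that might look like, the snippet below groups on both 'year' and 'day_of_week' and collects the samples per quarter. The two-year frame (including 2018), the helper name `sample_by_group`, and the use of `np.random.default_rng` are assumptions made for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical two-year version of the question's frame (2018 is made up here).
rng = np.random.default_rng(42)
days = ['mon', 'tues', 'wed', 'thur', 'fri', 'sat', 'sun']
rows = [(y, q, d) for y in (2017, 2018) for q in (1, 2, 3, 4) for d in days]
dat = pd.DataFrame(rows, columns=['year', 'qtr', 'day_of_week'])
dat['mean'] = rng.uniform(5, 30, len(dat))
dat['std'] = rng.uniform(1, 10, len(dat))


def sample_by_group(frame, keys, n=2000, rng=rng):
    """Return {group key: {qtr: samples}} for every combination of `keys`."""
    out = {}
    # Grouping on a list of columns yields one sub-frame per key combination
    for key, values in frame.groupby(keys):
        out[key] = {
            int(r['qtr']): rng.normal(r['mean'], r['std'], n)
            for _, r in values.iterrows()
        }
    return out


# Keys are (year, day_of_week) tuples, e.g. (2017, 'mon')
samples = sample_by_group(dat, ['year', 'day_of_week'])
```

Each array in `samples[(2017, 'mon')]` could then be passed to `sns.kdeplot` in the same way as `cur_dist` above, with the tuple key used to build the plot title.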

Thanks for this. For my own clarification: the data is already at the day-of-week level, so why would you have to group by the day of week? — P.Cummings, Jul 21 '17 at 14:23
Grouping by day of the week effectively combines what you refer to as `dowuq` and `datsand`. For each unique value in the `'day_of_week'` column, `groupby` provides a dataframe consisting of only the rows that match that value. You could try printing `values` inside the first `for` loop to see this more clearly. — asongtoruin, Jul 21 '17 at 14:30
@P.Cummings you're welcome! If it helped you, you can mark it as the answer and/or vote up using the buttons to the left of the answer, which should help people with similar problems find it in future. — asongtoruin, Jul 21 '17 at 14:52
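
To illustrate the comment about what `groupby` provides, here is a minimal sketch on a small made-up frame. The values are hypothetical and only the column names follow the question. It prints each sub-frame and keeps it in a dict so it can be inspected afterwards.

```python
import pandas as pd

# Small hypothetical frame with the same column names as the question.
dat = pd.DataFrame({
    'day_of_week': ['mon', 'tues', 'mon', 'tues'],
    'qtr': [1, 1, 2, 2],
    'mean': [10.0, 12.0, 11.0, 13.0],
    'std': [2.0, 3.0, 2.5, 3.5],
})

groups = {}
for day, values in dat.groupby('day_of_week'):
    # `values` holds only the rows whose 'day_of_week' equals `day`
    groups[day] = values
    print(day)
    print(values)
```

This plays the role of the `dowuq` loop and the `datsand` filter in the question, without building the list of unique days by hand.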
