Standard deviation with groupby(multiple columns) Pandas

Question

I am working with data from the California Air Resources Board.

site,monitor,date,start_hour,value,variable,units,quality,prelim,name 
5407,t,2014-01-01,0,3.00,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach 
5407,t,2014-01-01,1,1.54,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach 
5407,t,2014-01-01,2,3.76,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach 
5407,t,2014-01-01,3,5.98,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach 
5407,t,2014-01-01,4,8.09,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach 
5407,t,2014-01-01,5,12.05,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach 
5407,t,2014-01-01,6,12.55,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach 
...

df = pd.concat([pd.read_csv(file, header = 0) for file in f]) #merges all files into one dataframe
df.dropna(axis = 0, how = "all", subset = ['start_hour', 'variable'],
          inplace = True) #drops bottom columns without data in them, NaN

df.start_hour = pd.to_timedelta(df['start_hour'], unit = 'h')
df.date = pd.to_datetime(df.date)
df['datetime'] = df.date + df.start_hour
df.drop(columns=['date', 'start_hour'], inplace=True)
df['month'] = df.datetime.dt.month
df['day'] = df.datetime.dt.day
df['year'] = df.datetime.dt.year
df.set_index('datetime', inplace = True)
df =  df.rename(columns={'value':'conc'})

I have multiple years of hourly PM2.5 concentration data and am trying to prepare graphs that show the average monthly concentration over many years (different graphs for each month). Here's an image of the graph I've created thus far. [![Bombay Beach][1]][1] However, I want to add error bars to the average concentration line but I am having issues when attempting to calculate the standard deviation. I've created a new dataframe d_avg that includes the year, month, day, and average concentration of PM2.5; here's some of the data.

d_avg = df.groupby(['year', 'month', 'day'], as_index=False)['conc'].mean()

   year  month  day      conc
0  2014      1    1  9.644583
1  2014      1    2  4.945652
2  2014      1    3  4.345238
3  2014      1    4  5.047917
4  2014      1    5  5.212857
5  2014      1    6  2.095714

After this, I found the monthly average m_avg and created a datetime index to plot datetime vs monthly avg conc (refer above, black line).

m_avg = d_avg.groupby(['year','month'], as_index=False)['conc'].mean()
m_avg['datetime'] = pd.to_datetime(m_avg.year.astype(str) + m_avg.month.astype(str), format='%Y%m') + MonthEnd(1)
[In]: m_avg.head(6)
[Out]:
   year  month      conc   datetime
0  2014      1  4.330985 2014-01-31
1  2014      2  2.280096 2014-02-28
2  2014      3  4.464622 2014-03-31
3  2014      4  6.583759 2014-04-30
4  2014      5  9.069353 2014-05-31
5  2014      6  9.982330 2014-06-30

Now I want to calculate the standard deviation of the d_avg concentration, and I've tried multiple things:

sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].std()

sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].agg(np.std)

sd = d_avg['conc'].apply(lambda x: x.std())

However, each attempt has left me with the same error in the dataframe. I am unable to plot the standard deviation because I believe it is taking the standard deviation of the year and month too, which I am trying to group the data by. Here's what my resulting dataframe sd looks like:

        year     month        sd
0  44.877611  1.000000  1.795868
1  44.877611  1.414214  2.355055
2  44.877611  1.732051  2.597531
3  44.877611  2.000000  2.538749
4  44.877611  2.236068  5.456785
5  44.877611  2.449490  3.315546

Please help me! [1]: https://i.stack.imgur.com/ueVrG.png

DV82XL · Answer 1 · 2020-08-10T23:03:53.877

I tried to reproduce your error and it works fine for me. Here's my complete code sample, which is pretty much exactly the same as yours EXCEPT for the generation of the original dataframe. So I'd suspect that part of the code. Can you provide the code that creates the dataframe?

import pandas as pd

columns = ['year', 'month', 'day', 'conc']
data = [[2014, 1, 1, 2.0],
        [2014, 1, 1, 4.0],
        [2014, 1, 2, 6.0],
        [2014, 1, 2, 8.0],
        [2014, 2, 1, 2.0],
        [2014, 2, 1, 6.0],
        [2014, 2, 2, 10.0],
        [2014, 2, 2, 14.0]]

df = pd.DataFrame(data, columns=columns)
d_avg = df.groupby(['year', 'month', 'day'], as_index=False)['conc'].mean()
m_avg = d_avg.groupby(['year', 'month'], as_index=False)['conc'].mean()
m_std = d_avg.groupby(['year', 'month'], as_index=False)['conc'].std()

print(f'Concentrations:\n{df}\n')
print(f'Daily Average:\n{d_avg}\n')
print(f'Monthly Average:\n{m_avg}\n')
print(f'Standard Deviation:\n{m_std}\n')

Outputs:

Concentrations:
   year  month  day  conc
0  2014      1    1   2.0
1  2014      1    1   4.0
2  2014      1    2   6.0
3  2014      1    2   8.0
4  2014      2    1   2.0
5  2014      2    1   6.0
6  2014      2    2  10.0
7  2014      2    2  14.0

Daily Average:
   year  month  day  conc
0  2014      1    1   3.0
1  2014      1    2   7.0
2  2014      2    1   4.0
3  2014      2    2  12.0

Monthly Average:
   year  month  conc
0  2014      1   5.0
1  2014      2   8.0

Monthly Standard Deviation:
   year  month      conc
0  2014      1  2.828427
1  2014      2  5.656854

This doesn't really help me... do you have any idea why my code would be taking the standard deviation of the month and year? I am calculating the standard deviate with the intention of plotting error bars on my graphs, but I cannot do this if my year and month columns are incorrect. — obscuredbyclouds, Aug 10 '20 at 19:22
@HeatherLieb I tried to reproduce the error and couldn't. Without seeing a minimal code sample that reproduces the problem, I'd just be guessing. Can you provide the code that generates the original dataframe, `df`? How else does your code differ from mine? — DV82XL, Aug 10 '20 at 23:06
@HeatherLieb also what version of Python and Pandas are you using (type `python -V` from a terminal and `pandas.__version__ from a python console)`? — DV82XL, Aug 10 '20 at 23:08
I found a way to cheat the issue by building a new dataframe, but I'm using the latest version of python on spyder. — obscuredbyclouds, Aug 11 '20 at 04:05
@HeatherLieb Glad you found a workaround. I figured the problem was with the dataframe. If you want to fix it properly, feel free to post the code that creates the original `df`. — DV82XL, Aug 11 '20 at 04:53

score -1 · Accepted Answer · answered Aug 12 '20 at 03:55

I decided to dance around my issue since I couldn't figure out what was causing the problem. I merged the m_avg and sd dataframes and dropped the year and month columns that were causing me issues. See code below, lots of renaming.

d_avg = df.groupby(['year', 'month', 'day'], as_index=False)['conc'].mean()
m_avg = d_avg.groupby(['year','month'], as_index=False)['conc'].mean()
sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].std(ddof=0) 
sd = sd.rename(columns={"conc":"sd", "year":"wrongyr", "month":"wrongmth"})
m_avg_sd = pd.concat([m_avg, sd], axis = 1)
m_avg_sd.drop(columns=['wrongyr', 'wrongmth'], inplace = True)
m_avg_sd['datetime'] = pd.to_datetime(m_avg_sd.year.astype(str) + m_avg_sd.month.astype(str), format='%Y%m') + MonthEnd(1)

and here's the new dataframe:

m_avg_sd.head(5)
Out[2]: 
   year  month       conc         sd   datetime
0  2009      1  48.350105  18.394192 2009-01-31
1  2009      2  21.929383  16.293645 2009-02-28
2  2009      3  15.094729   6.821124 2009-03-31
3  2009      4  12.021009   4.391219 2009-04-30
4  2009      5  13.449100   4.081734 2009-05-31

Standard deviation with groupby(multiple columns) Pandas

2 Answers2