I have a dataframe with the following structure:
import pandas as pd
import numpy as np
names = ['PersonA', 'PersonB', 'PersonC', 'PersonD','PersonE','PersonF']
team = ['Team1','Team2']
dates = pd.date_range(start = '2020-05-28', end = '2021-11-22')
df = pd.DataFrame({'runtime': np.repeat(dates, len(names)*len(team))})
df['name'] = len(dates)*len(team)*names
df['team'] = len(dates)*len(names)*team
df['A'] = 40 + 20*np.random.random(len(df))
df['B'] = .1 * np.random.random(len(df))
df['C'] = 1 +.5 * np.random.random(len(df))
I would like to create a dataframe that displays calculated mean values of the runtime over periods such as the previous Week, Month, Year, and All-Time, so that it looks like this:
name | team | A_w | B_w | C_w | A_m | B_m | C_m | A_y | B_y | C_y | A_at | B_at | C_at
I have successfully added a calculated column for the mean value using the lambda method described here: How do I create a new column from the output of pandas groupby().sum()?
e.g.:
df = df.groupby(['name','team'], as_index=True).apply(lambda gdf: gdf.assign(A_at=lambda gdf: gdf['A'].mean()))
My output gives me an additional column:
runtime name team A B C A_at
0 2020-05-28 PersonA Team1 55.608186 0.027767 1.311662 49.957820
1 2020-05-28 PersonB Team2 43.481041 0.038685 1.144240 50.057015
2 2020-05-28 PersonC Team1 47.277667 0.012190 1.047263 50.151846
3 2020-05-28 PersonD Team2 41.995354 0.040623 1.087151 50.412061
4 2020-05-28 PersonE Team1 49.824062 0.036805 1.416110 50.073381
... ... ... ... ... ... ... ...
6523 2021-11-22 PersonB Team2 46.799963 0.069523 1.322076 50.057015
6524 2021-11-22 PersonC Team1 48.851620 0.007291 1.473467 50.151846
6525 2021-11-22 PersonD Team2 49.711142 0.051443 1.044063 50.412061
6526 2021-11-22 PersonE Team1 57.074027 0.095908 1.464404 50.073381
6527 2021-11-22 PersonF Team2 41.372381 0.059240 1.132346 50.094965
[6528 rows x 7 columns]
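(For what it's worth, I believe the same all-time column can be produced more directly with `transform`, which broadcasts the group mean back to every row without the extra index level from `apply` — though this still leaves the per-period problem open. Untested sketch against the example frame above:)

```python
import pandas as pd
import numpy as np

# same example frame as above
names = ['PersonA', 'PersonB', 'PersonC', 'PersonD', 'PersonE', 'PersonF']
team = ['Team1', 'Team2']
dates = pd.date_range(start='2020-05-28', end='2021-11-22')
df = pd.DataFrame({'runtime': np.repeat(dates, len(names) * len(team))})
df['name'] = len(dates) * len(team) * names
df['team'] = len(dates) * len(names) * team
df['A'] = 40 + 20 * np.random.random(len(df))

# transform('mean') returns a Series aligned to df's index, with each
# row holding its group's mean -- same result as groupby().apply(assign)
df['A_at'] = df.groupby(['name', 'team'])['A'].transform('mean')
```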
But this is where it gets messy...
I don't need the runtime column, and I am unsure how to clean this up so that it only lists the 'name' and 'team' columns. Additionally, the way I have been producing my source dataframe(s) is by recreating the entire dataframe inside a for loop for each time period:
for pt in runtimes[:d]:
    <insert dataframe creation for d# of runtimes>
    if d == 7:
        dfw = df.groupby(['name','team'], as_index=True).apply(lambda gdf: gdf.assign(A_w=lambda gdf: gdf['A'].mean()))
    if d == 30:
        dfm = df.groupby(['name','team'], as_index=True).apply(lambda gdf: gdf.assign(A_m=lambda gdf: gdf['A'].mean()))
I am then attempting to concatenate the outputs like so:
dfs = pd.concat([dfw, dfm])
This works "OK" when d < 30, but when I'm looking at 90-100 days it creates a dataframe with 50000+ rows and concatenates it with each of the other dataframes. Is there a way to perform this operation for x# of previous runtime values in-place, without rebuilding the dataframe for every period?
Any tips on how to make this more efficient would be greatly appreciated.
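This is the kind of consolidated approach I'm imagining, for reference (an untested sketch — the trailing-window cutoffs, the `windows` dict, and the column suffixes are my own guesses at what the result should look like):

```python
import pandas as pd
import numpy as np

# same example frame as above
names = ['PersonA', 'PersonB', 'PersonC', 'PersonD', 'PersonE', 'PersonF']
team = ['Team1', 'Team2']
dates = pd.date_range(start='2020-05-28', end='2021-11-22')
df = pd.DataFrame({'runtime': np.repeat(dates, len(names) * len(team))})
df['name'] = len(dates) * len(team) * names
df['team'] = len(dates) * len(names) * team
df['A'] = 40 + 20 * np.random.random(len(df))
df['B'] = .1 * np.random.random(len(df))
df['C'] = 1 + .5 * np.random.random(len(df))

end = df['runtime'].max()
# suffix -> trailing window in days; None means all-time
windows = {'w': 7, 'm': 30, 'y': 365, 'at': None}

pieces = []
for suffix, days in windows.items():
    # filter to the trailing window, then aggregate once per (name, team)
    sub = df if days is None else df[df['runtime'] > end - pd.Timedelta(days=days)]
    agg = sub.groupby(['name', 'team'])[['A', 'B', 'C']].mean()
    pieces.append(agg.add_suffix(f'_{suffix}'))

# one row per (name, team), one column per metric/period combination
result = pd.concat(pieces, axis=1).reset_index()
```

This avoids rebuilding the full dataframe per period: each window is just a boolean filter plus one aggregation, and the runtime column disappears naturally in the groupby.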