
I have a dataframe with the following structure:

import pandas as pd
import numpy as np

names = ['PersonA', 'PersonB', 'PersonC', 'PersonD','PersonE','PersonF']
team = ['Team1','Team2']
dates = pd.date_range(start = '2020-05-28', end = '2021-11-22')

# One row per (runtime, name); the name and team lists cycle together,
# so each name always maps to the same team.
df = pd.DataFrame({'runtime': np.repeat(dates, len(names)*len(team))})
df['name'] = len(dates)*len(team)*names
df['team'] = len(dates)*len(names)*team
df['A'] = 40 + 20*np.random.random(len(df))
df['B'] = 0.1*np.random.random(len(df))
df['C'] = 1 + 0.5*np.random.random(len(df))

I would like to create a dataframe that displays the calculated mean values over runtime periods such as the previous week, month, year, and all-time, so that it looks like this:

name | team | A_w | B_w | C_w | A_m | B_m | C_m | A_y | B_y | C_y | A_at | B_at | C_at
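
(As a concrete illustration of that layout, the all-time block on its own would be something like the sketch below, tagging the columns with add_suffix; the other periods would just contribute more suffixed blocks. The variable name 'at' is only for illustration.)

# Sketch of the all-time block of the target layout: one row per name/team,
# with the mean columns tagged by the period suffix.
at = df.groupby(['name', 'team'])[['A', 'B', 'C']].mean().add_suffix('_at').reset_index()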

I have successfully added a calculated column for the mean value using the lambda method described here: How do I create a new column from the output of pandas groupby().sum()?

e.g.: df = df.groupby(['name','team'], as_index=True).apply(lambda gdf: gdf.assign(A_at=lambda gdf: gdf['A'].mean()))

My output gives me an additional column:

   runtime     name   team          A         B         C       A_at
0    2020-05-28  PersonA  Team1  55.608186  0.027767  1.311662  49.957820
1    2020-05-28  PersonB  Team2  43.481041  0.038685  1.144240  50.057015
2    2020-05-28  PersonC  Team1  47.277667  0.012190  1.047263  50.151846
3    2020-05-28  PersonD  Team2  41.995354  0.040623  1.087151  50.412061
4    2020-05-28  PersonE  Team1  49.824062  0.036805  1.416110  50.073381
...         ...      ...    ...        ...       ...       ...        ...
6523 2021-11-22  PersonB  Team2  46.799963  0.069523  1.322076  50.057015
6524 2021-11-22  PersonC  Team1  48.851620  0.007291  1.473467  50.151846
6525 2021-11-22  PersonD  Team2  49.711142  0.051443  1.044063  50.412061
6526 2021-11-22  PersonE  Team1  57.074027  0.095908  1.464404  50.073381
6527 2021-11-22  PersonF  Team2  41.372381  0.059240  1.132346  50.094965

[6528 rows x 7 columns]
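
(For comparison, I believe the same column of values can also be produced with groupby().transform('mean'), which skips the nested apply/assign; a minimal sketch:)

# Broadcast the per-group all-time mean of A back onto every row
df['A_at'] = df.groupby(['name', 'team'])['A'].transform('mean')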

But this is where it gets messy...

I don't need the runtime column, and I am unsure how to clean this up so that it only lists the 'name' and 'team' columns. Additionally, the way I have been producing my source dataframe(s) is by recreating the entire dataframe inside a for loop, once per time period:

for pt in runtimes[:d]:
    <insert dataframe creation for d# of runtimes>
    if d == 7:
        dfw = df.groupby(['name','team'], as_index=True).apply(lambda gdf: gdf.assign(A_w=lambda gdf: gdf['A'].mean()))
    if d == 30:
        dfm = df.groupby(['name','team'], as_index=True).apply(lambda gdf: gdf.assign(A_m=lambda gdf: gdf['A'].mean()))

I am then attempting to concatenate the outputs like so:

dfs = pd.concat([dfw, dfm])

This works "OK" when d < 30, but when I'm looking at 90-100 days, it creates a dataframe with 50000+ rows and concats it with each other dataframe. Is there a way to perform this operation for x# of previous runtime values in-place?

Any tips on how to make this more efficient would be greatly appreciated.


1 Answer


An update... I have been able to formulate a decent output by doing the following:

dfs = pd.DataFrame(columns=['name', 'team'])
for pt in runtimes[:d]:
    if d == 7:
        df = <insert dataframe creation for d# of runtimes>
        dfw = df.groupby(['name', 'team'], as_index=True).apply(lambda gdf: gdf.assign(A_w=lambda gdf: gdf['A'].mean()))
        ...
        dfw = dfw[['name', 'A_w', 'B_w', 'C_w', 'team']]
        dfs = pd.merge(dfs, dfw, how='inner', on=['name', 'team'])
    if d == 30:
        df = <insert dataframe creation for d# of runtimes>
        dfm = df.groupby(['name', 'team'], as_index=True).apply(lambda gdf: gdf.assign(A_m=lambda gdf: gdf['A'].mean()))
        ...
        dfm = dfm[['name', 'A_m', 'B_m', 'C_m', 'team']]
        dfs = pd.merge(dfs, dfm, how='inner', on=['name', 'team'])

This gives me the output that I am expecting.
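
(The chain of merges can also be collapsed with functools.reduce; this is just a sketch of the same idea, where dfy and dfat are the hypothetical year and all-time frames built the same way, and each frame is assumed to already hold one row per name/team, e.g. after a drop_duplicates.)

from functools import reduce

# Assumes each per-period frame carries only 'name', 'team' and its suffixed
# mean columns, deduplicated to one row per name/team pair.
period_frames = [dfw, dfm, dfy, dfat]
dfs = reduce(lambda left, right: pd.merge(left, right, how='inner', on=['name', 'team']),
             period_frames)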
