0

Just an example... I have a 5M row dataframe full of runners and a log of how many miles they ran on any given day. If a runner doesn't run on a day, there is no 0 entry.

Runner Day Miles
A 1 3
A 3 4
A 8 4
A 9 2

I'm trying to get a field for how much they ran in the last 7 calendar days (not including the current row), as well as the next 7 calendar days (including current row).

Runner Day Miles Last 7 Next 7
A 1 3 null 7
A 3 4 3 10
A 8 4 7 6
A 9 2 10 2

I think can get the answer for a single row.

Next 7 for row 1 for example...

df.loc[(df['Runner'] == 'A') & (df['day'] >= 1) & (df['day'] < 1+7)]['Miles'].sum()

I'm just unsure how to apply this type of thing to every row in an efficient way. The methods I've seen so far only utilize data from the same row. Any help?

Owen Stout
  • 11
  • 1
  • Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. – Community Feb 26 '22 at 09:46

1 Answers1

0

you may use rolling sum.

Let's say the maximum day number is 20

max_day = 20

And these are the lists of all runners and all days

import numpy as np
runners = df['Runner'].unique()
days = np.arange(1, max_day)

Then we can get a complete list of runner and day combinations

all_combos = [(r, d) for r in runners for d in days]

In the 1st line of the following 4-line code, we fill the zeros back, then in the 2nd we group the data by runner and in the 3rd we apply rolling sum on Miles requiring at least only 1 record to do a sum. You may adjust this parameter. The last line takes care of the format.

rollingsum = df.set_index(['Runner', 'Day']).reindex(all_combos).fillna(0)\
    .groupby('Runner')\
    ['Miles'].rolling(7, min_periods=1).sum()\
    .droplevel(1).rename('rs').reset_index()

We only need to do 7-day sum once, and shift the records for 'last-7-day' and 'next-7-day' by assigning two new, shifted Day columns to replace the original one, and merge the results.

df.merge(
    rollingsum.assign(Day = rollingsum['Day']+1).rename(columns={'rs': 'last_7_days'}),
    on = ['Runner', 'Day'],
    how = 'left'
).merge(
    rollingsum.assign(Day = rollingsum['Day']-6).rename(columns={'rs': 'next_7_days'}),
    on = ['Runner', 'Day'],
    how = 'left'
)

  Runner  Day  Miles  last_7_days  next_7_days
0      A    1      3          NaN          7.0
1      A    3      4          3.0         10.0
2      A    8      4          7.0          6.0
3      A    9      2          8.0          2.0
Raymond Kwok
  • 2,461
  • 2
  • 9
  • 11