4

I need to propagate a calculation (for example a delay) along pandas dataframe rows.

I found a solution which uses .iterrows() method and is very slow, so I was wondering if there's a vectorized solution for this problem since my data is huge.

Here is my approach:

import pandas as pd
import numpy as np
df = pd.DataFrame(index = ['task_1', 'task_2', 'task_3', 'task_4', 'task_5'], columns=['start_time', 'end_time'], data = [[1,2], [3,4], [6,7], [7,8], [10,11] ] )

# set start delay on task 2
start_delay_on_task_2 = 3
df.loc['task_2', 'start_delay'] = start_delay_on_task_2
df['start_delay'].fillna(0, inplace=True)

# compute buffer between tasks
df['buffer_to_next_task'] = df['start_time'].shift(-1) - df['end_time']

here is the content of df:

        start_time  end_time
task_1  1           2
task_2  3           4
task_3  6           7
task_4  7           8
task_5  10          11

and now the worst code ever to compute the overall delay

df['overall_start_delay'] = df['start_delay']
overall_start_delay_idx = df.columns.get_loc('overall_start_delay')
start_delay_idx = df.columns.get_loc('start_delay')
buffer_to_next_task_idx = df.columns.get_loc('buffer_to_next_task')
for i in range(len(df)):
    overall_delay = None
    if list(df.index)[i] <= 'task_2':
        overall_delay = df.iloc[i, start_delay_idx]
    else:
        overall_delay = max(0, df.iloc[i-1, overall_start_delay_idx] - df.iloc[i-1, buffer_to_next_task_idx])
    df.iloc[i, overall_start_delay_idx] = overall_delay

and here the desired result

         start_time end_time start_delay    buffer_to_next_task overall_start_delay
task_1   1          2        0.0            1.0                 0.0
task_2   3          4        3.0            2.0                 3.0
task_3   6          7        0.0            0.0                 1.0
task_4   7          8        0.0            2.0                 1.0
task_5   10         11       0.0            NaN                 0.0

any suggestion about making this code vectorized and avoid the for loop?

Mehdi
  • 717
  • 1
  • 7
  • 21
user1403546
  • 1,680
  • 4
  • 22
  • 43

1 Answers1

1

This is a solution for one delay:

total_delays = df.start_delay.cumsum()
(total_delays
 .sub(df.buffer_to_next_task
      .where(total_delays.gt(0),0)
      .cumsum().shift(fill_value=0)
     )
   .clip(lower=0)
)

Output:

task_1    0.0
task_2    3.0
task_3    1.0
task_4    1.0
task_5    0.0
dtype: float64
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
  • brilliant! thanks! would this work also for multiple non-zero start_delays? – user1403546 Feb 27 '20 at 12:20
  • 1
    No. And I doubt there would be a vectorized approach for multiple non-zero `start_delay`. Basically, you want to reset cummulative sum of `buffer_to_next_task` whenever it exceeds some thresholds. There are several questions on SO for even a fixed threshold, and the answer is always a `for` loop. And to be fair `O(n)` isn't that bad for `n ~ 1e6`. – Quang Hoang Feb 27 '20 at 12:53