
The title may not be the most informative.

I have the following working code that I want to vectorize (no for loops) using native pandas.
Basically, for each row it should return the row's cumulative offset from 0, given the length of each segment and the row's relative offset within that segment.

import pandas as pd
import numpy as np

df = pd.DataFrame({"id":     [0, 1,  2,  2,  2,  3,  3,  4,  5,  6,  6,   7,   9],  # notice no 8
                   "length": [0, 10, 20, 20, 20, 30, 30, 40, 50, 60, 60,  70,  90],
                   "offset": [0, 0,  1,  3,  4,  0,  7,  0,  0,  0,  1,   0,   0]})


result = np.zeros((len(df),))  # cumulative start of each row's segment
current_abs = df.loc[0, "id"]
for i in range(1, len(df)):
    if current_abs == df.loc[i, "id"]:
        # same segment: the start is unchanged
        result[i] = result[i - 1]
    else:
        # new segment: advance the start by this segment's length
        current_abs = df.loc[i, "id"]
        result[i] = result[i - 1] + df.loc[i, "length"]

df["offset_from_start"] = result + df["offset"]

print(df)
    id  length  offset  offset_from_start
0    0       0       0                  0
1    1      10       0                 10
2    2      20       1                 31
3    2      20       3                 33
4    2      20       4                 34
5    3      30       0                 60
6    3      30       7                 67
7    4      40       0                100
8    5      50       0                150
9    6      60       0                210
10   6      60       1                211
11   7      70       0                280
12   9      90       0                370

This seems like a fancy cumsum operation, but I don't know how to do it efficiently.

Gulzar

2 Answers


Let's try mask on the duplicated ids, then cumsum:

df['offset_from_start'] = (df['length'].mask(df.duplicated('id'),0)
                                       .cumsum() + df['offset']
                          )

Output:

    id  length  offset  offset_from_start
0    0       0       0                  0
1    1      10       0                 10
2    2      20       1                 31
3    2      20       3                 33
4    2      20       4                 34
5    3      30       0                 60
6    3      30       7                 67
7    4      40       0                100
8    5      50       0                150
9    6      60       0                210
10   6      60       1                211
11   7      70       0                280
12   9      90       0                370
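To see why this works, here is a breakdown of the intermediate steps on the same frame (added for illustration, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({"id":     [0, 1, 2, 2, 2, 3, 3, 4, 5, 6, 6, 7, 9],
                   "length": [0, 10, 20, 20, 20, 30, 30, 40, 50, 60, 60, 70, 90],
                   "offset": [0, 0, 1, 3, 4, 0, 7, 0, 0, 0, 1, 0, 0]})

# Rows whose id has already been seen (keep='first' is the default)
dup = df.duplicated('id')

# Zero out the length on repeated rows, so each segment is counted once
once = df['length'].mask(dup, 0)

# Running total of segment lengths = absolute start of each row's segment
start = once.cumsum()

print(pd.DataFrame({'dup': dup, 'once': once, 'start': start}))
```

Adding df['offset'] to start then gives offset_from_start.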
Quang Hoang

Another approach, same principle:

df['offset_from_start'] = (~df['id'].duplicated() * df['length']).cumsum() + df['offset']
print(df)

Output

    id  length  offset  offset_from_start
0    0       0       0                  0
1    1      10       0                 10
2    2      20       1                 31
3    2      20       3                 33
4    2      20       4                 34
5    3      30       0                 60
6    3      30       7                 67
7    4      40       0                100
8    5      50       0                150
9    6      60       0                210
10   6      60       1                211
11   7      70       0                280
12   9      90       0                370
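The timings below refer to fun_dani_duplicated and fun_quang_hoang, whose definitions aren't shown in the post; presumably they are thin wrappers around the two answers, roughly like this sketch:

```python
import pandas as pd

def fun_quang_hoang(df):
    # first answer: mask lengths of repeated ids to 0, then cumsum
    df['offset_from_start'] = (df['length'].mask(df.duplicated('id'), 0)
                                           .cumsum() + df['offset'])
    return df

def fun_dani_duplicated(df):
    # this answer: multiply length by the "first occurrence" boolean, then cumsum
    df['offset_from_start'] = (~df['id'].duplicated() * df['length']).cumsum() + df['offset']
    return df

df2 = pd.DataFrame({"id":     [0, 1, 2, 2, 2, 3, 3, 4, 5, 6, 6, 7, 9],
                    "length": [0, 10, 20, 20, 20, 30, 30, 40, 50, 60, 60, 70, 90],
                    "offset": [0, 0, 1, 3, 4, 0, 7, 0, 0, 0, 1, 0, 0]})
df3 = df2.copy()

fun_quang_hoang(df2)
fun_dani_duplicated(df3)
```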

Below are the timings of each approach:

%timeit fun_dani_duplicated(df2)
647 µs ± 49.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit fun_quang_hoang(df3)
1.31 ms ± 264 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
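As a sanity check (added for illustration), both one-liners reproduce the question's loop exactly:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"id":     [0, 1, 2, 2, 2, 3, 3, 4, 5, 6, 6, 7, 9],
                   "length": [0, 10, 20, 20, 20, 30, 30, 40, 50, 60, 60, 70, 90],
                   "offset": [0, 0, 1, 3, 4, 0, 7, 0, 0, 0, 1, 0, 0]})

# Original loop from the question
result = np.zeros(len(df))
current_abs = df.loc[0, "id"]
for i in range(1, len(df)):
    if current_abs == df.loc[i, "id"]:
        result[i] = result[i - 1]
    else:
        current_abs = df.loc[i, "id"]
        result[i] = result[i - 1] + df.loc[i, "length"]
loop = result + df["offset"]

# The two vectorized answers
masked = df['length'].mask(df.duplicated('id'), 0).cumsum() + df['offset']
mult = (~df['id'].duplicated() * df['length']).cumsum() + df['offset']

assert (loop == masked).all() and (loop == mult).all()
```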
Dani Mesejo
  • can you explain why this method is faster? Will it also be faster for large dfs? – Gulzar Dec 02 '20 at 21:30
  • @Gulzar my guess is that there is some hidden complexity in the mask method. I cannot guarantee that is going to be faster for larger dfs. – Dani Mesejo Dec 02 '20 at 21:37