
The output column below is what I'm trying to calculate, and the diffs column explains the differences that are summed to produce each output value.

+------------+--------+-------------+
|       date | output |    diffs    |
+------------+--------+-------------+
| 01/01/2000 |        |             |
| 10/01/2000 |      9 | [9]         |
| 20/01/2000 |     29 | [10, 19]    |
| 25/01/2000 |     44 | [5, 15, 24] |
+------------+--------+-------------+
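To make the target calculation explicit: each row's output is the sum of the day gaps between that row's date and every earlier date. A naive loop-based reference (using the df defined below) would be something like this; it only pins down the logic and isn't the kind of solution I'm after:

# output[k] = sum of the day gaps between date[k] and every earlier date[j]
ref = []
for k in range(len(df)):
    diffs = [(df.date[k] - df.date[j]).days for j in range(k)]
    ref.append(sum(diffs) if diffs else np.nan)
# ref -> [nan, 9, 29, 44]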

I've thought about using rolling and then, within each window, creating a new column of diffs relative to the last record in the window and summing these. However, rolling doesn't seem to be able to anchor the window start at the beginning of the DataFrame. I suppose I could calculate the difference between the minimum and maximum dates and use that as the rolling period, but that seems hacky.

I've also looked at expanding but I couldn't see a way of creating new diffs as the window expanded.

Is there a non-loop, hopefully vectorisable, solution to this?

Here's the DataFrame:

import datetime as dt

import pandas as pd
import numpy as np


df = pd.DataFrame(
    {
        'date': (
            dt.datetime(2000, 1, 1), dt.datetime(2000, 1, 10),
            dt.datetime(2000, 1, 20), dt.datetime(2000, 1, 25),
        ),
        'output': (np.nan, 9, 29, 44),
    }
)
Jossy

2 Answers


If you're only after the output column, try:

# gap to the previous date, in days
datediff = df.date.diff() / pd.Timedelta('1D')

df['output'] = (datediff * np.arange(len(df))).cumsum()

Output:

        date  output
0 2000-01-01     NaN
1 2000-01-10     9.0
2 2000-01-20    29.0
3 2000-01-25    44.0

I'll leave it to you to work out the logic behind it.

Quang Hoang
    `datediff = df.date.diff().dt.days` might be less confusing – NassimH Jan 13 '22 at 03:19
  • @Quang Hoang - thanks a million for this. The use of `np.arange` is really clever :) I don't suppose I could trouble you to ask how I would raise a constant to the power of each of the diffs and then sum the total? According to my manual calcs this should return `[NaN, 0.387420, 0.483764, 0.876148]`. My hunch is that this won't be easily possible given the way the `cumsum` works :( – Jossy Jan 13 '22 at 05:48
  • Are you asking about [ewm](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.ewm.html)? – Quang Hoang Jan 13 '22 at 05:55
  • Hey. I'm going to have to be a little careful here because I'm not even close to being a mathematician :) I 'think' this is the sort of thing I'm looking for but because I'm looking at a sum then ewm won't work for me. However, I have found a post that has given me a lead! https://stackoverflow.com/questions/33294489/pandas-exponentially-decaying-sum-with-variable-weights. Thanks for all your help! – Jossy Jan 13 '22 at 15:32
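As a side note on the follow-up question in the comment thread above: the values quoted there appear to correspond to a constant of 0.9 raised to each day gap and then summed. A minimal broadcasting sketch along those lines (the constant 0.9 is inferred from the quoted values and is an assumption, as is the naming):

base = 0.9  # assumed decay constant, inferred from the values quoted in the comment
# gaps[r, c] = days from date[c] to date[r]
gaps = (df.date.values[:, None] - df.date.values) / np.timedelta64(1, 'D')
idx = np.arange(len(df))
decayed = np.where(idx < idx[:, None], base ** gaps, 0).sum(axis=1)  # only strictly earlier dates contribute
decayed[0] = np.nan  # no earlier dates for the first row
# decayed -> [nan, 0.387420..., 0.483764..., 0.876148...]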

Using numpy broadcasting without looping:

i = df.date.dt.day.values   # day-of-month; this works here only because all dates fall within the same month
j = np.arange(len(df))

# for each date (column), sum its differences to all earlier dates (rows); the diagonal contributes 0
df['output'] = np.triu(np.where(j < j[:, None], i, i - i[:, None])).sum(axis=0)
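For what it's worth, a variant of the same broadcasting idea that works from the actual timestamps rather than the day-of-month (so it also holds across month boundaries) could look like the sketch below. Note that with either version the first row comes out as 0.0 rather than NaN:

# gaps[r, c] = days from date[r] to date[c]; the upper triangle keeps only pairs with r <= c
gaps = (df.date.values - df.date.values[:, None]) / np.timedelta64(1, 'D')
df['output'] = np.triu(gaps).sum(axis=0)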