10

Similar to this question Exponential Decay on Python Pandas DataFrame, I would like to quickly compute exponentially decaying sums for some columns in a data frame. However, the rows in the data frame are not evenly spaced in time. So although exponential_sum[i] = column_to_sum[i] + np.exp(-const*(time[i]-time[i-1])) * exponential_sum[i-1], the weight np.exp(...) does not factor out, and it is not obvious to me how to adapt that question's approach while still taking advantage of pandas/numpy vectorization. Is there a vectorized pandas solution to this problem?

To illustrate the desired calculation, here is a sample frame with the exponential moving sum of A stored in Sum using a decay constant of 1:

    time  A       Sum
0   1.00  1  1.000000
1   2.10  3  3.332871
2   2.13 -1  2.234370
3   3.70  7  7.464850
4  10.00  2  2.013708
5  10.20  1  2.648684
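
For reference, the recurrence above written out as a plain (non-vectorized) Python loop reproduces the Sum column; this is only a sketch of the calculation I want to vectorize, using a decay constant of 1:

import numpy as np
import pandas as pd

df = pd.DataFrame({'time': [1, 2.1, 2.13, 3.7, 10, 10.2],
                   'A': [1, 3, -1, 7, 2, 1]})

const = 1  # decay constant
sums = [df.A.iloc[0]]  # exponential_sum[0] = A[0]
for i in range(1, len(df)):
    decay = np.exp(-const * (df.time.iloc[i] - df.time.iloc[i - 1]))
    sums.append(df.A.iloc[i] + decay * sums[-1])
df['Sum'] = sums  # matches the Sum column shown above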
pythonic metaphor
  • can you resample your dataframe so that it is evenly spaced? – maxymoo Oct 23 '15 at 04:31
  • @Alexander I am asking about sums, not averages, though maybe there is an obvious transform – pythonic metaphor Oct 23 '15 at 16:30
  • @Alexander I just read that question more carefully and I don't think it addresses my question, which is how to do the vectorized numpy/pandas calculation. I don't have any issue computing the exponential sums in a Python loop; I'm just doing it on frames large enough that being able to vectorize the calculation matters. – pythonic metaphor Oct 23 '15 at 17:45
  • Can you please provide some sample data? – Alexander Oct 23 '15 at 19:04

2 Answers

6

This question is more complicated than it first appeared. I ended up using numba's jit to compile a generator function to calculate the exponential sums. My end result calculates the exponential sum of 5 million rows in under a second on my computer, which hopefully is fast enough for your needs.

import numpy as np
import pandas as pd

# Initial dataframe.
df = pd.DataFrame({'time': [1, 2.1, 2.13, 3.7, 10, 10.2],
                   'A': [1, 3, -1, 7, 2, 1]})

# Initial decay parameter.
decay_constant = 1

We can define the decay weights as exp(-time_delta * decay_constant), and set its initial value equal to one:

df['weight'] = np.exp(-df.time.diff() * decay_constant)
df.weight.iat[0] = 1

>>> df
   A   time    weight
0  1   1.00  1.000000
1  3   2.10  0.332871
2 -1   2.13  0.970446
3  7   3.70  0.208045
4  2  10.00  0.001836
5  1  10.20  0.818731

Now we'll use jit from numba to optimize a generator function that calculates the exponential sums:

from numba import jit

@jit(nopython=True)
def exponential_sum(A, k):
    total = A[0]
    yield total
    for i in range(1, len(A)):  # xrange in Python 2
        total = total * k[i] + A[i]
        yield total

We'll use the generator to add the values to the dataframe:

df['expSum'] = list(exponential_sum(df.A.values, df.weight.values))

Which produces the desired output:

>>> df
   A   time    weight    expSum
0  1   1.00  1.000000  1.000000
1  3   2.10  0.332871  3.332871
2 -1   2.13  0.970446  2.234370
3  7   3.70  0.208045  7.464850
4  2  10.00  0.001836  2.013708
5  1  10.20  0.818731  2.648684

So let's scale to 5 million rows and check performance:

df = pd.DataFrame({'time': np.random.rand(int(5e6)).cumsum(),
                   'A': np.random.randint(1, 10, int(5e6))})
df['weight'] = np.exp(-df.time.diff() * decay_constant)
df.weight.iat[0] = 1

%%timeit -n 10 
df['expSum'] = list(exponential_sum(df.A.values, df.weight.values))
10 loops, best of 3: 726 ms per loop
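
As a side note (this is a sketch, not part of the answer above, and the exponential_sum_array name is just for illustration), the same recurrence can be compiled as a plain loop that writes into a preallocated array, which skips materializing the generator into a list:

import numpy as np
from numba import jit

@jit(nopython=True)
def exponential_sum_array(A, k):
    # Same recurrence as the generator: total = total * weight + value.
    out = np.empty(len(A))
    total = A[0]
    out[0] = total
    for i in range(1, len(A)):
        total = total * k[i] + A[i]
        out[i] = total
    return out

# Cast A to float so numba infers a single floating-point type for total.
df['expSum'] = exponential_sum_array(df.A.values.astype(np.float64), df.weight.values)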
Alexander
  • I was using Cython for a similar solution, but had been hoping there was a clever use of numpy/scipy that I was missing. It seems the consensus is no. A variation of this answer seems to be the best you can do. – pythonic metaphor Oct 27 '15 at 17:49
0

Expanding on the answer you linked to, I came up with the following method.

First, notice that:

exponential_sum[i] = column_to_sum[i] + 
    np.exp(-const*(time[i]-time[i-1])) * column_to_sum[i-1] + 
    np.exp(-const*(time[i]-time[i-2])) * column_to_sum[i-2] + ...

So the main change to make is in generating the weightspace to match the formula above. I proceeded like this:

import numpy as np
import pandas as pd

time = pd.Series(np.random.rand(10)).cumsum()
weightspace = np.empty((10, 10))
for i in range(len(time)):
    # Row i holds time[j] - time[i] for every j (decay constant of 1).
    weightspace[i] = time - time[i]
weightspace = np.exp(weightspace)

Don't worry about the upper right triangle of the matrix (the entries above the diagonal); they won't be used. By the way, there must be a way of generating the weightspace without a loop.
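
For example (a sketch using NumPy broadcasting, not part of the original code), the same matrix can be built by subtracting the time values arranged as a column from the same values arranged as a row:

# Equivalent to the loop above: weightspace[i, j] = exp(time[j] - time[i])
weightspace = np.exp(time.values[None, :] - time.values[:, None])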

Then a slight change in how you pick the weights from the weightspace in the rolling function:

def rollingsum(array):
    # Pick the row of weights that corresponds to the current window length.
    weights = weightspace[len(array) - 1][:len(array)]
    # Weighted sum (dot product) of the values seen so far.
    return np.dot(array, weights)

Works as expected:

dataset = pd.DataFrame(np.random.rand(10, 3), columns=["A", "B", "C"])
a = dataset.expanding().apply(rollingsum, raw=True)  # pd.expanding_apply(dataset, rollingsum) in older pandas
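
As a quick sanity check (a sketch, not from the original answer), rebuilding weightspace from the question's sample times and reusing rollingsum should reproduce the question's Sum column:

# Hypothetical check against the question's sample data.
time = pd.Series([1, 2.1, 2.13, 3.7, 10, 10.2])
weightspace = np.exp(time.values[None, :] - time.values[:, None])
dataset = pd.DataFrame({'A': [1.0, 3, -1, 7, 2, 1]})
check = dataset.expanding().apply(rollingsum, raw=True)
# check['A'] should be approximately 1.0, 3.332871, 2.234370, 7.464850, 2.013708, 2.648684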
IanS
  • One concern about this solution is that weightspace is now very big: in the solution to the regular case it was linear in the size of the data frame, and now it's quadratic. That makes it problematic for large frames, which are exactly why a vectorized solution is needed. Is that unavoidable? – pythonic metaphor Oct 23 '15 at 17:33
  • Short of an optimized for loop like @Alexander suggested I'm afraid I don't see another way. – IanS Oct 26 '15 at 08:38