I'm trying to efficiently compute a running sum, with exponential decay, of each column of a Pandas DataFrame. The DataFrame contains a daily score for each country in the world. The DataFrame looks like this:

                AF        UK        US
2014-07-01  0.998042  0.595720  0.524698
2014-07-02  0.380649  0.838436  0.355149
2014-07-03  0.306240  0.274755  0.964524
2014-07-04  0.396721  0.836027  0.225848
2014-07-05  0.151291  0.677794  0.603548
2014-07-06  0.558846  0.050535  0.551785
2014-07-07  0.463514  0.552748  0.265537
2014-07-08  0.240282  0.278825  0.116432
2014-07-09  0.309446  0.096573  0.246021
2014-07-10  0.800977  0.583496  0.713893

I'm not sure how to calculate the rolling sum (with decay) without iterating through the dataframe, since I need to know yesterday's score to calculate today's score. But to calculate yesterday's score, I need to know the day before yesterday's score, etc. This is the code that I've been using, but I'd like a more efficient way to go about it.

for j, val in df.iteritems():
    for i, row in enumerate(val):
        df[j].iloc[i] = row + val[i-1]*np.exp(-0.05)
idubs11

1 Answer

You can use the fact that when exponentials multiply, their exponents add, so the recurrence unrolls into a weighted sum. Writing the raw daily score as N(t) and the decayed running sum as S(t):

S(1) = N(1)
S(2) = N(2) + S(1)*exp(-0.05)
S(3) = N(3) + S(2)*exp(-0.05)
     = N(3) + N(2)*exp(-0.05) + N(1)*exp(-0.10)
S(4) = ...and so on
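To make the unrolling concrete (a standalone sanity check, not part of the original answer; the toy values are made up), the closed form can be verified against the recurrence in a few lines of numpy:

```python
import numpy as np

x = np.array([0.5, 0.2, 0.9, 0.4])  # toy daily scores
decay = np.exp(-0.05)

# Recursive definition: S[t] = x[t] + S[t-1] * decay
s_rec = np.zeros_like(x)
s_rec[0] = x[0]
for t in range(1, len(x)):
    s_rec[t] = x[t] + s_rec[t - 1] * decay

# Unrolled form: S[t] = sum over k of x[k] * exp(-0.05 * (t - k))
s_closed = np.array([
    sum(x[k] * np.exp(-0.05 * (t - k)) for k in range(t + 1))
    for t in range(len(x))
])

assert np.allclose(s_rec, s_closed)
```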

This can then be vectorized using numpy:

dataset = pd.DataFrame(np.random.rand(1000,3), columns=["A", "B","C"])

weightspace = np.exp(np.linspace(len(dataset) - 1, 0, num=len(dataset)) * -0.05)
def rollingsum(array):
    # Take the last len(array) weights for the current expanding window
    weights = weightspace[-len(array):]
    # Dot product of the window with the weights gives the decayed sum
    return np.dot(array, weights)


a = pd.expanding_apply(dataset, rollingsum)

pd.expanding_apply applies rollingsum to each expanding window of each column, calling it len(dataset) times per column. np.linspace builds an array of len(dataset) exponents, so each row's weight encodes how many factors of exp(-0.05) separate that row from the current one.
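A note for readers on newer pandas: pd.expanding_apply was removed in later releases. Assuming a reasonably modern pandas (this is a sketch of the equivalent, not the original answer's code), the same computation goes through the Expanding.apply method:

```python
import numpy as np
import pandas as pd

dataset = pd.DataFrame(np.random.rand(1000, 3), columns=["A", "B", "C"])

# Precompute one weight per row: exp(-0.05 * days_from_current_row)
weightspace = np.exp(np.linspace(len(dataset) - 1, 0, num=len(dataset)) * -0.05)

def rollingsum(array):
    # The last len(array) weights line up with the current expanding window
    weights = weightspace[-len(array):]
    return np.dot(array, weights)

# raw=True hands rollingsum a plain numpy array instead of a Series (faster)
a = dataset.expanding().apply(rollingsum, raw=True)
```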

Because it is vectorized, it should be fast:

%timeit a = pd.expanding_apply(dataset, rollingsum)
10 loops, best of 3: 25.5 ms per loop

This compares with (note I'm using python 3 and had to make a change to the behaviour on the first row...):

def multipleApply(df):
    for j, val in df.iteritems():
        for i, row in enumerate(val):
            if i == 0:
                continue  # first row has no previous day to decay
            df[j].iloc[i] = row + val[i-1]*np.exp(-0.05)

This comes out as:

In[68]: %timeit multipleApply(dataset)
1 loops, best of 3: 414 ms per loop
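As an aside not in the original answer: because the weights form a geometric series, the decayed sum can also be computed in O(n) with a plain cumulative sum, with no per-window dot product at all. A sketch (caveat: exp(0.05*t) overflows float64 for very long frames, roughly beyond ~14000 rows):

```python
import numpy as np
import pandas as pd

dataset = pd.DataFrame(np.random.rand(1000, 3), columns=["A", "B", "C"])

# S[t] = sum_k x[k]*exp(-0.05*(t-k)) = exp(-0.05*t) * cumsum(x[k]*exp(0.05*k))
t = np.arange(len(dataset))
scaled = dataset.mul(np.exp(0.05 * t), axis=0)   # pre-scale each row up
result = scaled.cumsum().mul(np.exp(-0.05 * t), axis=0)  # scale back down
```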
undershock
  • @ZJS is it? - I'll update the post with a slight change I had to make to get it working in python 3 – undershock Sep 03 '14 at 18:38
  • `%timeit t(df)` 100 loops, best of 3: 3.96 ms per loop – ZJS Sep 03 '14 at 18:46
  • @ZFS what data are you running this on? – undershock Sep 03 '14 at 18:46
  • edited to take the weightspace out of the function to make it faster – undershock Sep 03 '14 at 18:47
  • No sorry I didn't realize you provided a different dataframe. This is a good solution – ZJS Sep 03 '14 at 18:49
  • @shipt you're a legend. This is exactly what I was looking for. – idubs11 Sep 03 '14 at 20:16
  • @idubs11 - my bad, missed a colon in the weightspace slice. Fixed now, and returns: `A B C 2014-07-01 0.000000 1.000000 0.000000 2014-07-02 0.000000 0.951182 0.000000 2014-07-03 0.000000 0.904747 0.000000 2014-07-04 0.000000 0.860579 0.000000 2014-07-05 0.000000 0.818567 0.000000 2014-07-06 0.000000 0.778606 1.000000 2014-07-07 0.000000 0.740596 0.951182 2014-07-08 0.000000 0.704441 0.904747 2014-07-09 1.000000 0.670052 0.860579 2014-07-10 0.951182 0.637341 0.818567` – undershock Sep 03 '14 at 21:38
  • @shipt perfect. Accepting answer. – idubs11 Sep 03 '14 at 21:43