
I have the code below, which calculates the volume-weighted average price (VWAP) in three lines of Pandas:

import numpy as np
import pandas as pd
from pandas.io.data import DataReader  # note: removed in later pandas; see pandas_datareader
import datetime as dt

df = DataReader(['AAPL'], 'yahoo', dt.datetime(2013, 12, 30), dt.datetime(2014, 12, 30))
df['Cum_Vol'] = df['Volume'].cumsum()
df['Cum_Vol_Price'] = (df['Volume'] * (df['High'] + df['Low'] + df['Close'] ) /3).cumsum()
df['VWAP'] = df['Cum_Vol_Price'] / df['Cum_Vol']

As an exercise, I am trying to find a way to code this without using cumsum(), ideally a solution that produces the VWAP column in one pass. I have tried the line below, using .apply(). The logic is there, but the problem is that I am not able to store the value computed for row n in order to use it in row n+1. How do you approach this in pandas: just use an external tuple or dictionary for temporary storage of the cumulative values?

df['Cum_Vol']= np.nan
df['Cum_Vol_Price'] = np.nan
# calculate running cumulatives by apply - assume df row index is 0 to N
df['Cum_Vol'] = df.apply(lambda x: df.iloc[x.name-1]['Cum_Vol'] + x['Volume'] if int(x.name)>0 else x['Volume'], axis=1)
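As an illustration of the "external dictionary" idea, running state can be kept in a mutable object that the applied function updates as it walks the rows. This is a hedged sketch, not the question's actual data: the small frame, the `state` dict, and the `running_vwap` name are all mine.

```python
import pandas as pd

# hypothetical stand-in for the Yahoo data
df = pd.DataFrame({'Volume': [100., 200., 300.],
                   'High':   [11., 12., 13.],
                   'Low':    [9., 10., 11.],
                   'Close':  [10., 11., 12.]})

# external storage for the running totals, updated row by row
state = {'cum_vol': 0.0, 'cum_vol_price': 0.0}

def running_vwap(row):
    state['cum_vol'] += row['Volume']
    state['cum_vol_price'] += row['Volume'] * (row['High'] + row['Low'] + row['Close']) / 3
    return state['cum_vol_price'] / state['cum_vol']

df['VWAP'] = df.apply(running_vwap, axis=1)
```

Note this still pays the per-row Python overhead of `.apply()`, so it only demonstrates the bookkeeping, not a speedup.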

Is there a one-pass solution to the above problem?

EDIT:

My main motivation is to understand what is happening under the hood, so this is mainly an exercise rather than a practical need. I believe each cumsum on a Series of size N runs in O(N) time. So I was wondering: instead of running two separate cumsum's, can we calculate both in one pass, along the lines of [this one-pass standard deviation](http://www.strchr.com/standard_deviation_in_one_pass)? I would be very happy to accept an answer explaining this, rather than working code.
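In the spirit of that linked one-pass standard deviation, both running sums can be carried through a single loop. A minimal sketch (my own function name and signature, not working code against the question's DataFrame):

```python
import numpy as np

def vwap_one_pass(volume, high, low, close):
    """Compute cumulative VWAP in a single pass, maintaining both running totals together."""
    out = np.empty(len(volume))
    cum_vol = 0.0
    cum_vol_price = 0.0
    for i in range(len(volume)):
        cum_vol += volume[i]
        cum_vol_price += volume[i] * (high[i] + low[i] + close[i]) / 3
        out[i] = cum_vol_price / cum_vol
    return out
```

At interpreted-Python speed this loop will lose to two vectorised cumsums despite touching the data only once; the numba answers below compile essentially this loop.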

JohnE
Zhubarb
  • Using apply will be substantially slower than your first method by the way – EdChum Mar 27 '15 at 10:47
  • @EdChum, thank you. Do you have an alternative solution without using `cumsum`? – Zhubarb Mar 27 '15 at 10:59
  • Not at the moment; cumsum is a vectorised method, and apply will not beat it. – EdChum Mar 27 '15 at 11:01
  • @JohnE, my main motivation is to understand what is happening under the hood. So, it is mainly for exercise than any valid reason. I believe each `cumsum` on a Series of size N has time complexity N. So I was wondering instead of running two separate `cumsum`'s, can we calculate both in one pass - along the lines of [this](http://www.strchr.com/standard_deviation_in_one_pass). Very happy to accept an answer to this - rather than working code. – Zhubarb Mar 27 '15 at 12:43

2 Answers


Quick Edit: Just wanted to thank John for the original post :)

You can get even faster results by @jit-ing numpy's version:

from numba import jit

@jit
def np_vwap():
    # v, h, l are the numpy arrays (volume, high, low) defined in the other answer
    return np.cumsum(v * (h + l) / 2) / np.cumsum(v)

This got me 50.9 µs per loop, as opposed to 74.5 µs per loop using the vwap version from the other answer.
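A small variant of the same idea that passes the arrays as arguments instead of reading globals, which tends to be friendlier for numba compilation. This is my sketch (the `np_vwap_args` name is mine), with a fallback so it still runs where numba is not installed:

```python
import numpy as np

try:
    from numba import jit
except ImportError:
    # fallback: a no-op decorator so the function still works without numba
    def jit(f):
        return f

@jit
def np_vwap_args(v, h, l):
    # same formula as np_vwap, but the arrays arrive as parameters
    return np.cumsum(v * (h + l) / 2) / np.cumsum(v)
```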

Zoe
Ran Aroussi
  • Thanks for the improvement! I just timed it myself and didn't get quite as big a speed boost, but your way is definitely faster. I think numba has gotten better at combining with numpy over time. – JohnE Mar 05 '16 at 17:37

Getting into one pass vs. one line starts to get a little semantic. How about this for a distinction: you can do it in one line of pandas, one line of numpy, or several lines of numba.

import numpy as np
import pandas as pd
from numba import jit

df = pd.DataFrame(np.random.randn(10000, 3), columns=['v', 'h', 'l'])

df['vwap_pandas'] = (df.v*(df.h+df.l)/2).cumsum() / df.v.cumsum()

@jit
def vwap():
    tmp1 = np.zeros_like(v)
    tmp2 = np.zeros_like(v)
    for i in range(len(v)):
        # at i == 0, tmp[i-1] reads tmp[-1], which is still 0., so the recurrence starts cleanly
        tmp1[i] = tmp1[i-1] + v[i] * (h[i] + l[i]) / 2.
        tmp2[i] = tmp2[i-1] + v[i]
    return tmp1 / tmp2

v = df.v.values
h = df.h.values
l = df.l.values

df['vwap_numpy'] = np.cumsum(v*(h+l)/2) / np.cumsum(v)

df['vwap_numba'] = vwap()

Timings:

%timeit (df.v*(df.h+df.l)/2).cumsum() / df.v.cumsum()  # pandas
1000 loops, best of 3: 829 µs per loop

%timeit np.cumsum(v*(h+l)/2) / np.cumsum(v)            # numpy
10000 loops, best of 3: 165 µs per loop

%timeit vwap()                                         # numba
10000 loops, best of 3: 87.4 µs per loop
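For what it's worth, a quick sanity check of my own that the pandas and numpy one-liners agree. Here I use positive volumes (unlike the answer's `randn` data) since cumulative volume sits in the denominator:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((1000, 3)), columns=['v', 'h', 'l'])
df['v'] = df['v'].abs() + 0.1  # keep volumes strictly positive

pandas_vwap = ((df.v * (df.h + df.l) / 2).cumsum() / df.v.cumsum()).values
v, h, l = df.v.values, df.h.values, df.l.values
numpy_vwap = np.cumsum(v * (h + l) / 2) / np.cumsum(v)

assert np.allclose(pandas_vwap, numpy_vwap)
```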
JohnE