I have a DataFrame with 1,500,000 rows of one-minute-level stock market data (Open, High, Low, Close, Volume) that I bought from QuantQuote.com. I'm trying to run some home-made backtests of stock market trading strategies. Straight Python code to process the transactions is too slow, so I wanted to try numba to speed things up. The trouble is that numba doesn't seem to work with pandas functions.
Google searches uncover surprisingly little information about using numba with pandas, which makes me wonder whether I'm making a mistake by considering it.
My setup is Numba 0.13.0-1, Pandas 0.13.1-1, Windows 7, MS VS2013 with PTVS, Python 2.7, and Enthought Canopy.
My existing Python+Pandas inner loop has the following general structure:
- Compute "indicator" columns, (with pd.ewma, pd.rolling_max, pd.rolling_min etc.)
- Compute "event" columns for predetermined events such as moving average crosses, new highs etc.
I then use DataFrame.iterrows to process the DataFrame.
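For concreteness, the structure looks roughly like this. It's a minimal sketch with made-up parameters and entry/exit rules, not my real strategy; pd.ewma / pd.rolling_max were the pandas 0.13 spellings, and the equivalent .ewm() / .rolling() methods are used below:

```python
import numpy as np
import pandas as pd

# Synthetic one-minute data standing in for the real 1.5M-row dataset.
n = 1000
idx = pd.date_range("2014-01-01 09:30", periods=n, freq="min")
rng = np.random.default_rng(0)
close = 100 + np.cumsum(rng.normal(0, 0.05, n))
df = pd.DataFrame({"Close": close}, index=idx)

# Step 1: "indicator" columns (modern equivalents of pd.ewma / pd.rolling_max).
df["ema_fast"] = df["Close"].ewm(span=12).mean()
df["ema_slow"] = df["Close"].ewm(span=26).mean()
df["roll_max"] = df["Close"].rolling(30).max()

# Step 2: "event" columns, e.g. a moving-average cross and a new 30-bar high.
df["cross_up"] = (df["ema_fast"] > df["ema_slow"]) & (
    df["ema_fast"].shift(1) <= df["ema_slow"].shift(1)
)
df["new_high"] = df["Close"] >= df["roll_max"]

# Step 3: the slow part -- iterating over rows to simulate transactions.
position = 0
trades = 0
for ts, row in df.iterrows():
    if row["cross_up"] and position == 0:
        position = 1          # hypothetical entry rule
        trades += 1
    elif not row["new_high"] and position == 1:
        position = 0          # hypothetical exit rule
```

Steps 1 and 2 are fast because they're vectorized; step 3 is where all the time goes.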
I've tried various optimizations, but it's still not as fast as I would like, and the optimizations are introducing bugs.
I want to use numba to process the rows. Are there preferred methods of approaching this?
Because my DataFrame is really just a rectangle of floats, I was considering using something like DataFrame.values to get at the underlying array, and then writing a series of numba-compiled functions that access the rows directly. But that strips off the timestamps, and I don't think the operation is reversible. I'm also not sure whether the array I get from DataFrame.values is guaranteed not to be a copy of the data.
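What I had in mind is roughly the following sketch. The column choices and the trading logic are purely illustrative, and the try/except fallback is only there so the snippet runs even where numba isn't installed; the key idea is that the compiled function sees only plain ndarrays, while the timestamps are kept aside in df.index:

```python
import numpy as np
import pandas as pd

try:
    from numba import jit              # compile the plain-ndarray inner loop
except ImportError:                    # fallback: run as ordinary Python
    def jit(*args, **kwargs):
        def wrap(f):
            return f
        return wrap

@jit(nopython=True)
def run_backtest(close, cross_up, new_high):
    """Pure-ndarray inner loop: no pandas objects, so numba can compile it."""
    position = 0
    trades = 0
    for i in range(close.shape[0]):
        if cross_up[i] and position == 0:
            position = 1               # hypothetical entry rule
            trades += 1
        elif not new_high[i] and position == 1:
            position = 0               # hypothetical exit rule
    return trades

# Tiny example frame; the index is kept separately, so no timestamps are lost.
df = pd.DataFrame(
    {"Close": [100.0, 101.0, 102.0, 101.5, 103.0]},
    index=pd.date_range("2014-01-01 09:30", periods=5, freq="min"),
)
cross_up = np.array([False, True, False, False, False])
new_high = np.array([True, True, True, False, True])

trades = run_backtest(df["Close"].to_numpy(), cross_up, new_high)
# trades == 1: one entry at the cross, one exit when the new-high streak ends
```

Is this the right general shape, or is there a better-supported pattern for mixing numba with pandas?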
Any help is greatly appreciated.