
I have a DataFrame with 1,500,000 rows of one-minute stock market data (Open, High, Low, Close, Volume) that I bought from QuantQuote.com. I'm trying to run some home-made backtests of stock market trading strategies. Straight Python code to process the transactions is too slow, so I wanted to try numba to speed things up. The trouble is that numba doesn't seem to work with pandas functions.

Google searches uncover a surprising lack of information about using numba with pandas, which makes me wonder if I'm making a mistake by considering it.

My setup is Numba 0.13.0-1 and Pandas 0.13.1-1 on Windows 7, with MS VS2013 + PTVS, Python 2.7, and Enthought Canopy.

My existing Python+Pandas inner loop has the following general structure:

  • Compute "indicator" columns, (with pd.ewma, pd.rolling_max, pd.rolling_min etc.)
  • Compute "event" columns for predetermined events such as moving average crosses, new highs etc.

I then use DataFrame.iterrows to process the DataFrame.

I've tried various optimizations, but it's still not as fast as I would like, and the optimizations are introducing bugs.

I want to use numba to process the rows. Are there preferred methods of approaching this?

Because my DataFrame is really just a rectangle of floats, I was considering using something like DataFrame.values to get at the data and then writing a series of numba functions that access the rows. But that strips out the timestamps, and I don't think it's a reversible operation. I'm also not sure whether the matrix I get from DataFrame.values is guaranteed not to be a copy of the data.
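To make the question concrete, here is roughly the round trip I have in mind (a sketch with made-up toy data; whether .values is a view or a copy is exactly what I'm unsure about):

```python
import numpy as np
import pandas as pd

# Toy stand-in for my minute-bar data: a homogeneous rectangle of floats.
idx = pd.date_range("2014-01-02 09:30", periods=5, freq="min")
df = pd.DataFrame(np.random.rand(5, 5), index=idx,
                  columns=["Open", "High", "Low", "Close", "Volume"])

values = df.values  # plain 2-D float64 ndarray, timestamps stripped off

# ... numba-compiled row processing would happen on `values` here ...

# Re-attach the timestamps and column names afterwards:
out = pd.DataFrame(values, index=df.index, columns=df.columns)
```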

Any help is greatly appreciated.

JasonEdinburgh
  • you should post your code as a separate question and see if folks here can help you vectorize. IMHO not much reason to actually use numba as pandas can do a lot more with vectorizing (no loops). pandas uses cython under the hood so most operations are optimized. get your code correct, then optimize. To answer your question, you *can* use ``df.values`` to get the underlying numpy array and process if you want, but you will then be responsible to translate back to a DataFrame (if you want). – Jeff May 13 '14 at 11:49
  • It isn't possible to process the data in an entirely vectorized way. To analyse my results I produce very large PNG files that look like this: https://www.dropbox.com/s/p66mvp54dymi7hv/TABLE_AAPL.TXT.png (I plot these quickly by torturing myself with Chaco). As you can see, there are a LOT of columns produced to store intermediate results. The processing of a single trade is now a function 80 lines long, and my typical inner loop for a strategy is around 350 lines of non-repetitive Python+Pandas. It's hard to avoid bugs in this situation; it's bloated by optimizations. I'll upload a copy. – JasonEdinburgh May 13 '14 at 12:46
  • @Jeff http://pastebin.com/AaifFYnk – JasonEdinburgh May 13 '14 at 12:59
  • ok, that all looks vectorizable (in general only a recurrent relation is NOT vectorizable directly, though sometimes they are possible, e.g. via shift/diff), but I understand your conundrum. You cannot really mix numba with pandas; try using df.values. – Jeff May 13 '14 at 13:04
  • @Jeff unfortunately it's not vectorizable, at least not in the meaning I understand by that word. As soon as you have fixed-price stop losses that are placed at run time, vectorization falls down: you cannot know in advance what price the stop loss would have, or even that you will have one. Also, there is a current state (accounting) which must be maintained as the algorithm progresses. I forget the term, Markov chain? I'm not a mathematician. Reading the wiki for recurrence relation, that sounds like what I'm trying to describe. – JasonEdinburgh May 13 '14 at 14:24
  • 1
    ok...then numba might be a good option for you. (or simply could write in cython), see here: http://pandas.pydata.org/pandas-docs/stable/enhancingperf.html – Jeff May 13 '14 at 14:36

1 Answer


Numba is a NumPy-aware just-in-time compiler. You can pass NumPy arrays as parameters to your Numba-compiled functions, but not Pandas Series.

Your only option, still as of 2017-06-27, is to use the Pandas series values, which are actually NumPy arrays.

Also, you ask if the values are "guaranteed to not be a copy of the data". They are not a copy; you can verify that yourself:

import pandas

df = pandas.DataFrame([0, 1, 2, 3])
df.values[2] = 8   # write through the values array...
print(df)          # ...and row 2 of the DataFrame now shows 8

In my opinion, Numba is a great (if not the best) approach to processing market data if you want to stick to Python only. If you want to see real performance gains, make sure to use @numba.jit(nopython=True). Note that nopython mode will not let you use dictionaries and other Python types inside the JIT-compiled function, but it will make the code run much faster.

Note that some of the indicators you are working with may already have an efficient implementation in Pandas, so consider pre-computing them with Pandas and then passing the values (the NumPy arrays) to your Numba backtesting function.
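A sketch of that split (toy data; I use the newer .ewm()/.rolling() spellings here, which on Pandas 0.13 correspond to pd.ewma and pd.rolling_max, and the @numba.jit decorator is omitted so the snippet runs without Numba installed):

```python
import pandas as pd

close = pd.Series([10.0, 11.0, 10.5, 12.0, 11.5, 13.0])

# Let Pandas do the indicator math (Cython under the hood)...
ema = close.ewm(span=3).mean()    # pd.ewma(close, span=3) on 0.13
roll_hi = close.rolling(3).max()  # pd.rolling_max(close, 3) on 0.13

# ...then hand plain NumPy arrays to the inner loop
# (imagine @numba.jit(nopython=True) on top of this function).
def backtest(close_v, ema_v, hi_v):
    signals = 0
    for i in range(close_v.shape[0]):
        if close_v[i] == hi_v[i]:  # new rolling high -> a "buy" event
            signals += 1
    return signals

n = backtest(close.values, ema.values, roll_hi.values)
```

This keeps the indicator code short and tested, while the stateful trade logic lives in one compiled loop.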

Peque
  • Latest on compatibility of pandas with numba should be found here: https://pandas.pydata.org/pandas-docs/stable/enhancingperf.html#using-numba – feetwet Mar 10 '18 at 19:29