
Here's my first Cython program: code to price European options on futures (Black-Scholes without a dividend). It runs in 3.5 s on 10M options, versus 3.25 s for the straight NumPy code I've posted further down. Can anyone point out why my Cython code is slower, for example because I used a loop instead of vectorizing the call? (I'm not sure how to do that in C, though the generated Cython code does appear to vectorize it.) Can I use `nogil` and OpenMP around this loop, even though the variables are passed in from NumPy arrays? (A sketch of what I'm attempting is included after the .pyx code below.) The simple examples in Cython's documentation don't compile correctly for me with `cython.parallel` `prange` on the loop: http://docs.cython.org/src/userguide/parallelism.html#module-cython.parallel. Feedback is greatly appreciated, and apologies for a somewhat open-ended question. Others are free to use this code as a starting point, since it already runs faster than other C and Python implementations I've seen profiled online. Here it is:

Save this as CyBlack.pyx and compile it (note that all inputs are float64 except Black_callput, which is int64: 1 for a call, -1 for a put). After compiling, import it with `from CyBlack.CyBlack import CyBlack`:

from numpy cimport ndarray
cimport numpy as np
cimport cython

cdef extern from "math.h":
    double exp(double)
    double sqrt(double)
    double log(double)
    double erf(double)

cdef double std_norm_cdf(double x):
    return 0.5*(1+erf(x/sqrt(2.0)))

@cython.boundscheck(False)
cpdef CyBlack(ndarray[np.float64_t, ndim=1] BlackPnL, ndarray[np.float64_t, ndim=1] Black_S, ndarray[np.float64_t, ndim=1] Black_Texpiry, ndarray[np.float64_t, ndim=1] Black_strike, ndarray[np.float64_t, ndim=1] Black_volatility, ndarray[np.float64_t, ndim=1] Black_IR, ndarray[np.int64_t, ndim=1] Black_callput):

    cdef Py_ssize_t i
    cdef Py_ssize_t N = BlackPnL.shape[0]
    cdef double d1, d2


    for i in range(N):
        d1 = ((log(Black_S[i] / Black_strike[i]) + Black_Texpiry[i] * Black_volatility[i] **2 / 2)) / (Black_volatility[i] * sqrt(Black_Texpiry[i]))
        d2 = d1 - Black_volatility[i] * sqrt(Black_Texpiry[i])
        BlackPnL[i] = exp(-Black_IR[i] * Black_Texpiry[i]) * (Black_callput[i] * Black_S[i] * std_norm_cdf(Black_callput[i] * d1) - Black_callput[i] * Black_strike[i] * std_norm_cdf(Black_callput[i] * d2)) 

    return BlackPnL
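
On the nogil/OpenMP part of the question, this is roughly the `prange` variant I've been attempting. It is an untested sketch, not working code: typed memoryviews instead of `ndarray` arguments so the loop body holds no Python objects, a `nogil` helper, and the `/openmp` compile flag already in my setup.py. Names like `CyBlackParallel` and `std_norm_cdf_p` are just placeholders:

cimport cython
from cython.parallel import prange

cdef extern from "math.h" nogil:
    double exp(double)
    double sqrt(double)
    double log(double)
    double erf(double)

cdef inline double std_norm_cdf_p(double x) nogil:
    return 0.5*(1 + erf(x/sqrt(2.0)))

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
cpdef CyBlackParallel(double[:] BlackPnL, double[:] Black_S, double[:] Black_Texpiry,
                      double[:] Black_strike, double[:] Black_volatility,
                      double[:] Black_IR, long long[:] Black_callput):
    # long long[:] matches the int64 Black_callput array on 64-bit Windows.
    cdef Py_ssize_t i
    cdef Py_ssize_t N = BlackPnL.shape[0]
    cdef double d1, d2

    # prange releases the GIL; d1 and d2 are assigned inside the loop, so
    # Cython treats them as thread-private.
    for i in prange(N, nogil=True):
        d1 = (log(Black_S[i] / Black_strike[i]) + Black_Texpiry[i] * Black_volatility[i]**2 / 2) / (Black_volatility[i] * sqrt(Black_Texpiry[i]))
        d2 = d1 - Black_volatility[i] * sqrt(Black_Texpiry[i])
        BlackPnL[i] = exp(-Black_IR[i] * Black_Texpiry[i]) * (Black_callput[i] * Black_S[i] * std_norm_cdf_p(Black_callput[i] * d1) - Black_callput[i] * Black_strike[i] * std_norm_cdf_p(Black_callput[i] * d2))

    return BlackPnL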

Here is the setup.py so others can build this by typing `python setup.py build_ext --inplace`; built with VS2015 for Python 3.5, 64-bit Windows.

from setuptools import setup
from setuptools import Extension
from Cython.Distutils import build_ext
import numpy as np

ext_modules = [Extension("CyBlack",sources=["CyBlack.pyx"],
          extra_compile_args=['/Ox', '/openmp', '/favor:INTEL64'],
          language='c++')]

setup(
    name= 'Generic model class',
    cmdclass = {'build_ext': build_ext},
    include_dirs = [np.get_include()],
    ext_modules = ext_modules)
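
After building, calling it from Python looks roughly like this. Illustrative only: the 10M test inputs here are made-up random values, not my actual data.

import numpy as np
from CyBlack.CyBlack import CyBlack  # import path as described above

# Made-up test inputs: 10M options, float64 everywhere except the int64 call/put flags.
N = 10000000
BlackPnL = np.zeros(N)                                    # output buffer, filled in place
Black_S = np.random.uniform(80.0, 120.0, N)               # futures price
Black_Texpiry = np.random.uniform(0.1, 2.0, N)            # time to expiry (years)
Black_strike = np.random.uniform(80.0, 120.0, N)          # strike
Black_volatility = np.random.uniform(0.1, 0.5, N)         # volatility
Black_IR = np.full(N, 0.02)                               # interest rate
Black_callput = np.random.choice(np.array([1, -1], dtype=np.int64), N)  # 1=call, -1=put

BlackPnL = CyBlack(BlackPnL, Black_S, Black_Texpiry, Black_strike,
                   Black_volatility, Black_IR, Black_callput)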

And here is my NumPy-only Python code, which is already quite fast:

import numpy as np
from scipy.stats import norm

d1=((np.log(Black_S / Black_strike) + Black_Texpiry * Black_volatility **2 / 2)) / (Black_volatility * np.sqrt(Black_Texpiry))
d2=d1 - Black_volatility * np.sqrt(Black_Texpiry)
BlackPnL = np.exp(-Black_IR * Black_Texpiry) * (Black_callput * Black_S * norm.cdf(Black_callput * d1) - Black_callput * Black_strike * norm.cdf(Black_callput * d2))
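
For timing it head to head with the Cython version, these three lines can be wrapped in a function; the `PyBlack` used in the timings further down is along these lines (sketch only):

import numpy as np
from scipy.stats import norm

def PyBlack(BlackPnL, Black_S, Black_Texpiry, Black_strike, Black_volatility, Black_IR, Black_callput):
    # Same NumPy expressions as above, writing the result into BlackPnL in place
    # so the call signature matches CyBlack.
    d1 = (np.log(Black_S / Black_strike) + Black_Texpiry * Black_volatility**2 / 2) / (Black_volatility * np.sqrt(Black_Texpiry))
    d2 = d1 - Black_volatility * np.sqrt(Black_Texpiry)
    BlackPnL[:] = np.exp(-Black_IR * Black_Texpiry) * (Black_callput * Black_S * norm.cdf(Black_callput * d1) - Black_callput * Black_strike * norm.cdf(Black_callput * d2))
    return BlackPnL

The two versions can then be compared with `%timeit` in IPython, as in the answer below.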
Matt
  • You shouldn't usually expect Cython to provide a major speedup over vectorized numpy operations. The numpy operations are already being done in C. – BrenBarn May 11 '16 at 18:06
  • BrenBarn is right -- it's the *non-vectorized* operations you get the most benefit from dropping to C. One thing you *can* do, though, to see if there are optimizations not yet captured is to run cython with `-a` and look at the resulting HTML in a browser. A quick skim makes it look like it's calling `pow` instead of squaring, for example. – DSM May 11 '16 at 18:13
  • Makes sense, I did change `pow` to `**` and `sqrt` to `**0.5` already. I was thinking I could do something with `nogil` and `openmp` to at least get multithreading. This is just the most basic finance example, which is a good building block for all other analytical option pricing libraries, so I think others will benefit from reading this post. – Matt May 11 '16 at 18:20

1 Answer


I added the following lines before the functions in your Cython code, and (on Python 2.7) I got a faster result from Cython than from the pure NumPy version:

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)

My results for 10M points:

%timeit PyBlack(BlackPnL, Black_S, Black_Texpiry, Black_strike, Black_volatility, Black_IR, Black_callput)
1 loops, best of 3: 3.49 s per loop

and

%timeit CyBlack(BlackPnL, Black_S, Black_Texpiry, Black_strike, Black_volatility, Black_IR, Black_callput)
1 loops, best of 3: 2.12 s per loop

EDIT

CyBlack.pyx

from numpy cimport ndarray
cimport numpy as np
cimport cython

cdef extern from "math.h":
    double exp(double)
    double sqrt(double)
    double log(double)
    double fabs(double)

# Coefficients for the Abramowitz and Stegun 7.1.26 erf approximation
# (same form as the C++ version at http://www.johndcook.com/blog/cpp_erf/)
cdef double a1 = 0.254829592
cdef double a2 = -0.284496736
cdef double a3 =  1.421413741
cdef double a4 = -1.453152027
cdef double a5 =  1.061405429
cdef double p =  0.3275911 

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
cdef inline double erf(double x):
    cdef int sign = 1
    if (x < 0):
        sign = -1
    x = fabs(x)

    cdef double t = 1.0/(1.0 + p*x)
    cdef double y = 1.0 - (((((a5*t + a4)*t) + a3)*t + a2)*t + a1)*t*exp(-x*x)

    return sign*y

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
cdef double std_norm_cdf(double x):
    return 0.5*(1+erf(x/sqrt(2.0)))

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
cpdef CyBlack(ndarray[np.float64_t, ndim=1] BlackPnL, ndarray[np.float64_t, ndim=1] Black_S, ndarray[np.float64_t, ndim=1] Black_Texpiry, ndarray[np.float64_t, ndim=1] Black_strike, ndarray[np.float64_t, ndim=1] Black_volatility, ndarray[np.float64_t, ndim=1] Black_IR, ndarray[np.int64_t, ndim=1] Black_callput):

    cdef Py_ssize_t i
    cdef Py_ssize_t N = BlackPnL.shape[0]
    cdef double d1, d2


    for i in range(N):
        d1 = ((log(Black_S[i] / Black_strike[i]) + Black_Texpiry[i] * Black_volatility[i] **2 / 2)) / (Black_volatility[i] * sqrt(Black_Texpiry[i]))
        d2 = d1 - Black_volatility[i] * sqrt(Black_Texpiry[i])
        BlackPnL[i] = exp(-Black_IR[i] * Black_Texpiry[i]) * (Black_callput[i] * Black_S[i] * std_norm_cdf(Black_callput[i] * d1) - Black_callput[i] * Black_strike[i] * std_norm_cdf(Black_callput[i] * d2)) 

    return BlackPnL

setup.py

try:
    from setuptools import setup
    from setuptools import Extension
except ImportError:
    from distutils.core import setup
    from distutils.extension import Extension

from Cython.Distutils import build_ext
import numpy as np

ext_modules = [Extension("CyBlack",["CyBlack.pyx"])]

setup(
    name= 'Generic model class',
    cmdclass = {'build_ext': build_ext},
    include_dirs = [np.get_include()],
    ext_modules = ext_modules)
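
To sanity-check the hand-rolled `erf` against scipy, the two implementations can be compared on a smaller random sample, roughly like this (sketch; made-up inputs, assuming the compiled module imports as described in the question):

import numpy as np
from scipy.stats import norm
from CyBlack.CyBlack import CyBlack

# Rough sanity check of the approximate erf: compare against the scipy-based
# formula on a smaller random sample (arbitrary parameters).
N = 100000
S = np.random.uniform(80.0, 120.0, N)
T = np.random.uniform(0.1, 2.0, N)
K = np.random.uniform(80.0, 120.0, N)
vol = np.random.uniform(0.1, 0.5, N)
r = np.full(N, 0.02)
cp = np.random.choice(np.array([1, -1], dtype=np.int64), N)

cy = CyBlack(np.zeros(N), S, T, K, vol, r, cp)

d1 = (np.log(S / K) + T * vol**2 / 2) / (vol * np.sqrt(T))
d2 = d1 - vol * np.sqrt(T)
py = np.exp(-r * T) * (cp * S * norm.cdf(cp * d1) - cp * K * norm.cdf(cp * d2))

print(np.abs(cy - py).max())  # largest absolute price difference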
sebacastroh
  • I had tried that earlier, and for some reason on Python 3.5 it doesn't improve the performance at all. I have no idea why... – Matt May 11 '16 at 23:19
  • Did you change the compiler options by chance? – Matt May 11 '16 at 23:28
  • I added the code I used in my answer. I didn't use your `extra_compile_args = ['/EHsc', '/openmp', '/favor:INTEL64']` and I had to write my own `erf` function. – sebacastroh May 12 '16 at 00:03
  • I edited my setup.py in the original post to be compatible with Python 3.5, using your template as the starting point. So far, with the options I posted, I'm down to roughly 3.25s using your erf function, tied with the native NumPy version on Python 3.5. – Matt May 12 '16 at 02:02
  • I upvoted your answer, thanks for actually trying the code out. For some reason with Python 3.5 it makes little difference. I've posted another question where I have actually added OpenMP but can't seem to call it after compiling here: http://stackoverflow.com/questions/37198246/cython-parallel-implementation-openmp-of-black-scholes-with-numpy-integrated-can – Matt May 12 '16 at 23:56
  • Just selected your answer - I recompiled it and am seeing a great improvement: %timeit CyBlack(BlackPnL, Black_S, Black_Texpiry, Black_strike, Black_volatility, Black_IR, Black_callput) 1 loop, best of 3: 2.05 s per loop – Matt May 13 '16 at 14:37
  • Just FYI, there is a bug in the erf calculation above; I saw a very similar approximation online, so I'm not sure what's wrong there. I'm posting this as a note so that anyone implementing their own version checks their results for correctness (using the C erf function I get exactly the same results out of my Python function). BUT I just confirmed that changing `cdef double std_norm_cdf(double x)` to `cdef inline double std_norm_cdf(double x)` drops it down to 2.74s. – Matt May 14 '16 at 00:22
  • This approximation is much better: at most a $0.015 difference on 10M options (2.2s): `cdef inline double std_norm_cdf(double x): cdef double a = -0.07056; cdef double b = -1.5976; return 1/(1 + exp(a * (x*x*x) + b * x))` (written out more readably below). – Matt May 14 '16 at 01:00
  • The approximation posted in this answer by @sebacastroh comes from the link below, but I don't suggest using it because of the large difference between it and the actual C erf function. The simpler approximation above is better. Here is the link for reference: http://www.johndcook.com/blog/cpp_erf/ – Matt May 14 '16 at 02:32
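
For reference, the approximation from the last comment written out as a standalone function (same constants as given there, only reformatted):

cdef extern from "math.h":
    double exp(double)

# Logistic approximation to the standard normal CDF from the comment above
# (constants as given there: a = -0.07056, b = -1.5976).
cdef inline double std_norm_cdf(double x):
    cdef double a = -0.07056
    cdef double b = -1.5976
    return 1.0 / (1.0 + exp(a * (x*x*x) + b * x))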