
I need to speed up the calculation of a linear transform which is roughly of the following form:

import numpy as np

N=10000
input=np.random.random(N)

x=np.linspace(0,100,N)
y=np.linspace(0,30,N)
X,Y=np.meshgrid(x,y,sparse=True)

output=np.dot(np.cos(X*Y),input)

That is, I evaluate the cosine on a regular grid and multiply my input by the resulting matrix. In reality, the kernel function (here, the cosine) is more complicated; in particular, it is not periodic. Hence, no FFT-type simplification is possible!
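For clarity: with `sparse=True`, `meshgrid` returns `X` of shape `(1, N)` and `Y` of shape `(N, 1)`, so `(X*Y)[i, j] = x[j]*y[i]` and the transform is the dense matrix-vector product `output[i] = sum_j cos(x[j]*y[i]) * input[j]`, i.e. O(N^2) kernel evaluations. A small self-check at reduced N (so it runs quickly):

```python
import numpy as np

N = 50
rng = np.random.default_rng(0)
inp = rng.random(N)
x = np.linspace(0, 100, N)
y = np.linspace(0, 30, N)

X, Y = np.meshgrid(x, y, sparse=True)
via_meshgrid = np.dot(np.cos(X * Y), inp)

# Same thing as an explicit kernel matrix K[i, j] = cos(x[j] * y[i])
K = np.cos(np.outer(y, x))
via_outer = K @ inp

assert np.allclose(via_meshgrid, via_outer)
```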

The above transform takes about 5 seconds on my multi-core machine. Now, I definitely need to speed this up. A simple first try is to use numexpr:

import numpy as np
import numexpr as ne

N=10000
input=np.random.random(N)

x=np.linspace(0,100,N)
y=np.linspace(0,30,N)
X,Y=np.meshgrid(x,y,sparse=True)

output=np.dot(ne.evaluate('cos(X*Y)'),input)

numexpr evaluates the expression in parallel across cores and reduces the execution time to about 0.9 seconds. This is quite nice, but not sufficient for my purpose. So, my next try is to employ parallel Cython:

import numpy as np
from cython.parallel import prange

cimport numpy as np
cimport cython
from libc.math cimport cos

DTYPE = np.float64
ctypedef np.float64_t DTYPE_t

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
def transform(double[:] x, double[:] y, double[:] input):

    cdef unsigned int N = x.shape[0]
    cdef double[:] output = np.zeros(N)
    cdef unsigned int row, col

    for row in prange(N, nogil=True):
        for col in range(N):
            output[row] += cos(x[row]*y[col])*input[col]

    return output

I compile this by executing

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

ext_modules=[
    Extension("cythontransform",
          ["cythontransform.pyx"],
          libraries=["m"],
          extra_compile_args = ["-O3", "-ffast-math", "-march=native", "-fopenmp" ],
          extra_link_args=['-fopenmp']
          ) 
]

setup(
  name = "cythontransform",
  cmdclass = {"build_ext": build_ext},
  ext_modules = ext_modules
)

from the command line via python setup.py build_ext --inplace. Calling the transform via

import numpy as np
from cythontransform import transform

N=10000
input=np.random.random(N)

x=np.linspace(0,100,N)
y=np.linspace(0,30,N)

output=transform(x,y,input)

yields only a modest further improvement, to roughly 0.7 seconds.

Is anyone aware of possible further improvements to the Cython code?

Or, alternatively, is there some other framework (PyOpenCL, Pythran, Numba, ...) that is better suited to this problem?
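As a correctness baseline for any faster variant, a row-chunked NumPy version (chunk size arbitrary) follows the same convention as the Cython loop above and avoids materializing the full 10000×10000 kernel matrix, which is about 800 MB in float64:

```python
import numpy as np

def transform_chunked(x, y, inp, chunk=512):
    # Same convention as the Cython loop:
    # output[row] = sum_col cos(x[row] * y[col]) * inp[col]
    N = x.shape[0]
    output = np.empty(N)
    for start in range(0, N, chunk):
        stop = min(start + chunk, N)
        # (stop - start, N) block of the kernel matrix
        block = np.cos(x[start:stop, None] * y[None, :])
        output[start:stop] = block @ inp
    return output
```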

phlegmax
  • FYI: `np.empty(N)` does not initialize the new array to 0, so you'll likely get garbage out of the cython function. Use `np.zeros(N)`. – Warren Weckesser Apr 03 '17 at 23:42
  • Is there a typo in the line `X,Y=np.meshgrid(x,x,sparse=True)`? Should that be `X,Y=np.meshgrid(x,y,sparse=True)`? If not, then you defined `y` but never used it. – Warren Weckesser Apr 03 '17 at 23:51
  • This may just be how long the calculation takes. The Cython code looks pretty good. – DavidW Apr 04 '17 at 06:59
  • Declaring the memoryviews as `double[::1]` forces alignment of the arrays and might provide a tiny speedup. – Pierre de Buyl Apr 04 '17 at 14:17
  • Dear Pierre, indeed, this seems to give a speed up of a few percent. Better than nothing. Thanks! – phlegmax Apr 04 '17 at 18:53

1 Answer


On my laptop, the following pythran version:

#pythran export transform(float64[], float64[], float64[])
import numpy as np

def transform(x, y, input):
    N = x.shape[0]
    output = np.zeros(N)

    #omp parallel for
    for row in range(N):
        for col in range(N):
            output[row] += np.cos(x[row]*y[col])*input[col]
    return output

Compiled with

pythran -Ofast dd.py -fopenmp

runs roughly twice as fast as the Cython version you propose. I did not investigate why this happens, though...
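For anyone reproducing this, the double loop in `transform` has a straightforward NumPy one-liner that can serve as a correctness check for the compiled version (at a smaller N, since it materializes the full kernel matrix):

```python
import numpy as np

def transform_reference(x, y, inp):
    # output[row] = sum_col cos(x[row] * y[col]) * inp[col],
    # identical to the loop in the compiled transform()
    return np.cos(np.outer(x, y)) @ inp

N = 100
rng = np.random.default_rng(1)
inp = rng.random(N)
x = np.linspace(0, 100, N)
y = np.linspace(0, 30, N)
ref = transform_reference(x, y, inp)
```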

serge-sans-paille