
I have to assign values to a 3-dimensional array. After some benchmarking, I found that assigning the value is much slower than computing the value to be assigned.

That is,

for i in range(n):
    for j in range(g):
        for k in range(t):
            Mu_view[i,j,k] = exp(Xbeta_view[i,j]+Gamma[j,k]) * S[i]

is slower than

for i in range(n):
    for j in range(g):
        for k in range(t):
            exp(Xbeta_view[i,j]+Gamma[j,k]) * S[i]

The benchmark shows roughly two orders of magnitude difference in runtime:

# no assignment
25.5 ms ± 260 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# with assignment
1.7 s ± 23.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The full code of the function:

# module-level declarations assumed by the snippet below
import numpy as np
cimport numpy as np
cimport cython
from libc.math cimport exp

DTYPE = np.float64

@cython.boundscheck(False)
@cython.wraparound(False)
def get_Mu(double[::1,:] N,
           double[::1,:] X,
           double[::1,:] Beta,
           double[::1,:] Gamma,
           double[::1] S):
    cdef int i, j, k
    cdef int n, g, p, t
    n, g, p, t = N.shape[0], Beta.shape[0], Beta.shape[1], Gamma.shape[1]

    cdef double x


    # declare Mu : shape= n * g * t and its memoryview
    cdef np.ndarray[double, ndim=3, mode='fortran'] Mu = np.empty(shape=(n,g,t), dtype=DTYPE, order='F')
    cdef double[::1,:,:] Mu_view = Mu

    # obtain XBeta and compute Mu on mem-view
    cdef np.ndarray[double, ndim=2] Xbeta = matmul_A_B_t(X, Beta)
    cdef double[::1,:] Xbeta_view = Xbeta

    for i in range(n):
        for j in range(g):
            for k in range(t):
                Mu_view[i,j,k] = exp(Xbeta_view[i,j]+Gamma[j,k]) * S[i]

    return Mu

I suspect that each write through `Mu_view[i,j,k]` creates a new slice and thus adds overhead, but I can't come up with a solution.

EDIT: Input variables used for the benchmark:

n, g, p, t = 10000, 500, 10, 20
N = np.ones(shape=(n,g), order='F', dtype=float)
X = np.ones(shape=(n,p), order='F', dtype=float)
Beta = np.ones(shape=(g,p), order='F', dtype=float)
Gamma = np.ones(shape=(g,t), order='F', dtype=float)
S = np.ones(n, dtype=float)
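As a cross-check (and a possible fallback), the triple loop can be written as a single NumPy broadcasting expression. This is only a sketch in plain NumPy; `Xbeta` stands in for the result of `matmul_A_B_t(X, Beta)`, which is assumed here to be `X @ Beta.T`:

```python
import numpy as np

# smaller sizes than the benchmark, just to keep the check fast
n, g, p, t = 100, 50, 10, 20
X = np.ones((n, p), order='F')
Beta = np.ones((g, p), order='F')
Gamma = np.ones((g, t), order='F')
S = np.ones(n)

Xbeta = X @ Beta.T  # shape (n, g); stands in for matmul_A_B_t(X, Beta)

# broadcast (n, g, 1) + (1, g, t) -> (n, g, t), then scale along axis 0
Mu = np.exp(Xbeta[:, :, None] + Gamma[None, :, :]) * S[:, None, None]
```

This moves the loops into NumPy's C internals, at the cost of materializing the intermediate `(n, g, t)` sum before the `exp`.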
  • It is possible that it optimizes out the whole thing without the assignment, since it knows the results are unused. Try something like `sum += exp(...)` (with `sum` cdefed as a double) to make sure it can't do that. I'm not convinced that's what's happening, but it might be... – DavidW May 11 '20 at 06:12
  • I defined `sum += exp(Xbeta_view[i,j]+Gamma[j,k]) * S[i]` and got 0.7 s. I also tried `val = exp(Xbeta_view[i,j]+Gamma[j,k]) * S[i]` and it also takes 0.7 s. However, adding `Mu[i,j,k] = val` gives 1.7 s, the same as the original slow case. – deep_lazy May 11 '20 at 06:29
  • To add: if I remove `exp` from the original slow case, I get 0.7 s. – deep_lazy May 11 '20 at 06:36
  • It might be worth swapping the order of the loops. You typically want the contiguous index to be the inner loop (`i` in your case) – DavidW May 11 '20 at 07:03
  • Thanks, swapping the loops gave a 10–20% improvement. – deep_lazy May 11 '20 at 07:21
  • To be honest that may just be how fast it goes. It doesn't seem completely unreasonable. You might be able to speed it up a little by flattening the array so it's 1D, but beyond that nothing looks obviously wrong. – DavidW May 11 '20 at 17:43
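On the loop-order comment: for a Fortran-ordered array the first axis is the contiguous one, so `i` belongs in the innermost loop. The strides (in bytes, for float64) make this visible; a small illustration:

```python
import numpy as np

Mu_f = np.empty((4, 3, 2), order='F')  # Fortran order: axis 0 contiguous
Mu_c = np.empty((4, 3, 2), order='C')  # C order: axis 2 contiguous

# Fortran strides grow left to right, C strides grow right to left
print(Mu_f.strides)  # (8, 32, 96)
print(Mu_c.strides)  # (48, 16, 8)
```

Iterating so that the smallest stride varies fastest keeps the writes sequential in memory, which is why making `i` the inner loop helps here.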
