
I'm trying to use Cython to parallelize an expensive operation which involves generating intermediate multidimensional arrays.

The following very simplified code illustrates the sort of thing I'm trying to do:

import numpy as np
cimport cython
cimport numpy as np
from cython.parallel cimport prange
from libc.stdlib cimport malloc, free


@cython.boundscheck(False)
@cython.wraparound(False)
def embarrassingly_parallel_example(char[:, :] A):

    cdef unsigned int m = A.shape[0]
    cdef unsigned int n = A.shape[1]
    cdef np.ndarray[np.float64_t, ndim=2] out = np.empty((m, m), np.float64)
    cdef unsigned int ii, jj
    cdef double[:, :] tmp

    for ii in prange(m, nogil=True):
        for jj in range(m):

            # allocate a temporary array to hold the result of
            # expensive_function_1
            tmp_carray = <double *> malloc((n ** 2) * sizeof(double))

            # a 2D typed memoryview onto tmp_carray
            tmp = <double[:n, :n]> tmp_carray

            # shove the intermediate result in tmp
            expensive_function_1(A[ii, :], A[jj, :], tmp)

            # get the final (scalar) output for this ii, jj
            out[ii, jj] = expensive_function_2(tmp)

            # free the intermediate array
            free(tmp_carray)

    return out


# some silly examples - the actual operation I'm performing is a lot more
# involved
# ------------------------------------------------------------------------
@cython.boundscheck(False)
@cython.wraparound(False)
cdef void expensive_function_1(char[:] x, char[:] y, double[:, :] tmp):

    cdef unsigned int m = tmp.shape[0]
    cdef unsigned int n = x.shape[0]
    cdef unsigned int ii, jj, kk

    for ii in range(m):
        for jj in range(m):
            tmp[ii, jj] = 0
            for kk in range(n):
                tmp[ii, jj] += (x[kk] + y[kk]) * (ii - jj)


@cython.boundscheck(False)
@cython.wraparound(False)
cdef double expensive_function_2(double[:, :] tmp):

    cdef unsigned int m = tmp.shape[0]
    cdef unsigned int ii, jj
    cdef double result = 0

    for ii in range(m):
        for jj in range(m):
            result += tmp[ii, jj]

    return result

There seem to be at least two reasons why this fails to compile:

  1. Based on the output of cython -a, the creation of the typed memory view here:

    cdef double[:, :] tmp = <double[:n, :n]> tmp_carray
    

    seems to involve Python API calls, and I therefore can't release the GIL to allow the outer loop to run in parallel.

    I was under the impression that typed memoryviews were not Python objects, and that a worker thread therefore ought to be able to create one without first acquiring the GIL. Is this the case?

2. Even if I replace prange(m, nogil=True) with a normal range(m), Cython still doesn't seem to like the presence of a cdef within the inner loop:

    Error compiling Cython file:
    ------------------------------------------------------------
    ...
                # allocate a temporary array to hold the result of
                # expensive_function_1
                tmp_carray = <double*> malloc((n ** 2) * sizeof(double))

                # a 2D typed memoryview onto tmp_carray
                cdef double[:, :] tmp = <double[:n, :n]> tmp_carray
                    ^
    ------------------------------------------------------------

    parallel_allocate.pyx:26:17: cdef statement not allowed here

Update

It turns out that the second problem was easily solved by moving

 cdef double[:, :] tmp

outside of the for loop, and just assigning

 tmp = <double[:n, :n]> tmp_carray

within the loop. I still don't fully understand why this is necessary, though.
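Putting the two pieces together, the serial version that does compile looks roughly like this (a sketch showing only the relevant lines of the loop):

```cython
cdef double[:, :] tmp      # declared once, at function scope
cdef double *tmp_carray

for ii in range(m):        # plain range, so the GIL is still held
    for jj in range(m):
        tmp_carray = <double *> malloc((n ** 2) * sizeof(double))
        tmp = <double[:n, :n]> tmp_carray   # assignment (not declaration) is fine here
        expensive_function_1(A[ii, :], A[jj, :], tmp)
        out[ii, jj] = expensive_function_2(tmp)
        free(tmp_carray)
```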

Now if I try to use prange I hit the following compilation error:

Error compiling Cython file:
------------------------------------------------------------
...
            # allocate a temporary array to hold the result of
            # expensive_function_1
            tmp_carray = <double*> malloc((n ** 2) * sizeof(double))

            # a 2D typed memoryview onto tmp_carray
            tmp = <double[:n, :n]> tmp_carray
               ^
------------------------------------------------------------

parallel_allocate.pyx:28:16: Memoryview slices can only be shared in parallel sections
ali_m
    I think I've the answer to 'I still don't fully understand why this is necessary, though.' See [this question](http://stackoverflow.com/questions/21572718/why-cython-forces-declaration-of-locals-at-the-beginning-of-a-function) – hivert Mar 06 '14 at 18:08
  • @hivert OK, that does make sense. I'm still not totally clear on what the consequences are of declaring `tmp` in the outer loop. If I was able to parallelize the outer loop, would this mean that when I make the assignment `tmp = tmp_carray` that this ought to affect where all of the child threads are reading/writing in memory (which would obviously not work), or would the fact that I `malloc` a new buffer for each inner loop get around this? – ali_m Mar 06 '14 at 18:31
  • I encountered the same problem. Have you already found a fix for it? – colinfang Mar 24 '15 at 17:00
  • @colinfang No, as far as I'm aware this is still a limitation of Cython. The easiest workaround is to allocate a memoryview buffer outside of the parallel block, then pass a pointer to the appropriate row of the buffer for each worker process. To be honest it might well be better to do it this way in terms of performance, since child threads don't then have to compete for memory allocation. – ali_m Mar 24 '15 at 17:27
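One way to sketch that workaround in Cython (untested; it assumes the helper functions are declared `nogil`, and it keys the preallocated scratch space by OpenMP thread id rather than by row, which is one possible reading of the suggestion above):

```cython
from cython.parallel cimport prange
cimport openmp
import numpy as np
cimport numpy as np

def outer(char[:, :] A):
    cdef unsigned int m = A.shape[0]
    cdef unsigned int n = A.shape[1]
    cdef int tid
    cdef int nthreads = openmp.omp_get_max_threads()
    # one (n, n) scratch array per thread, allocated while we still hold the GIL
    cdef double[:, :, :] scratch = np.empty((nthreads, n, n))
    cdef double[:, :] out = np.empty((m, m))
    cdef unsigned int ii, jj

    for ii in prange(m, nogil=True):
        tid = openmp.omp_get_thread_num()
        for jj in range(m):
            # slicing a shared memoryview needs no GIL
            expensive_function_1(A[ii, :], A[jj, :], scratch[tid])
            out[ii, jj] = expensive_function_2(scratch[tid])
    return np.asarray(out)
```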

2 Answers


Disclaimer: everything here should be taken with a grain of salt; I'm more guessing than knowing. You should certainly ask the question on Cython-User. They are always friendly and fast to answer.

I agree that Cython's documentation is not very clear:

[...] memoryviews often do not need the GIL:

cpdef int sum3d(int[:, :, :] arr) nogil: ...

In particular, you do not need the GIL for memoryview indexing, slicing or transposing. Memoryviews require the GIL for the copy methods (C and Fortran contiguous copies), or when the dtype is object and an object element is read or written.

I think this means that passing a memoryview as a parameter, or indexing, slicing or transposing it, doesn't need the GIL. However, creating a memoryview, or copying one, does.

Another argument supporting this is that it is possible for a Cython function to return a memoryview to Python:

from cython.view cimport array as cvarray
import numpy as np

def bla():
    narr = np.arange(27, dtype=np.dtype("i")).reshape((3, 3, 3))
    cdef int [:, :, :] narr_view = narr
    return narr_view

Gives:

>>> import hello
>>> hello.bla()
<MemoryView of 'ndarray' at 0x1b03380>

which means that the memoryview is allocated in Python's GC-managed memory and thus needs the GIL to be created. So you can't create a memoryview in a nogil section.


Now for what concerns the error message

Memoryview slices can only be shared in parallel sections

I think you should read it as "you can't have a thread-private memoryview slice; it must be a memoryview slice shared between the threads".
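My (hedged) reading of that in code form, where `consume` stands in for any `nogil` function (a hypothetical name, not part of the question's code):

```cython
# fine: `shared` is assigned before the parallel loop, so every thread
# reads the same, shared memoryview slice
cdef double[:, :] shared = np.zeros((n, n))
for ii in prange(m, nogil=True):
    consume(shared)

# not fine: assigning inside prange would make `tmp` thread-private,
# which is what the compiler is refusing
for ii in prange(m, nogil=True):
    tmp = <double[:n, :n]> tmp_carray   # "Memoryview slices can only be
                                        #  shared in parallel sections"
```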

hivert
  • I was beginning to suspect that this might be the case - thanks for the confirmation. Do you have any suggestions for the least painful way to implement an array with multidimensional indexing without running afoul of the GIL? I'm thinking of just `malloc`ing flat intermediate arrays, then writing a C helper function to convert a set of multidimensional indices to a single index into the flat array. I'm sure there must be an easier way, though - I'll ask on Cython-User as well. – ali_m Mar 06 '14 at 23:24
  • @ali_m: I saw that you accepted my post. Did you get any confirmation from the Cython developers? – hivert Mar 10 '14 at 20:29
  • I implemented the nasty 'indexing into flat `malloc`ed arrays' approach, then ran into a more fundamental problem - Cython just doesn't allow me to declare thread-local variables in order to carry out the reduction in the inner loop. I found [this discussion](https://groups.google.com/d/msg/cython-users/Ady-DdWu6rE/sDsfMKzqQ5MJ) - thread-local variables are just not implemented in Cython at the moment. It's possible to achieve the same effect through various contortions (at least for scalars, see Mark Florisson's answer in that thread), but it was more trouble than it was worth. – ali_m Mar 10 '14 at 22:15
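The "indexing into flat `malloc`ed arrays" approach mentioned in these comments can be sketched like this (`flat_get`/`flat_set` are hypothetical helper names, not part of any library):

```cython
from libc.stdlib cimport malloc, free

cdef inline double flat_get(double *buf, unsigned int ncols,
                            unsigned int i, unsigned int j) nogil:
    # element (i, j) of a row-major (nrows, ncols) array
    return buf[i * ncols + j]

cdef inline void flat_set(double *buf, unsigned int ncols,
                          unsigned int i, unsigned int j, double v) nogil:
    buf[i * ncols + j] = v

# inside prange, with no memoryviews involved:
#     tmp_carray = <double *> malloc((n ** 2) * sizeof(double))
#     flat_set(tmp_carray, n, ii, jj, some_value)
#     ...
#     free(tmp_carray)
```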

http://docs.cython.org/src/userguide/external_C_code.html#releasing-the-gil

"""

Releasing the GIL

You can release the GIL around a section of code using the with nogil statement:

 with nogil:
     <code to be executed with the GIL released>

Code in the body of the statement must not manipulate Python objects in any way, and must not call anything that manipulates Python objects without first re-acquiring the GIL. Cython currently does not check this.

"""

jsbueno
  • Yes, I've read the docs. `prange(..., nogil=True)` is equivalent to wrapping the `for` loop in a `with nogil:` statement. The issue I'm encountering is that the creation of the typed memoryview seems to involve Python API calls, even though the documentation [here](http://docs.cython.org/src/userguide/memoryviews.html) would suggest that it doesn't require the GIL. – ali_m Mar 06 '14 at 16:08