7

I am struggling to initialise thread-local ndarrays with cython.parallel:

Pseudo-code:

cdef:
    ndarray buffer

with nogil, parallel():
    buffer = np.empty(...)

    for i in prange(n):
        with gil:
            print "Thread %d: data address: 0x%x" % (threadid(), <uintptr_t>buffer.data)

        some_func(buffer.data)  # use thread-local buffer

cdef void some_func(char * buffer_ptr) nogil:
    (... works on buffer contents...)

My problem is that in all threads `buffer.data` points to the same address, namely the address assigned by whichever thread last executed the assignment.

Despite `buffer` being assigned within the `parallel()` (or alternatively `prange`) block, Cython does not make `buffer` a private or thread-local variable but keeps it as a shared variable.

As a result, `buffer.data` points to the same memory region in every thread, wreaking havoc on my algorithm.

This is not a problem exclusive to `ndarray` objects but seemingly applies to all objects of `cdef class` types.

How do I solve this problem?

ARF
  • Can you call `np.empty` without the gil? – Bi Rico Mar 06 '15 at 23:08
  • perhaps [this answer](http://stackoverflow.com/a/20520295/832621) brings what you want... – Saullo G. P. Castro Mar 07 '15 at 09:33
  • @BiRico Is that a rhetorical question :) ? No, you definitely can't instantiate a numpy array (or a memoryview) within a `nogil` block (otherwise the array would not be allocated within Python's managed memory, and could not be garbage collected etc.) – ali_m Mar 08 '15 at 01:21
  • [I've battled with a similar issue before](http://stackoverflow.com/q/22229012/1461210). As far as I'm aware, Cython does not allow for thread-private numpy arrays or memoryviews (if that situation has changed then I'd love to hear about it!). As Saullo suggests, your best option is probably to allocate one big array outside of the parallel block, then pass each worker thread a pointer to the segment of the array to use. – ali_m Mar 08 '15 at 01:22
  • It lets you compile? If you use a memoryview it doesn't even allow you to compile: "Memoryview slices can only be shared in parallel sections" – colinfang Mar 26 '15 at 16:24
  • can you post the actual code? – Davoud Taghawi-Nejad Sep 26 '15 at 01:19

1 Answer

6

I think I have finally found a solution to this problem that I like. The short version is that you create an array with shape:

    (number_of_threads, ...<whatever shape you need in the thread>...)

Then call `openmp.omp_get_thread_num()` and use the thread number to index the array, giving each thread a "thread-local" sub-array. This avoids allocating a separate array for every loop index (which could be enormous) while still preventing threads from overwriting each other.

Here's a rough version of what I did:

import numpy as np
import multiprocessing

cimport numpy as c_numpy
from cython.parallel import prange
from libc.stdlib cimport malloc, free
cimport openmp

...

# DTYPE / CDTYPE stand for the NumPy dtype and its C counterpart
# (e.g. np.float64 / double).
cdef int num_items = ...
num_threads = multiprocessing.cpu_count()
# One row per thread, so each thread works on separate memory.
result_array = np.zeros((num_threads, num_items), dtype=DTYPE)

# Cache a raw pointer to each thread's row so it can be used without the GIL.
cdef c_numpy.ndarray result_cn
cdef CDTYPE ** result_pointer_arr
result_pointer_arr = <CDTYPE **> malloc(num_threads * sizeof(CDTYPE *))
for i in range(num_threads):
    result_cn = result_array[i]
    result_pointer_arr[i] = <CDTYPE *> result_cn.data

cdef int thread_number
for i in prange(num_items, nogil=True, chunksize=1, num_threads=num_threads, schedule='static'):
    thread_number = openmp.omp_get_thread_num()
    some_function(result_pointer_arr[thread_number])

free(result_pointer_arr)
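The layout trick above can be illustrated in pure Python as well (this sketch is mine, not part of the Cython answer; the names and the use of `ThreadPoolExecutor` are illustrative assumptions): allocate one `(num_threads, num_items)` array up front and have each worker index its own row, so every worker sees a distinct buffer address.

```python
# Pure-Python sketch of the per-thread-row pattern (illustrative only):
# one shared 2-D array, each worker writes only to its own row.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

num_threads = 4
num_items = 8
result = np.zeros((num_threads, num_items))

def worker(thread_id):
    row = result[thread_id]   # this worker's private sub-array
    row[:] = thread_id        # write without touching other rows
    return row.ctypes.data    # base address of this row

with ThreadPoolExecutor(max_workers=num_threads) as ex:
    addresses = list(ex.map(worker, range(num_threads)))

# Each row starts at a distinct address, unlike the shared `buffer`
# in the question.
assert len(set(addresses)) == num_threads
```

Here the row index is passed in explicitly; in the Cython/OpenMP version the same role is played by `omp_get_thread_num()`.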
David