2

I have an interval-treeish algorithm I would like to run in parallel for many queries using threads. Problem is that then each thread would need its own array, since I cannot know in advance how many hits there will be.

There are other questions like this, and the solution suggested is always to have an array of size (K, t) where K is output length and t is number of threads. This does not work for me as K might be different for each thread and each thread might need to resize the array to fit all the results it gets.

Pseudocode:

for i in prange(len(starts)):

    qs, qe, qx = starts[i], ends[i], index[i]

    results = t.search(qs, qe)

    if len(results) + nfound < len(output):
        # add result to output
    else:
        # resize array
        # then add results
The Unfun Cat
  • 29,987
  • 31
  • 114
  • 156
  • Do you know an absolute upper limit for `K`? – DavidW Nov 11 '19 at 15:14
  • No, so I am guessing this is not possible. But if so that would be an acceptable answer also if I get a short explanation :) Or with q number of queries and i intervals to search in the limit is q * i I guess, but that is prohibitively large :) – The Unfun Cat Nov 11 '19 at 15:40
  • 1
    The other thing you could do is to have each thread have a thread local C array using `malloc`, `realloc` and `free`. [This answer](https://stackoverflow.com/questions/30117132/cython-parallel-cannot-see-the-difference-in-speed/30118980#30118980) illustrates some but not all of the idea. I don't have a much enthusiasm in writing this up currently, but hopefully it's enough for you to make a start (and write an an answer yourself when you get it to work :) ) – DavidW Nov 11 '19 at 15:53

1 Answers1

3

An usual pattern is that every thread gets its own container, which is a trade-off between speed/complexity and memory-overhead:

  1. there is no need to lock for access to this container, because only one thread accesses it.
  2. there is much less overhead compared to "own container for every task (i.e. every i-value)".

After the parallel section, the data must be either collected in a final container in a post processing step (which also could happen in parallel) or the subsequent algorithms should be able to handle a collection of containers.

Here is an example using c++-vector (which already has memory management and increasing size built-in):

%%cython -+ -c=/openmp --link-args=/openmp

from cython.parallel import prange, threadid
from libcpp.vector cimport vector
cimport openmp

def calc_in_parallel(N):    
    cdef int i,k,tid
    cdef int n = N
    cdef vector[vector[int]] vecs
    # every thread gets its own container
    vecs.resize(openmp.omp_get_max_threads())
    for i in prange(n, nogil=True):  
        tid = threadid()
        for k in range(i):
            # use container of the thread
            vecs[tid].push_back(k) # dummy for calculation

    return vecs

Using omp_get_max_threads() for the number of threads will overestimate the real number of threads in many cases. It is probably more robust to set the number of threads explicitly in prange, i.e.

...
NUM_THREADS = 2
vecs.resize(NUM_THREADS)
for i in prange(n, nogil=True, num_threads = NUM_THREADS): 
...

A similar approach can be applied using pure C, but more boiler plate code (memory management) will be needed in this case.

ead
  • 32,758
  • 6
  • 90
  • 153
  • I thought the use of `vector` for a "per-thread" variable was a nice approach (definitely better than malloc/free if you're already using C++). I also thought it could be applied to [this question](https://stackoverflow.com/questions/59005318/cython-parallelisation-race-condition-for-dfs) if you have an interest in posting it there. – DavidW Nov 28 '19 at 13:05