
I am trying to parallelize the computation of a metric on the nodes of a graph.

My approach is to have each thread compute the metric for one node, since the computation for each node is independent.

Each thread has to find the distance-one neighbors of its node and store them in an array whose size is initially unknown (and different for each node).

I can't use an extern __shared__ array because each thread has to compute its own array; it can't be shared.

I can't declare a fixed maximum array size, because that would be very inefficient for my task.

Is there any other way to handle this array, or some other dynamic data structure?

This is an extract of the kernel function:

__global__ void expectedForce(int* IR_vec, int* IC_vec, int n_IR)
{
    double ExF = 0;
    int seed = blockDim.x * blockIdx.x + threadIdx.x + 1;

    if (seed < n_IR) {
        int valRiga = IR_vec[seed];
        int distOne[]; // that's the array I have to handle
        ...
    }
}
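For what it's worth, CUDA does support per-thread dynamic allocation from the device heap via in-kernel malloc/new. Below is a minimal sketch of what that could look like here. It assumes, purely for illustration, that IR_vec holds CSR-style row offsets (with one extra entry at index n_IR) and that IC_vec holds the corresponding column indices; the question doesn't specify the actual format.

```cuda
__global__ void expectedForce(const int* IR_vec, const int* IC_vec, int n_IR)
{
    double ExF = 0.0;
    int seed = blockDim.x * blockIdx.x + threadIdx.x + 1;

    if (seed < n_IR) {
        // Assumption: IR_vec[seed]..IR_vec[seed+1] delimit this node's
        // neighbours in IC_vec (CSR-style row offsets).
        int degree = IR_vec[seed + 1] - IR_vec[seed];

        // Per-thread, dynamically sized array from the device heap.
        int* distOne = (int*)malloc(degree * sizeof(int));
        if (distOne == NULL) return; // device heap exhausted

        for (int i = 0; i < degree; ++i)
            distOne[i] = IC_vec[IR_vec[seed] + i];

        // ... compute ExF from distOne ...

        free(distOne); // must be freed on the device as well
    }
}
```

Note that the device heap defaults to 8 MB; for larger graphs you would need to raise it on the host before launching, e.g. `cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128 * 1024 * 1024);`. As the comments below point out, in-kernel malloc is slow when many threads allocate concurrently, so it is worth benchmarking against the fixed-size alternatives.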
talonmies
  • There's no other way but to use a fixed-size array; you cannot create dynamically sized private data. And I don't understand what you mean by storing the array while the kernel runs. Besides, the threads progress at the same "speed" — all threads must process the same instructions, so execution runs at the speed of the slowest thread. What you can do is have each thread process up to a fixed number of node neighbours, and assign multiple threads to the same node when its neighbour count is large enough. – ALX23z Oct 20 '21 at 10:51
  • @ALX23z: That simply isn't true. CUDA has supported dynamically allocated private data for about 10 years via kernel side `new` or `malloc` – talonmies Oct 20 '21 at 11:16
  • @talonmies - checked it; indeed CUDA has `malloc`, unlike OpenCL, but its performance is absolute trash. – ALX23z Oct 20 '21 at 12:23
  • You can allocate the maximum size per block and then let the threads synchronize over which one may access which part of it. If you are running out of memory, stall some of the threads. You could also divide your work into a first pass that determines the number of neighbours and a second pass that does the actual processing. Is global memory too slow, even when optimizing memory accesses? – Sebastian Oct 22 '21 at 12:02
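The two-pass idea from the last comment can be sketched as follows: a count kernel records how many entries each thread will emit, an exclusive prefix sum turns those counts into write offsets, and a fill kernel writes into one buffer allocated at exactly the total size. All names here (rowPtr, colIdx, countPass, fillPass) are hypothetical stand-ins, and CSR row offsets of length nNodes + 1 are assumed:

```cuda
#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Pass 1: each thread records how many entries it will emit
// (here simply the node's degree under the CSR assumption).
__global__ void countPass(const int* rowPtr, int nNodes, int* counts)
{
    int v = blockDim.x * blockIdx.x + threadIdx.x;
    if (v < nNodes)
        counts[v] = rowPtr[v + 1] - rowPtr[v];
}

// Pass 2: each thread writes its entries at its precomputed offset.
__global__ void fillPass(const int* rowPtr, const int* colIdx, int nNodes,
                         const int* offsets, int* distOneAll)
{
    int v = blockDim.x * blockIdx.x + threadIdx.x;
    if (v < nNodes) {
        int out = offsets[v];
        for (int i = rowPtr[v]; i < rowPtr[v + 1]; ++i)
            distOneAll[out++] = colIdx[i];
    }
}

// Host side: count, scan counts into offsets, allocate once, fill.
void buildDistOne(const int* d_rowPtr, const int* d_colIdx, int nNodes)
{
    thrust::device_vector<int> counts(nNodes), offsets(nNodes);
    int blocks = (nNodes + 255) / 256;

    countPass<<<blocks, 256>>>(d_rowPtr, nNodes,
                               thrust::raw_pointer_cast(counts.data()));
    thrust::exclusive_scan(counts.begin(), counts.end(), offsets.begin());

    // Total output size = last offset + last count.
    int total = (nNodes == 0) ? 0 : offsets.back() + counts.back();
    thrust::device_vector<int> distOneAll(total);

    fillPass<<<blocks, 256>>>(d_rowPtr, d_colIdx, nNodes,
                              thrust::raw_pointer_cast(offsets.data()),
                              thrust::raw_pointer_cast(distOneAll.data()));
}
```

This keeps each thread's array in one contiguous global buffer with no per-thread allocation at all, at the cost of an extra kernel launch and a scan.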

0 Answers