
I need to write an application that computes some matrices from other matrices. In general, it sums outer products of rows of an initial matrix E, each multiplied by a number calculated from v and t, for every t in a given range. I am a newbie to CUDA, so there may be some very wrong ideas in the implementation. Here is my code, with some explanation in the comments:
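If I write out what the code below computes, for each t in the range the result is roughly

result_t[r][c] = sum over m, n of exp(-i * (v[m] - v[n]) * t / hbar) * E[m][rhoind] * E[n][rhoind] * E[m][r] * E[n][c]

where E[m][c] stands for the element E[m * size + c] in the code.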

#include <cupy/complex.cuh>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/sequence.h>
#include <thrust/transform.h>
const int BLOCK_SIZE = 16;
const int DIM_SIZE = 16;
const double d_hbar=1.0545718176461565e-34;

extern "C"
struct sum_functor { // sum functor for thrust::transform, sums an array of matrices
    int N;
    complex<float> *arr;
    complex<float> *result;
    __host__ __device__ sum_functor(int _N, complex<float>* _arr, complex<float>* _result) : N(_N), arr(_arr), result(_result) {};

    __host__ __device__ complex<float> operator()(int n){
        complex<float> sum = result[n];
            for (int i = 0; i < BLOCK_SIZE; i++) {
                sum += arr[N * N * i + n];
            }
        return sum;
    }
};


extern "C" //outer product multiplied by given exp and rho
__global__ void outer(const complex<float>* E, int size, complex<float>* blockResult, 
                        int m, int n, complex<float> exp, complex<float> rho) {
    int col = blockIdx.y*blockDim.y+threadIdx.y;
    int row = blockIdx.x*blockDim.x+threadIdx.x;
    if (row < size && col < size) {
        blockResult[row * size + col] = exp * rho * E[m * size + row] * E[n * size + col];
    }
}

//compute constants and launch outer product kernels
//outer products are stored in blockResult, i.e. 16 matrices one after another
extern "C"
__global__ void calcBlock(const complex<float>* v, const complex<float>* E, int size, double t,
                    int rhoind, complex<float>* blockResult, int blockInd) {
    int i = threadIdx.x;
    int j = i + blockInd;
    int m = j / size;
    int n = j % size;
    if (m < size && n < size) {
        const complex<float>hbar(d_hbar);
        complex<float> exp = thrust::exp(complex<float>(0, -1)*(v[m] - v[n]) * complex<float>(t)/hbar);
        complex<float> rho = E[m * size + rhoind] * E[n * size + rhoind];
        dim3 dimGrid((size - 1)/DIM_SIZE + 1, (size - 1) / DIM_SIZE + 1, 1);
        dim3 dimBlock(DIM_SIZE, DIM_SIZE, 1);
        outer<<<dimGrid, dimBlock>>>(E, size, blockResult + i * size * size, m, n, exp, rho);
    }
}

//launch the block calculation, then sum all the matrices in the block and add them to the result
//repeat block by block until all size*size matrices have been summed
extern "C" 
__global__ void calcSum(const complex<float>* v, const complex<float>* E, int size, double t, int ind,
                    int rhoind,  complex<float>* blockResult, complex<float>* result, int* resultIndexes) {
    for (int i = 0; i < size * size; i += BLOCK_SIZE) {
        calcBlock<<<1, BLOCK_SIZE>>>(v, E, size, t, rhoind, blockResult, i);
        cudaDeviceSynchronize();
        thrust::transform(thrust::device, resultIndexes,
            resultIndexes + size * size,
                result + ind * size * size, sum_functor(size, blockResult, result + ind * size * size));

    }
}

//launch calcSum in parallel for every t in the range
extern "C" 
__global__ void eigenMethod(const complex<float>* v, const complex<float>* E, int size, const double* t, int t_size,
                    int rhoind,  complex<float>* blockResult, complex<float>* result, int* resultIndexes) {
    int i = threadIdx.x;
    if (i < t_size) {
        calcSum<<<1, 1>>>(v, E, size, t[i], i, rhoind, blockResult + i * BLOCK_SIZE * size * size, result, resultIndexes);
    }
}

//main is simplified because I am using CuPy
int main() {
    // calculate E (size * size) and v (size)
    // t is a vector of size t_size
    // initialize blockResult (t_size * BLOCK_SIZE * size * size)
    // resultIndexes (size * size) is filled with 0 .. size * size - 1
    // result (t_size * size * size) is filled with zeros
    eigenMethod<<<1, t_size>>>(v, E, size, t, t_size, 0, blockResult, result, resultIndexes);
}

The overall idea might be strange and stupid, but it works. The problem I've encountered is that all calcSum kernels called from eigenMethod are scheduled one after another.

The calcSum function and everything above it works fast enough for its purpose. The main problem appears when I try to call several of them from the eigenMethod function. Benchmarking shows a linear dependence between runtime and the number of calls: for example, eigenMethod with t_size = 32 runs almost twice as fast as with t_size = 64. I have also tried profiling it, but did not get the information I wanted, since Nsight Systems does not support CDP (CUDA Dynamic Parallelism) according to the topics I saw. I thought that accessing the same part of global memory (the arrays E and v are the same pointers for all the functions I call) might be the problem, so as a hotfix I created individual copies for every calcSum call, but it did not help. Is there a way to compute multiple calcSum kernels in parallel? The benchmark results are listed below (matrix size is 128x128):

| t_size | time, s |
|--------|---------|
| 1      | 0.32    |
| 4      | 1.036   |
| 8      | 1.9     |
| 16     | 3.6     |
  • Using thrust::device in device code (i.e. using CDP) is deprecated in the newest versions of Thrust, as CUDA 12 changes the dynamic parallelism API. So developing code now with the old CDP API should be avoided if one wants the code to be future-proof. – paleonix Jan 13 '23 at 14:42
  • Doing a device-side launch with `<<<1, 1>>>` instead of using a device function doesn't make any sense to me. – paleonix Jan 13 '23 at 14:44
  • @paleonix, Thanks for answering! My question is not so much about the reduction, but about computing multiple of the same functions (including the reduction inside them) in parallel. As I mentioned in the topic, if I launch them as shown above, they end up computing sequentially. `<<<1,1>>>` is a mistake, thank you. – Daniil Tarpanov Jan 13 '23 at 15:01
  • CDP is not for free. I would only use it in cases where at the time of the first launch from the host side you don't yet know the amount of parallelism, i.e. it is data-dependent. Otherwise you can and should use normal launches. – paleonix Jan 13 '23 at 15:12
  • Instead of putting everything into kernels, write device functions. Also you might want to use [CUB's block-level algorithms](https://nvlabs.github.io/cub/group___block_module.html) instead of Thrust with CDP. – paleonix Jan 13 '23 at 15:14
  • FYI: "Explicit synchronization with child kernels from a parent block (i.e. using `cudaDeviceSynchronize()` in device code) is deprecated in CUDA 11.6 and removed for compute_90+ compilation. For compute capability < 9.0, compile-time opt-in by specifying `-DCUDA_FORCE_CDP1_IF_SUPPORTED` is required to continue using `cudaDeviceSynchronize()` in device code. Note that this is slated for full removal in a future CUDA release." [CUDA C++ Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#synchronization-cdp1) – paleonix Jan 13 '23 at 15:26

1 Answer


By not specifying a stream you are using the default stream, which is shared among the threads of the same block. Therefore all launches of calcSum go into the same stream and have to be executed one after the other. This can be fixed by using explicit streams instead.
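A minimal sketch of what that could look like in the eigenMethod kernel, assuming everything else stays as in the question (device-side streams have to be created with the cudaStreamNonBlocking flag):

extern "C"
__global__ void eigenMethod(const complex<float>* v, const complex<float>* E, int size, const double* t, int t_size,
                    int rhoind, complex<float>* blockResult, complex<float>* result, int* resultIndexes) {
    int i = threadIdx.x;
    if (i < t_size) {
        // each thread launches its child kernel into its own stream,
        // so the calcSum launches can run concurrently
        cudaStream_t stream;
        cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
        calcSum<<<1, 1, 0, stream>>>(v, E, size, t[i], i, rhoind,
                                     blockResult + i * BLOCK_SIZE * size * size,
                                     result, resultIndexes);
        // destroying the stream here is fine: already submitted work still runs to completion
        cudaStreamDestroy(stream);
    }
}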

"[A]ccessing the same part of global memory" from multiple kernels has nothing to do with this. As long as the kernels are only reading from the same locations and not writing to them, this is not problematic at all. Writing to the same locations would cause race conditions and therefore potentially non-deterministic output, but the CUDA runtime can not detect this and wont "sequentialize" your kernels to get around it.

As discussed in the comments, I don't think CDP is needed here at all; it is potentially expensive and, in this form, not future-proof. So performance will most probably not be ideal.
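If you drop CDP at the top level, a sketch of the host-side equivalent could look like the following (this assumes the t values are also available on the host as a hypothetical t_host array; calcSum is still the kernel from the question, so its internal CDP remains for now):

// one stream per t value so the calcSum launches can overlap
std::vector<cudaStream_t> streams(t_size);   // needs #include <vector>
for (int i = 0; i < t_size; ++i) {
    cudaStreamCreateWithFlags(&streams[i], cudaStreamNonBlocking);
    calcSum<<<1, 1, 0, streams[i]>>>(v, E, size, t_host[i], i, rhoind,
                                     blockResult + i * BLOCK_SIZE * size * size,
                                     result, resultIndexes);
}
for (int i = 0; i < t_size; ++i) {
    cudaStreamSynchronize(streams[i]);
    cudaStreamDestroy(streams[i]);
}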

  • Thank you very much! Using multiple explicit streams made it much faster! It looks like there is a limit to the number of streams you can use effectively. I have not tried rewriting it without CDP yet, but I think it is close to its peak performance, considering that the asymptotic complexity is `O(n^4)`. Anyway, lots of new information for me, I appreciate it! – Daniil Tarpanov Jan 14 '23 at 16:27