
I have two kernels that process some data sequentially (each launched with only one thread). I want to combine the two so that I have one kernel to launch with two threads. After doing so, I expected an execution time of max(kernel1, kernel2), but what I got was the sum of the two execution times. I narrowed the problem down to something like the code below.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <iostream>
#include <string>
#include <vector>
#include <random>
#include <functional>
#include <algorithm>
#include <iterator>

__global__ void dummyKernel(const float *d_data_Re, const float *d_data_Im,
    float *d_out_Re, float *d_out_Im, const int dataLen) {
    int i = threadIdx.x;  // threadIdx.x is unsigned; brace-init would be a narrowing conversion
    if (i == 0) {
        printf("Thread zero started \n");
        for (int j{}; j < 1000000; j++)
            d_out_Re[j%dataLen] = sqrtf(2) + d_data_Re[j%dataLen] * (j % 4 == 1);
        printf("Thread zero finished \n");
    }
    else if (i == 1) {
        printf("Thread one started \n");
        for (int j{}; j < 1000000; j++)
            d_out_Im[j%dataLen] = sqrtf(2) + d_data_Im[j%dataLen] * (j % 4 == 1);
        printf("Thread one finished \n");
    }
}

__global__ void dummyKernel2(const float *d_data_Re, const float *d_data_Im,
    float *d_out_Re, float *d_out_Im, const int dataLen) {
    int i = threadIdx.x;  // threadIdx.x is unsigned; brace-init would be a narrowing conversion
    //if (i == 0) {
        printf("Thread zero started \n");
        for (int j{}; j < 1000000; j++)
            d_out_Re[j%dataLen] = sqrtf(2) + d_data_Re[j%dataLen] * (j % 4 == 1);
        printf("Thread zero finished \n");
    //}
    //else if (i == 1) {
    //  printf("Thread one started \n");
    //  for (int j{}; j < 1000000; j++)
    //      d_out_Im[j%dataLen] = sqrtf(2) + d_data_Im[j%dataLen] * (j % 4 == 1);
    //  printf("Thread one finished \n");
    //}
}

int main()
{
    cudaError_t cudaStatus = cudaSetDevice(0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
        return 1;
    }

    const int sizeOfFrame = 2 * 1024 * 1024;
    std::vector<float> data_re(sizeOfFrame), data_im;
    //random number generator
    std::uniform_real_distribution<float> distribution(0.0f, 2.0f); //Values between 0 and 2
    std::mt19937 engine; // Mersenne twister MT19937
    auto generator = std::bind(distribution, engine);
    std::generate_n(data_re.begin(), sizeOfFrame, generator);
    std::copy(data_re.begin(), data_re.end(), std::back_inserter(data_im));
    //

    float *d_data_re, *d_data_im;
    cudaMalloc(&d_data_re, sizeOfFrame * sizeof(float));
    cudaMalloc(&d_data_im, sizeOfFrame * sizeof(float));
    cudaMemcpy(d_data_re, data_re.data(), sizeOfFrame * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_data_im, data_im.data(), sizeOfFrame * sizeof(float), cudaMemcpyHostToDevice);

    float *d_pll_out_re, *d_pll_out_im;
    cudaMalloc(&d_pll_out_re, sizeOfFrame * sizeof(float));
    cudaMalloc(&d_pll_out_im, sizeOfFrame * sizeof(float));

    dummyKernel<<<1, 2>>>(d_data_re, d_data_im,
        d_pll_out_re, d_pll_out_im, sizeOfFrame);
    cudaDeviceSynchronize();

    // cudaDeviceReset must be called before exiting in order for profiling and
    // tracing tools such as Nsight and Visual Profiler to show complete traces.
    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed!");
        return 1;
    }

    return 0;
}

btw I got the code for the random number generator from an answer to this question. So dummyKernel doesn't do anything useful; I just wanted a kernel that took relatively long to finish. If you launch dummyKernel, the order of the output will be "Thread zero started", "Thread zero finished", "Thread one started", "Thread one finished". Sequential. But if you launch dummyKernel2, the order of the output will be "Thread zero started", "Thread zero started", "Thread zero finished", "Thread zero finished", and the execution time is almost half that of dummyKernel. I don't understand this behavior and the effect of the if-else I used. OS: Windows 10, GTX 1050 Ti, CUDA Driver/Runtime version: 11.1/10.1.
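In case the measurement method matters: a minimal timing sketch using CUDA events (one standard approach; an assumption, since my exact measurement code is not shown above):

// hedged sketch: time the kernel with the standard cudaEvent API
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
dummyKernel<<<1, 2>>>(d_data_re, d_data_im, d_pll_out_re, d_pll_out_im, sizeOfFrame);
cudaEventRecord(stop);
cudaEventSynchronize(stop);  // wait until the kernel and the stop event have completed
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("kernel time: %.3f ms\n", ms);
cudaEventDestroy(start);
cudaEventDestroy(stop);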

nasy
  • All conditional branches are executed serially within a warp. This is expected on GPUs (and more specifically of the SIMT model). Please read: https://stackoverflow.com/questions/11687500/how-does-cuda-exactly-synchronize-threads-in-a-warp-at-barriers-and-conditional (and likely the CUDA manual). – Jérôme Richard Jan 01 '22 at 10:47
  • So try to start 33 threads (0..32), let threads 1-31 do nothing, and use threads 0 and 32 to see if it makes a change (just for testing/educational purposes - that is not a good setup for parallelization; see the sketch after these comments). – Sebastian Jan 01 '22 at 11:23
  • @JérômeRichard Thank you, it's starting to make sense now – nasy Jan 01 '22 at 11:26
  • @Sebastian Thank you, tested it and it solved my problem. – nasy Jan 01 '22 at 12:14
  • On Nvidia GPUs newer than yours (Volta and later vs. your Pascal), each thread of a warp may execute a different program code location (this is called Independent Thread Scheduling), but it is still advantageous to keep the threads of a warp from diverging, especially for coalesced memory accesses. And there is no guarantee that Nvidia won't revert the behaviour for future architectures to save die area, or e.g. will have 8 subgroups of 4 threads each, which cannot diverge. – Sebastian Jan 01 '22 at 13:28
  • @Sebastian I don't know if I should ask this in a separate question, but I think it could be related to this one. If I have a __syncthreads() in an if condition on threadIdx.x (for example, I want two groups of threads to do two different things), it causes a deadlock on an RTX 3080, whereas on a GTX 1050 Ti it works fine (tested on both). How can this be explained? – nasy Jan 02 '22 at 05:38
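A minimal sketch of the test suggested in the comments (the kernel name dummyKernelTwoWarps is made up; the body mirrors dummyKernel from the question): with 33 threads, the two working threads 0 and 32 sit in different warps, so the two branches no longer serialize each other.

__global__ void dummyKernelTwoWarps(const float *d_data_Re, const float *d_data_Im,
    float *d_out_Re, float *d_out_Im, const int dataLen) {
    int i = threadIdx.x;
    if (i == 0) {            // lane 0 of the first warp (threads 0..31)
        for (int j{}; j < 1000000; j++)
            d_out_Re[j % dataLen] = sqrtf(2) + d_data_Re[j % dataLen] * (j % 4 == 1);
    }
    else if (i == 32) {      // lane 0 of the second warp; threads 1..31 stay idle
        for (int j{}; j < 1000000; j++)
            d_out_Im[j % dataLen] = sqrtf(2) + d_data_Im[j % dataLen] * (j % 4 == 1);
    }
}

// launched as:
// dummyKernelTwoWarps<<<1, 33>>>(d_data_re, d_data_im, d_pll_out_re, d_pll_out_im, sizeOfFrame);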

1 Answer


Each CUDA multiprocessor has execution units (several each for int, float, special functions, ...). Those work as pipelines: a calculation takes several cycles to complete, but in each cycle a new calculation can be inserted (= scheduled), so several calculations are processed at the same time at different stages of the pipeline.

Groups of 32 threads (warps) within a block are issued the same instruction at the same time (in the same cycle, or often two cycles, depending on how many execution and datapath resources are available on the architecture and needed for this instruction), together with a bitfield stating for which threads the instruction should be actively executed. If some threads of a warp evaluated an if clause as false, they are temporarily deactivated. Or some threads may have already exited the kernel.

The effect is that if the 32 threads of a warp diverge (branch differently), each execution path has to be run through by the whole warp (with the threads not taking that path deactivated). That should be avoided for performance reasons, as the computation resources are reserved nevertheless. Threads from different warps don't have this interdependency. The algorithm should be structured in a way that takes this into account, as in the sketch below.
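For illustration, a hedged sketch of such a restructuring (the kernel name and launch shape are made up, and the j % 4 factor is applied per element index rather than per loop iteration, so it is not bit-for-bit the computation from the question): the same per-element work is spread over full warps, so all 32 threads of each warp take the same path and neighbouring threads access neighbouring elements.

__global__ void dummyKernelCoalesced(const float *d_data_Re, const float *d_data_Im,
    float *d_out_Re, float *d_out_Im, const int dataLen) {
    // grid-stride loop: no divergence, coalesced loads and stores
    for (int j = blockIdx.x * blockDim.x + threadIdx.x; j < dataLen;
         j += gridDim.x * blockDim.x) {
        d_out_Re[j] = sqrtf(2) + d_data_Re[j] * (j % 4 == 1);
        d_out_Im[j] = sqrtf(2) + d_data_Im[j] * (j % 4 == 1);
    }
}

// e.g. dummyKernelCoalesced<<<(sizeOfFrame + 255) / 256, 256>>>(d_data_re, d_data_im, d_pll_out_re, d_pll_out_im, sizeOfFrame);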

With Volta, Independent Thread Scheduling was introduced: each thread has its own instruction counter (and manages a separate function call stack). But the scheduler still schedules groups of 32 threads (warps) with bitfields of active threads. What changed is that the scheduler can interleave the diverging paths. Instead of executing CCCIIIEEECCC pre-Volta (instructions: C = common, I = if branch, E = else branch), it could execute CCCIEEIIECCC, if that better fits the available execution units or the memory latency. As a programmer, one has to be careful, as it can no longer be assumed that the threads have not diverged, even when they are executing the same instruction. That is why __syncwarp was introduced, and all kinds of cooperation functions (e.g. the shuffle instructions) got a sync variant. Nevertheless (although we cannot know for sure whether the threads diverged), one still has to program in a way that all 32 threads can work together, if executed synchronously, especially for coalesced memory accesses. Putting __syncwarp after each possibly diverging instruction can help to ensure convergence (but do performance profiling).
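A small hedged sketch of that pattern (names are illustrative; assumes the block size is a multiple of 32):

__global__ void reconvergeThenReduce(const float *in, float *out) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32;
    float v = in[idx];
    if (lane < 16)    // the warp diverges here ...
        v *= 2.0f;
    else
        v += 1.0f;
    __syncwarp();     // ... and is explicitly reconverged here
    // after reconvergence, a full-mask warp reduction with the _sync
    // shuffle variant is safe:
    for (int offset = 16; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffff, v, offset);
    if (lane == 0)
        out[idx / 32] = v;   // one result per warp
}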

Independent Thread Scheduling is also the reason why __syncthreads must definitely be called correctly on the RTX 3080, with every thread of the block participating. A typical fix for the deadlock case you mention in the comments is to close the if clause, synchronize all the threads, and open a new if clause with the same condition as the previous one, as in the sketch below.
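A hedged sketch of that fix (the condition and the per-group work are placeholders):

// Deadlocks on Volta+ (e.g. the RTX 3080), because not every thread of the
// block reaches the same barrier:
//
//     if (threadIdx.x < 64) { /* work A1 */ __syncthreads(); /* work A2 */ }
//     else                  { /* work B1 */ __syncthreads(); /* work B2 */ }
//
// Correct pattern: close the if, let ALL threads hit the barrier, reopen the if.
__global__ void twoGroupsFixed(float *data) {
    bool groupA = threadIdx.x < 64;   // placeholder split of the block
    if (groupA)
        data[threadIdx.x] *= 2.0f;    // group A, phase 1 (placeholder work)
    else
        data[threadIdx.x] += 3.0f;    // group B, phase 1 (placeholder work)
    __syncthreads();                  // barrier reached by every thread
    if (groupA)
        data[threadIdx.x] -= 1.0f;    // group A, phase 2 (placeholder work)
    else
        data[threadIdx.x] *= 0.5f;    // group B, phase 2 (placeholder work)
}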

Sebastian