Are there any restrictions on CUDA kernel array length?

Question

I would like to generate an array of integers on the GPU. I found some solutions in the NVIDIA documentation and, based on them, wrote the simple code below. It works well, but only if the arraySize variable is less than or equal to 291670. For greater values, cudaDeviceSynchronize() returns cudaErrorLaunchFailure (error 4), "unspecified launch failure".

In my solution I need much longer arrays.

Is this some restriction on array length or my fault?

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <curand_kernel.h>
#include <helper_cuda.h>
#include <curand.h>
#include <stdio.h>
#include <algorithm>
#include <ctime>
#include <iostream>
#include <cstdlib>
// Plain host constants: these values are only read in host code.
const int MAX_THREADS_PER_BLOCK = 1024;
const int MAX_BLOCKS = 65535;
const unsigned int arraySize = 291670;
__global__ void _rndInit_(unsigned int seed, curandState_t* states, unsigned int arraySize) {
    long tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < arraySize) {
        curand_init(seed, tid, 0, &states[tid]);
        tid += blockDim.x * gridDim.x;
    }
}
void rndInit(unsigned int seed, curandState_t* states, int arraySize) {
    int threads = 128;
    int blocks = std::min((arraySize + threads - 1) / threads, MAX_BLOCKS);
    // Pass the caller's seed through instead of ignoring the parameter.
    _rndInit_ <<< blocks, threads >>>(seed, states, arraySize);
}
int main() {
    curandState_t* d_states;
    cudaError_t cudaStatus;
    checkCudaErrors(cudaMalloc((void**)&d_states, arraySize * sizeof(curandState_t)));

    rndInit(time(0), d_states, arraySize);

    cudaStatus = cudaDeviceSynchronize();
    if (cudaStatus != cudaSuccess) 
        std::cout << cudaStatus;
    cudaFree(d_states);
    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        std::cout << cudaStatus;
        return 1;
    }
    return 0;
}

Your code runs fine for me even for `arraySize` of 500000. My guess is a WDDM timeout. Are you running this on Windows? What GPU? Is it a debug or release project? — Robert Crovella, Mar 15 '18 at 14:52
Thanks for answering. I run it on Windows 10 with a GTX 1070 (driver 388.19), as a debug project in the current VS 2017. — Mariusz Maleszak, Mar 15 '18 at 15:07
You're very likely hitting a WDDM timeout. As a simple test, try running a release project and see if you can go above 291670. Probably you will be able to. You will hit some higher limit eventually. To learn how to work around the WDDM timeout issue, google "WDDM timeout" and start reading. You can disable or modify the WDDM timeout mechanism in Nsight VSE. See [here](http://docs.nvidia.com/nsight-visual-studio-edition/Nsight_Visual_Studio_Edition_User_Guide.htm#Timeout_Detection_Recovery.htm). — Robert Crovella, Mar 15 '18 at 15:13
You're right! Thank you for the help! I would never have thought of it. Now I can use an array of even 2 million elements, but there is still a limit (about 2.5 million elements, even in a release build with the WDDM TDR Delay set to 10). I wonder whether there is any other way to deal with this problem without turning off TDR... — Mariusz Maleszak, Mar 16 '18 at 08:52
Not that I know of for your current setup. TDR limits the kernel duration. If you use a GPU that can be placed in TCC mode on Windows, you can eliminate the TDR watchdog for that GPU. And in Linux, if your GPU is not hosting a display, it will not have this limit. — Robert Crovella, Mar 16 '18 at 14:33
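
Since TDR limits the duration of each kernel rather than the array size, one workaround that is sometimes used is to split the initialisation into several shorter launches. The following is only a minimal sketch of that idea, not something proposed in the comments above: the names `rndInitChunk`, `rndInitChunked` and `numChunks`, and the `chunk` parameter, are hypothetical, and a chunk size that keeps each launch under the watchdog limit would have to be found by experiment on the actual GPU and build type.

```cuda
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <algorithm>
#include <cstddef>

// Initialise states[offset .. offset + count) with a grid-stride loop.
__global__ void rndInitChunk(unsigned long long seed, curandState_t* states,
                             size_t offset, size_t count) {
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = threadIdx.x + (size_t)blockIdx.x * blockDim.x;
         i < count; i += stride) {
        size_t id = offset + i;
        // The sequence number is the global element index, as in the question's kernel.
        curand_init(seed, id, 0, &states[id]);
    }
}

// Hypothetical helper: number of launches needed to cover `total` elements.
inline size_t numChunks(size_t total, size_t chunk) {
    return (total + chunk - 1) / chunk;
}

// Hypothetical helper: one launch per chunk, synchronising after each launch
// so that every kernel is submitted and finished on its own.
cudaError_t rndInitChunked(unsigned long long seed, curandState_t* states,
                           size_t total, size_t chunk) {
    const int threads = 128;
    const size_t maxBlocks = 65535;
    if (chunk == 0) return cudaErrorInvalidValue;
    for (size_t offset = 0; offset < total; offset += chunk) {
        size_t count = std::min(chunk, total - offset);
        int blocks = (int)std::min((count + threads - 1) / threads, maxBlocks);
        rndInitChunk<<<blocks, threads>>>(seed, states, offset, count);
        cudaError_t err = cudaGetLastError();   // launch-time errors
        if (err != cudaSuccess) return err;
        err = cudaDeviceSynchronize();          // run-time errors, e.g. a TDR reset
        if (err != cudaSuccess) return err;
    }
    return cudaSuccess;
}
```

In the question's `main`, the call `rndInit(time(0), d_states, arraySize)` could then be replaced by something like `rndInitChunked(time(0), d_states, arraySize, 100000)`, where 100000 is an assumed chunk size chosen below the 291670 elements that already worked in the debug build. The total run time is not reduced by this; only the duration of each individual kernel is.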
