Questions tagged [cuda]

CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model for NVIDIA GPUs (Graphics Processing Units). CUDA provides an interface to NVIDIA GPUs through a variety of programming languages, libraries, and APIs.

Before posting CUDA questions, please read "How to get useful answers to your CUDA questions" below.

CUDA has an online documentation repository, updated with each release, including references for APIs and libraries; user guides for applications; and a detailed CUDA C/C++ Programming Guide.

The CUDA platform enables application development using several languages and associated APIs, including:

  • CUDA C/C++, compiled with nvcc
  • CUDA Fortran, compiled with the NVIDIA HPC SDK compilers
  • the CUDA runtime and driver APIs, callable from C and C++
  • Python, through projects such as Numba, CuPy, and cuda-python

There also exist third-party bindings for using CUDA in other languages and programming environments, such as Managed CUDA for .NET languages (including C#).

You should ask questions about CUDA here on Stack Overflow, but if you have bugs to report you should discuss them on the CUDA forums or report them via the registered developer portal. You may want to cross-link to any discussion here on SO.

The CUDA execution model is not multithreading in the usual sense, so please do not tag CUDA questions with [multithreading] unless your question involves thread safety of the CUDA APIs, or the use of both normal CPU multithreading and CUDA together.

How to get useful answers to your CUDA questions

Here are a number of suggestions for users new to CUDA. Follow these suggestions before asking your question and you are much more likely to get a satisfactory answer!

  • Always check the result codes returned by CUDA API functions to ensure you are getting cudaSuccess. If you are not, and you don't know why, include the information about the error in your question. This includes checking for errors caused by the most recent kernel launch, which may not be reported until you have called cudaDeviceSynchronize() or cudaStreamSynchronize(). More on checking for errors in CUDA in this question.
  • If you are getting unspecified launch failure, your code may be causing a segmentation fault: accessing memory that was not allocated for it to use. Verify that your indexing is correct, and check whether the CUDA Compute Sanitizer (or the legacy cuda-memcheck, available for older GPUs up to CUDA 12) reports any errors. Note that both tools encompass more than the default Memcheck; the other tools (Racecheck, Initcheck, Synccheck) must be selected explicitly.
  • The debugger for CUDA, cuda-gdb, is also very useful when you are not sure where your code goes wrong. It lets you monitor resources at the warp, thread, block, SM and grid level, and follow your program's execution. If a segmentation fault occurs in your program, cuda-gdb can help you find where the crash occurred and inspect the context. If you prefer a GUI for debugging, there are IDE plugins/editions for/of Visual Studio (Windows), Visual Studio Code (Windows/Mac/Linux, but the GPU used for debugging must be on a Linux system) and Eclipse (Linux).
  • If you are finding that you are getting syntax errors on CUDA keywords when compiling device code, make sure you are compiling using nvcc (or clang with CUDA support enabled) and that your source file has the expected .cu extension. If you find that CUDA device functions or feature namespaces you expect to work are not found (atomic functions, warp voting functions, half-precision arithmetic, cooperative groups, etc.), ensure that you are explicitly passing compilation arguments which enable architecture settings which support those features.
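The error-checking advice above can be sketched as a minimal pattern (the macro name is illustrative, and this is one common idiom rather than the only one):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every runtime API call; check kernel launches separately, since
// launch and execution errors surface asynchronously.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                   \
                    cudaGetErrorString(err_), __FILE__, __LINE__);        \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

__global__ void kernel(int *out) { out[threadIdx.x] = threadIdx.x; }

int main() {
    int *d_out;
    CUDA_CHECK(cudaMalloc(&d_out, 32 * sizeof(int)));
    kernel<<<1, 32>>>(d_out);
    CUDA_CHECK(cudaGetLastError());       // catches invalid launch configurations
    CUDA_CHECK(cudaDeviceSynchronize());  // catches errors during kernel execution
    CUDA_CHECK(cudaFree(d_out));
    return 0;
}
```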

14278 questions
5
votes
3 answers

Can I somehow run X11 on the Intel integrated graphics in my optimus laptop and debug CUDA code on the NVIDIA GPU?

I know I can debug CUDA on linux using cuda-gdb without GUI, but that's not really convenient. I also know that one can debug CUDA with Nsight Eclipse edition if X server is running on other GPU. So I have dual GPU laptop (geforce 525m and Intel…
Ognjen Kocic
  • 118
  • 2
  • 9
5
votes
1 answer

When to use volatile with register/local variables

What is the meaning of declaring register arrays in CUDA with volatile qualifier? When I tried with volatile keyword with a register array, it removed the number of spilled register memory to local memory. (i.e. Force the CUDA to use registers…
warunapww
  • 966
  • 4
  • 18
  • 38
5
votes
1 answer

About CUDA's architecture (SM, SP)

I am a person just starting the CUDA programming. There seems to be a concept of SP SM and the CUDA architecture. I'd tried to run the deviceQuery.cpp of sample source I think what works and SP SM development of their environment, It has become not…
kuu
  • 197
  • 1
  • 4
  • 14
5
votes
1 answer

Creating a cuda stream on each host thread (multi-threaded CPU)

I have a multi-threaded CPU and I would like each thread of the CPU to be able to launch a seperate CUDA stream. The seperate CPU threads will be doing different things at different times so there is a chance that they won't overlap but if they do…
Miggy
  • 79
  • 1
  • 6
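For questions like the one above, each host thread can simply create and own its own stream; stream handles are thread-safe, and per-thread streams let independent work overlap instead of serializing on the legacy default stream. A minimal sketch (names are illustrative; error checking omitted for brevity):

```cuda
#include <cuda_runtime.h>

// Each CPU thread runs this with its own pointers; the stream it creates
// is private to it, so async work queued here can overlap with other threads'.
void worker(int n, const float *d_src, float *d_dst) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemcpyAsync(d_dst, d_src, n * sizeof(float),
                    cudaMemcpyDeviceToDevice, stream);  // queued on this stream only
    cudaStreamSynchronize(stream);                      // wait for this thread's work
    cudaStreamDestroy(stream);
}
```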
5
votes
1 answer

How can I use GPU-DMA from GPU-CUDA code to copying data?

With CUDA SDK 5.5 I can use to copying data: from host: cudaMemcpy(); to use GPU-DMA if memory pinned from host: memcpy(); or cudaMemcpy(); to use CPU Cores if memory isn't pinned from gpu: for() { dst[i] = src[i]; } or memcpy(); to use GPU…
Alex
  • 12,578
  • 15
  • 99
  • 195
5
votes
1 answer

Transforming a one-dimensional, "flattened" index into the N-dimensional vector index of an N-dimensional array

I have an N-dimensional array, with the same number of items (i.e. the same "length") in each dimension. Given a one-dimensional index into the array, I want a function that returns the coordinates associated with that index. The way that the array…
weemattisnot
  • 889
  • 5
  • 16
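For row-major storage where each of the N dimensions has the same length, the unflattening asked about above amounts to repeated division and modulo, working from the fastest-varying axis up. A host-side C++ sketch (the function name is illustrative):

```cpp
#include <array>
#include <cstddef>

// Convert a flat index into N-dimensional coordinates, assuming every
// dimension has the same extent `len` and the last dimension varies
// fastest (row-major order).
template <std::size_t N>
std::array<std::size_t, N> unflatten(std::size_t flat, std::size_t len) {
    std::array<std::size_t, N> coord{};
    for (std::size_t d = N; d-- > 0; ) {  // fill from the fastest axis up
        coord[d] = flat % len;
        flat /= len;
    }
    return coord;
}
```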
5
votes
5 answers

CUDA allocating array of arrays

I have some trouble with allocate array of arrays in CUDA. void ** data; cudaMalloc(&data, sizeof(void**)*N); // allocates without problems for(int i = 0; i < N; i++) { cudaMalloc(data + i, getSize(i) * sizeof(void*)); // seg fault is…
user216179
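The usual fix for the problem in this question is to build the table of row pointers on the host, cudaMalloc each row through that host table, and only then copy the pointer table to the device: calling cudaMalloc(data + i, …) writes through a device pointer from host code, which is what segfaults. A sketch (assuming a getSize() as in the question; error checking omitted):

```cuda
#include <cuda_runtime.h>

// Two-step allocation of a device "array of arrays" (jagged array).
float **alloc_jagged(int N, size_t (*getSize)(int)) {
    float **h_rows = new float*[N];          // pointer table lives on the host first
    for (int i = 0; i < N; ++i)
        cudaMalloc(&h_rows[i], getSize(i) * sizeof(float));  // each row on device
    float **d_rows;                          // device copy of the pointer table
    cudaMalloc(&d_rows, N * sizeof(float *));
    cudaMemcpy(d_rows, h_rows, N * sizeof(float *), cudaMemcpyHostToDevice);
    delete[] h_rows;
    return d_rows;                           // kernels can index d_rows[i][j]
}
```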
5
votes
2 answers

How to separate the kernel file CUDA with the main .cpp file

When I build the code with kernelAdd() function and main() function in the same file mainFunc.cu, it's ok. But when I separate the kernelAdd() function in the kernelAdd.cu file and the main file in main.cpp file, it's built with the 2 errors: "error…
HongTu
  • 55
  • 1
  • 7
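The common pattern for this split is to keep the <<<…>>> launch inside the .cu file behind a plain C++ wrapper function that main.cpp only declares and calls, so the host compiler never sees CUDA syntax. A sketch, with both files shown in one block (file and function names are illustrative):

```cuda
// --- kernelAdd.cu (compiled by nvcc) ---
__global__ void kernelAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Wrapper with a plain C linkage name; hides the launch syntax from C++.
extern "C" void launchAdd(const float *a, const float *b, float *c, int n) {
    kernelAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
}

// --- main.cpp (compiled by the host C++ compiler) ---
// Only the declaration is needed; link the two objects together.
// extern "C" void launchAdd(const float *a, const float *b, float *c, int n);
```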
5
votes
1 answer

Defining templated constant variables in cuda

How do I implement templated constant variable in cuda. I have a struct template mystruct{ T d1; T d2[10];} I want to have a constant variable with the above struct and use a code something like below (code may not be correct at this…
user1612986
  • 1,373
  • 3
  • 22
  • 38
5
votes
1 answer

Why is cuFFT so slow?

I'm hoping to accelerate a computer vision application that computes many FFTs using FFTW and OpenMP on an Intel CPU. However, for a variety of FFT problem sizes, I've found that cuFFT is slower than FFTW with OpenMP. In the experiments and…
solvingPuzzles
  • 8,541
  • 16
  • 69
  • 112
5
votes
1 answer

how can I use cudaStreamAddCallback() with a class member method?

I'm trying to synchronise my cuda routine by using cudaStreamAddCallback(), but I can't implement it, also because the documentation is not unambiguous. The cuda-C-programming-guide says that the callback has to be defined as: void CUDART_CB…
GregPhil
  • 475
  • 1
  • 8
  • 20
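The usual workaround for the question above is a static trampoline function with the CUDART_CB calling convention that recovers the object from the userData argument, since cudaStreamAddCallback() accepts only a plain function pointer, not a member-function pointer. A sketch (class and method names are illustrative):

```cuda
#include <cuda_runtime.h>

class Worker {
public:
    void onDone(cudaStream_t stream, cudaError_t status) {
        // handle completion; a callback must not call CUDA API functions
    }
    // Static trampoline matching cudaStreamCallback_t; userData carries `this`.
    static void CUDART_CB trampoline(cudaStream_t stream, cudaError_t status,
                                     void *userData) {
        static_cast<Worker *>(userData)->onDone(stream, status);
    }
    void enqueue(cudaStream_t stream) {
        cudaStreamAddCallback(stream, &Worker::trampoline, this, 0);
    }
};
```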
5
votes
2 answers

Declaring Variables in a CUDA kernel

Say you declare a new variable in a CUDA kernel and then use it in multiple threads, like: __global__ void kernel(float* delt, float* deltb) { int i = blockIdx.x * blockDim.x + threadIdx.x; float a; a = delt[i] + deltb[i]; a += 1; } and the kernel…
John W.
  • 153
  • 2
  • 8
5
votes
1 answer

Can you Program/Test CUDA in a Virtual Machine?

I ask this as a programming and environment question. Can you test/program CUDA within a virtual machine accessing the physical GPU card? I am buying a new (really nice system) to, in part, experiment with basic CUDA programming. The processor will…
SaB
  • 747
  • 1
  • 9
  • 25
5
votes
2 answers

Integrating CUDA into a C++ application to use existing C++ class

I have an existing application that uses a C++ class, a C++ wrapper, and FORTRAN code for the computationally intensive parts of the application. I would like to implement parts of the FORTRAN in CUDA to take advantage of parallelization, but I…
John W.
  • 153
  • 2
  • 8
5
votes
2 answers

CUDA - how much slower is transferring over PCI-E?

If I transfer a single byte from a CUDA kernel to PCI-E to the host (zero-copy memory), how much is it slow compared to transferring something like 200 Megabytes? What I would like to know, since I know that transferring over PCI-E is slow for a…
Marco A.
  • 43,032
  • 26
  • 132
  • 246