Questions tagged [cuda]

CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and programming model for GPUs (Graphics Processing Units). CUDA provides an interface to NVIDIA GPUs through a variety of programming languages, libraries, and APIs. Before posting CUDA questions, please read "How to get useful answers to your CUDA questions" below.

CUDA has an online documentation repository, updated with each release, including references for APIs and libraries; user guides for applications; and a detailed CUDA C/C++ Programming Guide.

The CUDA platform enables application development using several languages and associated APIs, such as CUDA C/C++ (compiled with nvcc) and CUDA Fortran.

There also exist third-party bindings for using CUDA in other languages and programming environments, such as Managed CUDA for .NET languages (including C#).

You should ask questions about CUDA here on Stack Overflow, but if you have bugs to report you should discuss them on the CUDA forums or report them via the registered developer portal. You may want to cross-link to any discussion here on SO.

The CUDA execution model is not multithreading in the usual sense, so please do not tag CUDA questions with [multithreading] unless your question involves thread safety of the CUDA APIs, or the use of both normal CPU multithreading and CUDA together.

How to get useful answers to your CUDA questions

Here are a number of suggestions for users new to CUDA. Follow these suggestions before asking your question and you are much more likely to get a satisfactory answer!

  • Always check the result codes returned by CUDA API functions to ensure you are getting cudaSuccess. If you are not, and you don't know why, include the error information in your question. This includes checking for errors caused by the most recent kernel launch, which may not be available until after you have called cudaDeviceSynchronize() or cudaStreamSynchronize(). More on checking for errors in CUDA in this question.
  • If you are getting unspecified launch failure, it is possible that your code is causing a segmentation fault, meaning the code is accessing memory that is not allocated for it to use. Try to verify that your indexing is correct, and check whether the CUDA Compute Sanitizer (or the legacy cuda-memcheck tool, for older GPUs in toolkits prior to CUDA 12) reports any errors. Note that both tools encompass more than the default Memcheck; the other tools (Racecheck, Initcheck, Synccheck) must be selected explicitly.
  • The debugger for CUDA, cuda-gdb, is also very useful when you are not really sure what you are doing. You can monitor resources at the warp, thread, block, SM, and grid level, and you can follow your program's execution. If a segmentation fault occurs in your program, cuda-gdb can help you find where the crash occurred and see what the context is. If you prefer a GUI for debugging, there are IDE plugins for Visual Studio (Windows), Visual Studio Code (Windows/Mac/Linux, but the GPU used for debugging must be on a Linux system), and Eclipse (Linux).
  • If you are getting syntax errors on CUDA keywords when compiling device code, make sure you are compiling with nvcc (or clang with CUDA support enabled) and that your source file has the expected .cu extension. If CUDA device functions or features you expect to work are not found (atomic functions, warp vote functions, half-precision arithmetic, cooperative groups, etc.), ensure that you are explicitly passing compilation arguments that select an architecture (compute capability) which supports those features.
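The error-checking advice above can be sketched as follows. This is a minimal illustration, not an official pattern: the CUDA_CHECK macro name and the scale kernel are invented for this example; only the runtime API calls themselves are real.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call; abort with file/line info on any
// result other than cudaSuccess.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",            \
                    cudaGetErrorString(err_), __FILE__, __LINE__);  \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;  // guard against out-of-range threads
}

int main() {
    const int n = 1 << 20;
    float *d_x = nullptr;
    CUDA_CHECK(cudaMalloc(&d_x, n * sizeof(float)));

    scale<<<(n + 255) / 256, 256>>>(d_x, n);
    // Launch-configuration errors are reported immediately...
    CUDA_CHECK(cudaGetLastError());
    // ...but errors raised while the kernel runs only surface after a sync.
    CUDA_CHECK(cudaDeviceSynchronize());

    CUDA_CHECK(cudaFree(d_x));
    return 0;
}
```

Compile with nvcc (e.g. nvcc check.cu -o check) and, when hunting invalid memory accesses, run the binary under compute-sanitizer.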

14278 questions
5 votes, 2 answers

Crashing a kernel gracefully

A follow up to: CUDA: Stop all other threads I'm looking for a way to exit a kernel if a "bad condition" occurs. The prog manual say NVCC does not support exception handling. I'm wondering if there is a user defined cuda-error-code. In other words…
Doug
5 votes, 3 answers

Retaining dot product on GPGPU using CUBLAS routine

I am writing a code to compute dot product of two vectors using CUBLAS routine of dot product but it returns the value in host memory. I want to use the dot product for further computation on GPGPU only. How can I make the value reside on GPGPU only…
user1439690
5 votes, 2 answers

CUDA loop over lower triangular matrix

If have a matrix and I only want to access to the lower triangular part of the matrix. I am trying to find a good thread index but so far I have not managed it. Any ideas? I need and index to loop over the lower triangular matrix, say this is my…
Manolete
5 votes, 2 answers

cuda understanding concurrent kernel execution

I'm trying to understand how concurrent kernel execution works. I have written a simple program to try to understand it. The kernel will populate a 2D array using 2 streams. I am getting the correct results when there is 1 stream, no concurrency.…
Beau Bellamy
5 votes, 1 answer

CUDA fft - cooley tukey, how is parallelism exploited?

I know how the FFT implementation works (Cooley-Tuckey algorithm) and I know that there's a CUFFT CUDA library to compute the 1D or 2D FFT quickly, but I'd like to know how CUDA parallelism is exploited in the process. Is it related to the butterfly…
Johnny Pauling
5 votes, 2 answers

Compiling Basic C-Language CUDA code in Linux (Ubuntu)

I've spent a lot of time setting up the CUDA toolchain on a machine running Ubuntu Linux (11.04). The rig has two NVIDIA Tesla GPUs, and I'm able to compile and run test programs from the NVIDIA GPU Computing SDK such as deviceQuery,…
ndgibson
5 votes, 3 answers

driver.Context.synchronize()- what else to take into consideration -- -a clean-up operation failed

I have this code here (modified due to the answer). Info 32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads ptxas info : Used 46 registers, 120 bytes cmem[0], 176 bytes cmem[2], 76 bytes cmem[16] I don't know what else to take…
George
5 votes, 3 answers

Increasing per thread register usage in CUDA

Normally it is advised to lower the per thread register pressure to increase warp occupancy thereby providing greater opportunity to hide latency through warp level multi-threading (TLP). To decrease the register pressure, one would use more per…
nurabha
5 votes, 1 answer

Optix dynamicly sized array in payload

Is there any way to declare a dynamically sized array payload in optix? I've googled and read the Optix documentation, only to find that Optix doesn't allow the use of malloc. Is there any way I could do something like the following? struct…
icebreeze
5 votes, 4 answers

Multi-GPU profiling (Several CPUs , MPI/CUDA Hybrid)

I had a quick look on the forums and I don't think this question has been asked already. I am currently working with an MPI/CUDA hybrid code, made by somebody else during his PhD. Each CPU has its own GPU. My task is to gather data by running the…
VSenicourt
5 votes, 1 answer

fmad=false gives good performance

From Nvidia release notes: The nvcc compiler switch, --fmad (short name: -fmad), to control the contraction of floating-point multiplies and add/subtracts into floating-point multiply-add operations (FMAD, FFMA, or DFMA) has been added: …
Sayan
5 votes, 2 answers

How to evaluate CUDA performance?

I programmed CUDA kernel my own. Compare to CPU code, my kernel code is 10 times faster than CPUs. But I have question with my experiments. Does my program fully be optimized using all GPU cores, proper shared memory use, adequate register count,…
bongmo.kim
5 votes, 1 answer

Print messages in PyCUDA

In simple CUDA programs we can print messages by threads by including cuPrintf.h but doing this in PyCUDA is not explained anywhere. How to do this in PyCUDA?
username_4567
5 votes, 2 answers

CUDA: allocation of an array of structs inside a struct

I've these structs: typedef struct neuron { float* weights; int n_weights; }Neuron; typedef struct neurallayer { Neuron *neurons; int n_neurons; int act_function; }NLayer; "NLayer" struct can contain an arbitrary number of "Neuron" I've…
Andrea Sylar Solla
5 votes, 1 answer

CUDA kernel call from within for loop

I have a CUDA kernel that is called from within a for loop. Something like for(i=0; i<10; i++) { myKernel<<<1000,256>>>(A,i); } Assume now that I have an NVIDIA card with 15 Stream Multiprocessors (SMs). Also assume, for simplicity, that only…
user1586099