Questions tagged [cuda]

CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model for NVIDIA GPUs (Graphics Processing Units). CUDA provides an interface to NVIDIA GPUs through a variety of programming languages, libraries, and APIs.

Before posting CUDA questions, please read "How to get useful answers to your CUDA questions" below.

CUDA has an online documentation repository, updated with each release, including references for APIs and libraries; user guides for applications; and a detailed CUDA C/C++ Programming Guide.

The CUDA platform enables application development using several languages and associated APIs, including CUDA C/C++ (compiled with nvcc) and CUDA Fortran.

There also exist third-party bindings for using CUDA in other languages and programming environments, such as Managed CUDA for .NET languages (including C#).
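To give a flavor of the most common of these interfaces, the CUDA C/C++ runtime API, here is a minimal vector-add sketch (an illustrative example, not taken from any of the questions below; it assumes nvcc and a CUDA-capable device):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread adds one pair of elements.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    // Unified (managed) memory keeps the sketch short;
    // explicit cudaMalloc/cudaMemcpy is the other common pattern.
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Launch enough 256-thread blocks to cover all n elements.
    vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The same computation can also be expressed through the driver API, CUDA Fortran, or third-party bindings such as Managed CUDA.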

You should ask questions about CUDA here on Stack Overflow, but if you have bugs to report you should discuss them on the CUDA forums or report them via the registered developer portal. You may want to cross-link to any discussion here on SO.

The CUDA execution model is not multithreading in the usual sense, so please do not tag CUDA questions with [multithreading] unless your question involves thread safety of the CUDA APIs, or the use of both normal CPU multithreading and CUDA together.

How to get useful answers to your CUDA questions

Here are a number of suggestions for users new to CUDA. Follow them before asking your question and you are much more likely to get a satisfactory answer!

  • Always check the result codes returned by CUDA API functions to ensure you are getting cudaSuccess. If you are not, and you don't know why, include the information about the error in your question. This includes checking for errors caused by the most recent kernel launch, which may not be reported until you've called cudaDeviceSynchronize() or cudaStreamSynchronize(). More on checking for errors in CUDA in this question.
  • If you are getting unspecified launch failure it is possible that your code is causing a segmentation fault, meaning the code is accessing memory that is not allocated for its use. Verify that your indexing is correct and check whether the CUDA Compute Sanitizer (or the legacy cuda-memcheck, used on older GPUs until CUDA 12) reports any errors. Note that both tools encompass more than the default Memcheck; the other tools (Racecheck, Initcheck, Synccheck) must be selected explicitly.
  • The debugger for CUDA, cuda-gdb, is also very useful when you are not really sure what your code is doing. You can monitor resources at the warp, thread, block, SM and grid level, and you can follow your program's execution. If a segmentation fault occurs in your program, cuda-gdb can help you find where the crash occurred and inspect the context. If you prefer a GUI for debugging, there are IDE plugins for Visual Studio (Windows) and Visual Studio Code (Windows/Mac/Linux, though the GPU being debugged must be on a Linux system), and an Eclipse edition (Linux).
  • If you are finding that you are getting syntax errors on CUDA keywords when compiling device code, make sure you are compiling using nvcc (or clang with CUDA support enabled) and that your source file has the expected .cu extension. If you find that CUDA device functions or feature namespaces you expect to work are not found (atomic functions, warp voting functions, half-precision arithmetic, cooperative groups, etc.), ensure that you are explicitly passing compilation arguments which enable architecture settings which support those features.
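The error-checking advice above is commonly implemented as a wrapper macro. A minimal sketch (the macro name CUDA_CHECK is my own choice, not a standard API) that catches both synchronous launch errors and errors that only surface after the kernel has run:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: wrap every CUDA runtime call, abort with a
// descriptive message if the result is anything but cudaSuccess.
#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "CUDA error %s at %s:%d: %s\n",             \
                    cudaGetErrorName(err_), __FILE__, __LINE__,         \
                    cudaGetErrorString(err_));                          \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

__global__ void dummyKernel() {}

int main() {
    dummyKernel<<<1, 1>>>();
    // Catch invalid launch configurations (reported immediately)...
    CUDA_CHECK(cudaGetLastError());
    // ...and asynchronous errors that only appear once the kernel
    // has actually executed.
    CUDA_CHECK(cudaDeviceSynchronize());
    return 0;
}
```

When a question omits this kind of checking, the first answer is very often "add error checking and report what it says".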

14278 questions
5 votes · 2 answers

cuda python GPU numbapro 3d loop poor performance

I am trying to set up a 3D loop with the assignment C(i,j,k) = A(i,j,k) + B(i,j,k) using Python on my GPU. This is my GPU: http://www.geforce.com/hardware/desktop-gpus/geforce-gt-520/specifications The sources I'm looking at / comparing with…
Charles
5 votes · 1 answer

cuda shared memory - inconsistent results

I'm trying to do a parallel reduction to sum an array in CUDA. Currently I pass an array in which to store the sum of the elements in each block. This is my code: #include #include #include #include…
Noel
5 votes · 1 answer

What is the clock measured by clock() and clock64() in CUDA?

What is the clock measured by clock() and clock64() in CUDA? According to the CUDA documentation the clock is a 'per-multiprocessor counter'. My understanding is that this refers to the primary GPU clock (not the shader clock). But when I measure…
Optimus
5 votes · 2 answers

Why am I getting dynamic initialization not supported for __device__, __constant__, __shared__?

I don't understand why I am getting the error dynamic initialization is not supported for __device__, __constant__, __shared__ variables when compiling my code. My code looks like wrapper.cu #include "../Params.hpp" __constant__ Params…
BRabbit27
5 votes · 2 answers

Why should I use the CUDA Driver API instead of CUDA Runtime API?

Why should I use the CUDA Driver API, and in which cases can I not use the CUDA Runtime API (which is more convenient than the Driver API)?
Alex
5 votes · 1 answer

Cuda: pinned memory zero copy problems

I tried the code in this link: Is CUDA pinned memory zero-copy? The asker claims the program worked fine for him, but it does not work the same way for me: the values do not change when I manipulate them in the kernel. Basically my problem is,…
Roshan
5 votes · 2 answers

Cuda: If I use only .x of block and threads, will it still use all available threads in GPU or for that using .y and .z of thread and block is a must?

I have a program which requires maximum use of the GPU. So, can blockDim.x * blockIdx.x + threadIdx.x access all the threads, or is it mandatory to use .y and .z as well?
Roshan
5 votes · 4 answers

Getting array subsets efficiently

Is there an efficient way to take a subset of a C# array and pass it to another piece of code (without modifying the original array)? I use CUDA.net which has a function which copies an array to the GPU. I would like to e.g. pass the function a 10th…
Morten Christiansen
5 votes · 1 answer

Is there any way or even possible to get the overall utilization of a GPU during a period of time?

I am trying to get the information about the overall utilization of a GPU (mine is an NVIDIA Tesla K20, running on Linux) during a period of time. By "overall" I mean something like, how many streaming multi-processors are scheduled to run, and how…
rsm
5 votes · 3 answers

cuda device selection with multiple cpu threads

Can you tell me how the CUDA runtime chooses a GPU device if two or more host threads use the CUDA runtime? Does the runtime choose a separate GPU device for each thread, or does the GPU device need to be set explicitly? Thanks
Anycorn
5 votes · 1 answer

Compile cuda file error: "runtime library" mismatch value 'MDd_DynamicDebug' doesn't match value 'MTd_StaticDebug' in vectorAddition_cuda.o

I tried to compile a CUDA file in a Qt 5.2 and MSVC2012 environment. Before I started my project, I carefully read the question and replies in: Compiling Cuda code in Qt Creator on Windows. But some errors still pop up even though I…
jiangstonybrook
5 votes · 1 answer

openCV 2.4.9 compilation error with CUDA 6.5

I am running an Ubuntu 14.04 system with CUDA 6.5 installed. I am trying to use the GPU implementation of feature matching from the OpenCV library; my OpenCV version is 2.4.9. cmake .. succeeds, but when I make the project it gives me errors…
afsaneh R
5 votes · 1 answer

Global Memory Load/Store Efficiency and Global Memory Coalescence

I have the following simple code: #include #define BLOCKSIZE_X 32 #define BLOCKSIZE_Y 1 int iDivUp(int a, int b) { return ((a % b) != 0) ? (a / b + 1) : (a / b); } #define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); } inline…
Vitality
5 votes · 1 answer

C++11 standard with CUDA 6.0

I want to use the C++11 standard for my C++ files in my CUDA 6.0 project. When I change the compiler in the CUDA 6.0 Nsight Eclipse settings to g++ and add the -std=c++11 option, I receive a lot of errors like this: error: namespace…
Peter VARGA
5 votes · 2 answers

Add CUDA to ROS Package

I would like to use CUDA within a ROS package. Does anyone have a simple example for me? I tried to build a static library with the CUDA function and add this library to my package, but I always get a linking error: Undefined reference cuda... I have…
user2333894