Questions tagged [cuda]

CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and programming model for GPUs (Graphics Processing Units). CUDA provides an interface to NVIDIA GPUs through a variety of programming languages, libraries, and APIs. Before posting CUDA questions, please read "How to get useful answers to your CUDA questions" below.

CUDA has an online documentation repository, updated with each release, including references for APIs and libraries; user guides for applications; and a detailed CUDA C/C++ Programming Guide.

The CUDA platform enables application development using several languages and associated APIs, including:

  • CUDA C/C++, compiled with the nvcc compiler driver
  • CUDA Fortran, supported by the NVIDIA HPC SDK (formerly PGI) compilers
  • the CUDA runtime API and the lower-level CUDA driver API, callable from C and C++

There also exist third-party bindings for using CUDA in other languages and programming environments, such as Managed CUDA for .NET languages (including C#).

You should ask questions about CUDA here on Stack Overflow, but if you have bugs to report you should discuss them on the CUDA forums or report them via the registered developer portal. You may want to cross-link to any discussion here on SO.

The CUDA execution model is not multithreading in the usual sense, so please do not tag CUDA questions with [multithreading] unless your question involves thread safety of the CUDA APIs, or the use of normal CPU multithreading together with CUDA.

How to get useful answers to your CUDA questions

Here are a number of suggestions for users new to CUDA. Follow these suggestions before asking your question and you are much more likely to get a satisfactory answer!

  • Always check the result codes returned by CUDA API functions to ensure you are getting cudaSuccess. If you are not, and you don't know why, include the information about the error in your question. This includes checking for errors caused by the most recent kernel launch, which may not be available until you have called cudaDeviceSynchronize() or cudaStreamSynchronize(). More on checking for errors in CUDA in this question; a minimal error-checking sketch also appears after this list.
  • If you are getting an unspecified launch failure, it is possible that your code is causing a segmentation fault, meaning the code is accessing memory it has not allocated. Verify that your indexing is correct and check whether the CUDA Compute Sanitizer (or the legacy cuda-memcheck tool for older GPUs, available in toolkits before CUDA 12) reports any errors. Note that both tools encompass more than the default Memcheck; the other tools (Racecheck, Initcheck, Synccheck) must be selected explicitly.
  • The debugger for CUDA, cuda-gdb, is also very useful when you are not really sure what you are doing. You can monitor resources at the warp, thread, block, SM, and grid level, and you can follow your program's execution. If a segmentation fault occurs in your program, cuda-gdb can help you find where the crash occurred and inspect the context. If you prefer a GUI for debugging, there are IDE plugins or editions for Visual Studio (Windows), Visual Studio Code (Windows/Mac/Linux, although the GPU used for debugging must be on a Linux system) and Eclipse (Linux).
  • If you are getting syntax errors on CUDA keywords when compiling device code, make sure you are compiling with nvcc (or clang with CUDA support enabled) and that your source file has the expected .cu extension. If CUDA device functions or features you expect to work are not found (atomic functions, warp vote functions, half-precision arithmetic, cooperative groups, etc.), ensure that you are explicitly passing compilation arguments that enable an architecture setting which supports those features; the sketch below shows an example compile command.
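
As an illustration of the first suggestion, here is a minimal error-checking sketch. The CHECK_CUDA macro name is our own convention, but every call it wraps is a standard CUDA runtime API call:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Convenience macro (our own): abort with a readable message if a
    // CUDA runtime call returns anything other than cudaSuccess.
    #define CHECK_CUDA(call)                                               \
        do {                                                               \
            cudaError_t err_ = (call);                                     \
            if (err_ != cudaSuccess) {                                     \
                fprintf(stderr, "CUDA error '%s' at %s:%d\n",              \
                        cudaGetErrorString(err_), __FILE__, __LINE__);     \
                exit(EXIT_FAILURE);                                        \
            }                                                              \
        } while (0)

    __global__ void kernel(int* out) { out[threadIdx.x] = threadIdx.x; }

    int main() {
        int* d_out = nullptr;
        CHECK_CUDA(cudaMalloc((void**)&d_out, 32 * sizeof(int)));

        kernel<<<1, 32>>>(d_out);
        CHECK_CUDA(cudaGetLastError());        // did the launch itself fail?
        CHECK_CUDA(cudaDeviceSynchronize());   // did the kernel fail while running?

        CHECK_CUDA(cudaFree(d_out));
        return 0;
    }

For the last suggestion, the architecture flag goes on the compile line, e.g. nvcc -arch=sm_70 check.cu; features such as half-precision arithmetic or cooperative groups need a sufficiently recent -arch value.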

14,278 questions
5 votes, 1 answer

link cuda with gmp

I am trying to use cuda with the GNU multiple precision library (gmp). When I add gmp instructions like mpf_init() to my device code I get this compiler error: tlgmp.cu(37): error: calling a host function("__gmpf_init") from a __device__/__…
5 votes, 2 answers

copy from GPU to CPU is slower than copying CPU to GPU

I have been learning CUDA for a while and I have the following problem. See how I am doing below: Copy GPU int* B; // ... int *dev_B; //initialize B=0 cudaMalloc((void**)&dev_B, Nel*Nface*sizeof(int)); cudaMemcpy(dev_B, B,…
asked by giorgk
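
For questions like this one, timing methodology usually matters as much as the copy direction. A minimal sketch that times both directions with CUDA events, using pinned host memory for fair results (sizes are arbitrary; this is not the asker's code):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        const size_t n = 1 << 24;                  // ~16M ints, an arbitrary test size
        int *h = nullptr, *d = nullptr;
        cudaMallocHost((void**)&h, n * sizeof(int));  // pinned host memory
        cudaMalloc((void**)&d, n * sizeof(int));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        float ms = 0.0f;

        cudaEventRecord(start);
        cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&ms, start, stop);
        printf("H2D: %.3f ms\n", ms);

        cudaEventRecord(start);
        cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&ms, start, stop);
        printf("D2H: %.3f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d);
        cudaFreeHost(h);
        return 0;
    }
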
5 votes, 1 answer

CUDA shared memory occupy twice the space than needed

I just noticed that my CUDA kernel uses exactly twice the space calculated by 'theory'. e.g. __global__ void foo( ) { __shared__ double t; t = 1; } PTX info shows: ptxas info : Function properties for _Z3foov, 0 bytes stack…
asked by Rainn
5 votes, 2 answers

GTX 680, Keplers and maximum registers per thread

I am asking the following questions as I am confused... On various sites and papers I am finding statements saying that the Kepler architecture has increased the amount of registers per thread, but on my GTX680 this does not seem to be true as the…
asked by Daniel
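
A quick way to see what a given card actually provides is to query its properties at runtime; regsPerBlock and regsPerMultiprocessor are standard cudaDeviceProp fields (error checking omitted, device 0 assumed present):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // query device 0
        printf("%s (sm_%d%d)\n", prop.name, prop.major, prop.minor);
        printf("registers per block: %d\n", prop.regsPerBlock);
        printf("registers per SM:    %d\n", prop.regsPerMultiprocessor);
        return 0;
    }

Compiling with nvcc --ptxas-options=-v reports the registers each kernel actually uses. Note that the per-thread register limit is a separate number from these totals: it is 63 on sm_30 parts such as the GTX 680 (GK104) and was raised to 255 only with sm_35 (GK110), which is the usual source of this confusion.
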
5 votes, 1 answer

Multiplying hundreds of matrices using cuda

I am writing a program which requires multiplying hundreds of matrices in parallel using CUDA. Can somebody explain how to perform this operation? I have seen that the Kepler architecture is capable of dynamic parallelism. Has somebody used this…
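
One established approach that avoids dynamic parallelism entirely is batched GEMM in cuBLAS. A minimal sketch using cublasSgemmBatched (matrix sizes are arbitrary, the matrices are left uninitialized, and error checking is omitted for brevity; link with -lcublas):

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    int main() {
        const int n = 32, batch = 100;          // 100 small n x n matrices
        const size_t bytes = n * n * sizeof(float);

        // One contiguous slab per operand, plus per-matrix pointer arrays.
        float *A, *B, *C;
        cudaMalloc((void**)&A, batch * bytes);
        cudaMalloc((void**)&B, batch * bytes);
        cudaMalloc((void**)&C, batch * bytes);

        float *hA[batch], *hB[batch], *hC[batch];
        for (int i = 0; i < batch; ++i) {
            hA[i] = A + i * n * n;
            hB[i] = B + i * n * n;
            hC[i] = C + i * n * n;
        }
        // cuBLAS expects the pointer arrays themselves in device memory.
        float **dA, **dB, **dC;
        cudaMalloc((void**)&dA, batch * sizeof(float*));
        cudaMalloc((void**)&dB, batch * sizeof(float*));
        cudaMalloc((void**)&dC, batch * sizeof(float*));
        cudaMemcpy(dA, hA, batch * sizeof(float*), cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB, batch * sizeof(float*), cudaMemcpyHostToDevice);
        cudaMemcpy(dC, hC, batch * sizeof(float*), cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);
        const float alpha = 1.0f, beta = 0.0f;
        // C[i] = A[i] * B[i] for all i in a single call (column-major layout).
        cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                           &alpha, (const float**)dA, n, (const float**)dB, n,
                           &beta, dC, n, batch);

        cublasDestroy(handle);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        cudaFree(A);  cudaFree(B);  cudaFree(C);
        return 0;
    }
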
5 votes, 1 answer

CUDA cutil.h where is it?

Does anyone know which SDK/toolkit contains cutil.h, and where? I tried CUDA toolkit 3.2 and toolkit 5.0 (I know cutil.h is no longer supported in this version). I also noticed it mentioned in how to include cutil.h in…
asked by pandaSlayer
5 votes, 3 answers

Compiling Eigen library with nvcc (CUDA)

I tried to compile following program (main.cu) with the nvcc (CUDA 5.0 RC): #include #include int main( int argc, char** argv ) { std::cout << "Pure CUDA" << std::endl; } Unfortunately, I get a bunch of warnings and…
asked by GeorgT
5 votes, 1 answer

The cost of CUDA global memory transactions

According to CUDA 5.0 Programming Guide, if I am using both L1 and L2 caching (on Fermi or Kepler), all global memory operations are done using 128-byte memory transactions. However, if I am using L2 only, 32-byte memory transactions are used…
asked by CygnusX1
5 votes, 2 answers

Can't get simple CUDA program to work

I'm trying the "hello world" program of CUDA programming: adding two vectors together. Here's the program I have tried: #include #include #define SIZE 10 __global__ void vecAdd(float* A, float* B, float* C) { int i =…
asked by Barry Brown
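
For reference, a complete, runnable vector-add of this shape (our own reconstruction, not the asker's code) looks like this:

    #include <cstdio>
    #include <cuda_runtime.h>

    #define SIZE 10

    __global__ void vecAdd(const float* A, const float* B, float* C) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < SIZE)                      // guard against extra threads
            C[i] = A[i] + B[i];
    }

    int main() {
        float hA[SIZE], hB[SIZE], hC[SIZE];
        for (int i = 0; i < SIZE; ++i) { hA[i] = i; hB[i] = 2.0f * i; }

        float *dA, *dB, *dC;
        cudaMalloc((void**)&dA, SIZE * sizeof(float));
        cudaMalloc((void**)&dB, SIZE * sizeof(float));
        cudaMalloc((void**)&dC, SIZE * sizeof(float));
        cudaMemcpy(dA, hA, SIZE * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB, SIZE * sizeof(float), cudaMemcpyHostToDevice);

        vecAdd<<<1, SIZE>>>(dA, dB, dC);
        cudaError_t err = cudaDeviceSynchronize();   // surfaces kernel errors
        if (err != cudaSuccess)
            fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));

        cudaMemcpy(hC, dC, SIZE * sizeof(float), cudaMemcpyDeviceToHost);
        for (int i = 0; i < SIZE; ++i)
            printf("%g + %g = %g\n", hA[i], hB[i], hC[i]);

        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        return 0;
    }
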
5 votes, 1 answer

CUSPARSE_STATUS_INTERNAL_ERROR with cuSparse cusparseSnnz function

I am trying to get familiar with the cuSparse library. In my simple code, the function cusparseSnnz returns the status 6 which is CUSPARSE_STATUS_INTERNAL_ERROR. I think the CUDA driver and cuSparse library are correctly installed. I would be really…
asked by eraser
5 votes, 1 answer

CUDA Unable to see shared memory values in Nsight debugging

I've been struggling for some time with a problem I can't seem to find a solution to. The problem is that when I try to debug my CUDA code using Nvidia Nsight under Visual Studio 2008 I get strange results when using shared memory. My code…
asked by Iam
5 votes, 1 answer

Depth-first search in CUDA / OpenCL

I'm half-way through implementing parallel depth-first search algorithm in MPI and I'm thinking about trying to also do it in CUDA / OpenCL, just for fun / out of curiosity. The algorithm is simple but not trivial. The single-core version in C is…
asked by fhucho
5 votes, 2 answers

CUDA - Coalescing memory accesses and bus width

So the idea that I have about coalescing memory accesses in CUDA is, that threads in a warp should access contiguous memory addresses, as that will only cause a single memory transaction (the values on each address are then broadcast to the threads)…
asked by Alexandre Dias
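
To make the access patterns concrete, here is a hedged sketch contrasting a coalesced kernel with a strided one (kernel names, sizes, and the stride are our own; the data is left uninitialized because only the access pattern matters here):

    #include <cuda_runtime.h>

    __global__ void coalesced(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];              // a warp reads 32 consecutive floats
    }

    __global__ void strided(const float* in, float* out, int n, int stride) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) out[i / stride] = in[i];     // a warp's reads span many segments
    }

    int main() {
        const int n = 1 << 20;
        float *in, *out;
        cudaMalloc((void**)&in, n * sizeof(float));
        cudaMalloc((void**)&out, n * sizeof(float));
        coalesced<<<n / 256, 256>>>(in, out, n);
        strided<<<n / (256 * 4), 256>>>(in, out, n, 4);
        cudaDeviceSynchronize();
        cudaFree(in); cudaFree(out);
        return 0;
    }
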
5 votes, 1 answer

Dynamically detecting a CUDA enabled NVIDIA card and only then initializing the CUDA runtime: How to do?

I have an application which has an algorithm, accelerated with CUDA. There is also a standard CPU implementation of it. We plan to release this application for various platforms, so most of the time, there won't be an NVIDIA card to run the…
asked by Ufuk Can Bicici
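
The usual pattern is to let the first runtime call do the detection: cudaGetDeviceCount fails gracefully, returning an error code instead of crashing, on machines without a usable NVIDIA driver. A minimal sketch (the cudaAvailable helper name is our own):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Returns true if at least one CUDA-capable device is usable.
    bool cudaAvailable() {
        int count = 0;
        cudaError_t err = cudaGetDeviceCount(&count);
        return err == cudaSuccess && count > 0;
    }

    int main() {
        if (cudaAvailable())
            printf("CUDA path selected\n");   // dispatch to the GPU implementation
        else
            printf("CPU fallback selected\n");
        return 0;
    }

This assumes the CUDA runtime library itself can be loaded; linking cudart statically (the nvcc default in modern toolkits) avoids a missing-library failure on machines without CUDA installed.
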
5 votes, 2 answers

Pitch alignment for 2D textures

2D textures are a useful feature of CUDA in image processing applications. To bind pitch linear memory to 2D textures, the memory has to be aligned. cudaMallocPitch is a good option for aligned memory allocation. On my device, the pitch returned by…
asked by sgarizvi
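
For context, a minimal sketch of a pitched allocation and the matching 2D copy; the pitch printed here is the value the question asks about (sizes are arbitrary):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        const int width = 500, height = 100;     // image size in elements
        float* d_img = nullptr;
        size_t pitch = 0;                        // bytes per padded row, chosen by the driver

        // Each row is padded so that every row starts at an aligned address.
        cudaMallocPitch((void**)&d_img, &pitch, width * sizeof(float), height);
        printf("requested %zu bytes/row, got pitch %zu\n", width * sizeof(float), pitch);

        // 2D copies must use the pitch, not the logical row width.
        float* h_img = new float[width * height]();
        cudaMemcpy2D(d_img, pitch, h_img, width * sizeof(float),
                     width * sizeof(float), height, cudaMemcpyHostToDevice);

        delete[] h_img;
        cudaFree(d_img);
        return 0;
    }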