Questions tagged [cuda]

CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model for NVIDIA GPUs (Graphics Processing Units). CUDA provides an interface to NVIDIA GPUs through a variety of programming languages, libraries, and APIs.

Before posting CUDA questions, please read "How to get useful answers to your CUDA questions" below.

CUDA has an online documentation repository, updated with each release, including references for APIs and libraries; user guides for applications; and a detailed CUDA C/C++ Programming Guide.

The CUDA platform enables application development using several languages and associated APIs.

There also exist third-party bindings for using CUDA in other languages and programming environments, such as Managed CUDA for .NET languages (including C#).

You should ask questions about CUDA here on Stack Overflow, but if you have bugs to report you should discuss them on the CUDA forums or report them via the registered developer portal. You may want to cross-link to any discussion here on SO.

The CUDA execution model is not multithreading in the usual sense, so please do not tag CUDA questions with [multithreading] unless your question involves thread safety of the CUDA APIs, or the use of both normal CPU multithreading and CUDA together.

How to get useful answers to your CUDA questions

Here are a number of suggestions to users new to CUDA. Follow these suggestions before asking your question and you are much more likely to get a satisfactory answer!

  • Always check the result codes returned by CUDA API functions to ensure you are getting cudaSuccess. If you are not, and you don't know why, include the information about the error in your question. This includes checking for errors caused by the most recent kernel launch, which may not be available before you've called cudaDeviceSynchronize() or cudaStreamSynchronize(). More on checking for errors in CUDA in this question.
  • If you get an "unspecified launch failure" error, your code may be causing a segmentation fault, meaning it is accessing memory that is not allocated for it to use. Verify that the indexing is correct and check whether the CUDA Compute Sanitizer (or the legacy cuda-memcheck, for older GPUs, until CUDA 12) reports any errors. Note that both tools encompass more than the default Memcheck; the other tools (Racecheck, Initcheck, Synccheck) must be selected explicitly.
  • The debugger for CUDA, cuda-gdb, is also very useful when you are not really sure what you are doing. You can monitor resources at the warp, thread, block, SM and grid level, and you can follow your program's execution. If a segmentation fault occurs in your program, cuda-gdb can help you find where the crash occurred and see what the context is. If you prefer a GUI for debugging, there are IDE plugins or editions for Visual Studio (Windows), Visual Studio Code (Windows/Mac/Linux, but the GPU used for debugging must be on a Linux system) and Eclipse (Linux).
  • If you get syntax errors on CUDA keywords when compiling device code, make sure you are compiling with nvcc (or clang with CUDA support enabled) and that your source file has the expected .cu extension. If CUDA device functions or feature namespaces you expect to work are not found (atomic functions, warp voting functions, half-precision arithmetic, cooperative groups, etc.), ensure that you are explicitly passing compiler arguments that select an architecture supporting those features.
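
As an illustration of the first suggestion, here is a minimal sketch of checking every runtime API result and the most recent kernel launch. The macro name CUDA_CHECK and the kernel add_one are hypothetical names chosen for this example; they are not part of the CUDA API, and real projects often use their own variants.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not provided by CUDA): print the error with its
// source location and abort if a runtime call does not return cudaSuccess.
#define CUDA_CHECK(call)                                                 \
    do {                                                                 \
        cudaError_t err_ = (call);                                       \
        if (err_ != cudaSuccess) {                                       \
            std::fprintf(stderr, "CUDA error %s at %s:%d: %s\n",         \
                         cudaGetErrorName(err_), __FILE__, __LINE__,     \
                         cudaGetErrorString(err_));                      \
            std::exit(EXIT_FAILURE);                                     \
        }                                                                \
    } while (0)

// Illustrative kernel: adds 1 to each element, with a bounds check so
// threads past the end of the array do not touch unallocated memory.
__global__ void add_one(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] += 1;
    }
}

int main()
{
    const int n = 1 << 20;
    int *d_data = nullptr;

    CUDA_CHECK(cudaMalloc(&d_data, n * sizeof(int)));
    CUDA_CHECK(cudaMemset(d_data, 0, n * sizeof(int)));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    add_one<<<blocks, threads>>>(d_data, n);

    // A kernel launch returns no status itself, so query it explicitly.
    CUDA_CHECK(cudaGetLastError());       // errors detected at launch time
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran

    int first = -1;
    CUDA_CHECK(cudaMemcpy(&first, d_data, sizeof(int), cudaMemcpyDeviceToHost));
    std::printf("data[0] = %d\n", first);

    CUDA_CHECK(cudaFree(d_data));
    return 0;
}
```

Assuming the file is saved as check.cu (a made-up name) and the GPU supports compute capability 7.0, it could be built with something like `nvcc -arch=sm_70 check.cu -o check` and run under the sanitizer with `compute-sanitizer ./check`; adjust the architecture flag to match your device.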

14278 questions
5 votes, 1 answer

Random generator & CUDA

I have a question regarding the random generators in CUDA. I am using cuRAND to generate random numbers with the following code: __device__ float priceValue(int threadid){ unsigned int seed = threadid ; curandState s; curand_init…
ALFRAM
5 votes, 2 answers

Generating AES (AES-256) Lookup Tables

I am trying to implement AES-256 in CTR mode using nVidia CUDA. I have successfully coded CPU code for key expansion and now I need to implement the actual AES-256 algorithm. According to Wikipedia, some codes I've seen and particularly this PDF…
Momonga
5 votes, 1 answer

How to make CUDA dll that can be used in C# application?

It would be good if you could give me a brief tutorial instead of a few words. My CUDA application is working as I wanted. Now, the problem is how to export CUDA code to C# as I would like to make front end and everything else in C#. From this…
Antun Tun
5 votes, 2 answers

Let nvidia K20c use old stream management way?

Starting with the K20, different streams are fully concurrent (they used to be concurrent on the edge). However, my program needs the old behavior; otherwise I need to do a lot of synchronization to solve the dependency problem. Is it possible to switch stream management to the…
worldterminator
5 votes, 1 answer

cudaMallocHost / cudaHostAlloc on multi GPU

In CUDA docs, specifically in CUDA Runtime API in section Device Management about cudaSetDevice, it is written like this Any host memory allocated from this host thread using cudaMallocHost() or cudaHostAlloc() or cudaHostRegister() will have its…
xnov
5 votes, 1 answer

Histogram calculation with Thrust

If i is a random walk like below (each index is not unique), and there is a device vector A filled with zeros. {0, 1, 0, 2, 3, 3, ....} Is it possible that thrust can make A[i] auto-increment, after the operation A may look like //2 means…
user1995868
5 votes, 1 answer

Cuda Render Buffer Interop for depth component

What I am trying to do is to use OpenGL to perform some rendering, then use CUDA to perform some read-only post-processing (computations) directly on the rendered RGB and depth components, without copying the data to a PBO. To do this, I create a…
BenP
5 votes, 1 answer

How to use template functions and CUDA

So I have the following code: File: Cuda.cu template <typename T> __global__ void xpy( int n, T *x, T *y, T *r ) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) r[i] = x[i] + y[i]; } mtx_mtx_add( float *a1, float *a2, float *r,…
aCuria
5 votes, 1 answer

CUDA 5 and Visual Studio 2010 intellisense error

I have installed CUDA 5 toolkit (32 and 64 bit as that seemed to work) and have made a CUDA runtime project in VS 2010, it compiles fine and runs but I get a red line under the call to the CUDA function. It isn't a massive deal but it is a little…
Kevin Orriss
5 votes, 1 answer

CUDA installation on MAC OS X without GPU (for cuda emulator)

I'm installing CUDA on Mac OS X by following the link below: http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-mac-os-x/index.html It says that I must have a CUDA-enabled GPU before installing. I don't have a GPU in my MacBook Pro, and I want…
aqavi_paracha
5 votes, 3 answers

Scaling in inverse FFT by cuFFT

Whenever I'm plotting the values obtained by a programme using the cuFFT and comparing the results with that of Matlab, I'm getting the same shape of graphs and the values of maxima and minima are getting at the same points. However, the values…
Ani
5 votes, 3 answers

How to set CUDA compiler flags in Visual Studio 2010?

After persistently getting error : identifier "atomicAdd" is undefined, I've found the solution to be to compile with -arch sm_20 flag. But how to pass this compiler flag in VS 2010? I have tried like so under Project > Properties: But this…
mchen
5 votes, 1 answer

CUDA link error: unresolved external but external is specified in *.cu file

Using Cuda 5.0, VS2010 This project compiles and links fine in VS2012 but VS2012 does not support Nsight debug so i am also developing in VS2010. So I have a VS2010 project file but am using identical source codes files (.h, .cpp, .cu, .cuh. VS2010…
JPM
5 votes, 1 answer

How to initialise CUDA Thrust vector without implicitly invoking 'copy'?

I have a pointer int *h_a which references a large number N of data points (on host) that I want to copy to device. So I do: thrust::host_vector<int> ht_a(h_a, h_a + N); thrust::device_vector<int> dt_a = ht_a; However, creating ht_a seems to…
mchen
5 votes, 2 answers

Can you use Amazon EC2 GPU instances for real-time rendering?

I need a remote PC/server which has a decent 3D card in it, to perform real-time 3D rendering... imagine running a 3D game on a remote server and that's a good comparison. Most VPS and dedicated servers do not have good graphics capabilities for…
Mr. Boy