Questions tagged [cuda]

CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model for NVIDIA GPUs (Graphics Processing Units). CUDA provides an interface to NVIDIA GPUs through a variety of programming languages, libraries, and APIs.

Before posting CUDA questions, please read "How to get useful answers to your CUDA questions" below.

CUDA has an online documentation repository, updated with each release, including references for APIs and libraries; user guides for applications; and a detailed CUDA C/C++ Programming Guide.

The CUDA platform enables application development using several languages and associated APIs, including:

  • CUDA C/C++, compiled with nvcc
  • CUDA Fortran, compiled with the NVIDIA HPC (formerly PGI) compilers
  • Python, through libraries such as Numba, CuPy, and PyCUDA

There also exist third-party bindings for using CUDA in other languages and programming environments, such as Managed CUDA for .NET languages (including C#).

You should ask questions about CUDA here on Stack Overflow, but if you have bugs to report you should discuss them on the CUDA forums or report them via the registered developer portal. You may want to cross-link to any discussion here on SO.

The CUDA execution model is not multithreading in the usual sense, so please do not tag CUDA questions with [multithreading] unless your question involves thread safety of the CUDA APIs, or the use of both normal CPU multithreading and CUDA together.

How to get useful answers to your CUDA questions

Here are a number of suggestions for users new to CUDA. Follow these suggestions before asking your question and you are much more likely to get a satisfactory answer!

  • Always check the result codes returned by CUDA API functions to ensure you are getting cudaSuccess. If you are not, and you don't know why, include the information about the error in your question. This includes checking for errors caused by the most recent kernel launch, which may not be available until you have called cudaDeviceSynchronize() or cudaStreamSynchronize(). More on checking for errors in CUDA in this question; a minimal error-checking sketch also follows this list.
  • If you are getting unspecified launch failure, it is possible that your code is causing a segmentation fault, meaning the code is accessing memory it has not been allocated. Try to verify that your indexing is correct, and check whether the CUDA Compute Sanitizer (or the legacy cuda-memcheck tool for older GPUs, available until CUDA 12) reports any errors. Note that both tools encompass more than the default Memcheck; the other tools (Racecheck, Initcheck, Synccheck) must be selected explicitly.
  • The debugger for CUDA, cuda-gdb, is also very useful when you are not really sure what you are doing. You can monitor resources at the warp, thread, block, SM, and grid level, and you can follow your program's execution. If a segmentation fault occurs in your program, cuda-gdb can help you find where the crash occurred and what the context is. If you prefer a GUI for debugging, there are IDE plugins/editions for Visual Studio (Windows), Visual Studio Code (Windows/Mac/Linux, but the GPU used for debugging must be on a Linux system) and Eclipse (Linux).
  • If you are finding that you are getting syntax errors on CUDA keywords when compiling device code, make sure you are compiling with nvcc (or clang with CUDA support enabled) and that your source file has the expected .cu extension. If CUDA device functions or feature namespaces you expect to work are not found (atomic functions, warp voting functions, half-precision arithmetic, cooperative groups, etc.), ensure that you are explicitly passing compilation arguments that select an architecture supporting those features (e.g. nvcc -arch=sm_70).
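As mentioned in the first suggestion above, here is a minimal error-checking sketch. The CUDA_CHECK macro name and the trivial kernel are illustrative conventions, not an official API:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call so failures are reported with file/line context.
#define CUDA_CHECK(call)                                                      \
    do {                                                                      \
        cudaError_t err = (call);                                             \
        if (err != cudaSuccess) {                                             \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                       \
                    cudaGetErrorString(err), __FILE__, __LINE__);             \
            exit(EXIT_FAILURE);                                               \
        }                                                                     \
    } while (0)

__global__ void dummyKernel(int *out) { out[threadIdx.x] = threadIdx.x; }

int main() {
    int *d_out = nullptr;
    CUDA_CHECK(cudaMalloc(&d_out, 32 * sizeof(int)));
    dummyKernel<<<1, 32>>>(d_out);
    CUDA_CHECK(cudaGetLastError());       // catches launch-configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());  // catches errors raised during kernel execution
    CUDA_CHECK(cudaFree(d_out));
    return 0;
}
```

Checking cudaGetLastError() immediately after the launch catches configuration problems, while the status returned by cudaDeviceSynchronize() reflects anything that went wrong during kernel execution.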

Books

14278 questions
5 votes, 1 answer

removing elements from a device_vector

thrust::device_vector values; thrust::device_vector keys; After initialization, keys contains some elements equal to -1. I want to delete those elements from keys, and the elements at the same positions in values, but I do not know how to do this in parallel.
GaoYuan
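A hedged sketch of one common way to approach the question above: compact keys and values in lockstep with thrust::remove_if over a zip iterator. The element types, sizes, and the -1 sentinel predicate are assumptions for illustration:

```cuda
#include <thrust/device_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/remove.h>
#include <thrust/tuple.h>

// Predicate that flags a (key, value) pair for removal when the key is -1.
struct key_is_minus_one {
    __host__ __device__
    bool operator()(const thrust::tuple<int, float> &t) const {
        return thrust::get<0>(t) == -1;
    }
};

int main() {
    thrust::device_vector<int>   keys(5);
    thrust::device_vector<float> values(5);
    // ... fill keys and values; some keys are -1 ...

    auto first = thrust::make_zip_iterator(thrust::make_tuple(keys.begin(), values.begin()));
    auto last  = thrust::make_zip_iterator(thrust::make_tuple(keys.end(),   values.end()));

    // remove_if compacts both sequences together and returns the new logical end.
    auto new_end = thrust::remove_if(first, last, key_is_minus_one());
    size_t new_size = new_end - first;

    keys.resize(new_size);
    values.resize(new_size);
    return 0;
}
```

remove_if only moves the surviving pairs to the front of the sequences; the resize calls discard the trailing, now-unspecified elements.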
5 votes, 1 answer

setting up a CUDA 2D "unsigned char" texture for linear interpolation

I have a linear array of unsigned chars representing a 2D array. I would like to place it into a CUDA 2D texture and perform (floating point) linear interpolation on it, i.e., have the texture call fetch the 4 nearest unsigned char neighbors,…
Jammy
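A hedged sketch of the texture-object setup the question above is asking about, with made-up dimensions and names. The essential settings are cudaFilterModeLinear together with cudaReadModeNormalizedFloat, so tex2D<float>() returns hardware-interpolated values scaled to [0,1]:

```cuda
#include <cuda_runtime.h>

__global__ void sampleKernel(cudaTextureObject_t tex, float *out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h) {
        // +0.5f samples at texel centers; non-integer coordinates interpolate linearly.
        out[y * w + x] = tex2D<float>(tex, x + 0.5f, y + 0.5f);
    }
}

int main() {
    const int w = 64, h = 64;
    unsigned char hostData[w * h] = {0};  // linear array interpreted as a 2D image

    // Copy the data into a CUDA array with an 8-bit single-channel format.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<unsigned char>();
    cudaArray_t cuArray;
    cudaMallocArray(&cuArray, &desc, w, h);
    cudaMemcpy2DToArray(cuArray, 0, 0, hostData, w * sizeof(unsigned char),
                        w * sizeof(unsigned char), h, cudaMemcpyHostToDevice);

    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = cuArray;

    cudaTextureDesc texDesc = {};
    texDesc.addressMode[0]   = cudaAddressModeClamp;
    texDesc.addressMode[1]   = cudaAddressModeClamp;
    texDesc.filterMode       = cudaFilterModeLinear;         // enable hardware interpolation
    texDesc.readMode         = cudaReadModeNormalizedFloat;  // required to filter integer data
    texDesc.normalizedCoords = 0;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);

    float *d_out;
    cudaMalloc(&d_out, w * h * sizeof(float));
    dim3 block(16, 16), grid((w + 15) / 16, (h + 15) / 16);
    sampleKernel<<<grid, block>>>(tex, d_out, w, h);
    cudaDeviceSynchronize();

    cudaDestroyTextureObject(tex);
    cudaFreeArray(cuArray);
    cudaFree(d_out);
    return 0;
}
```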
5 votes, 1 answer

64 bit number support in CUDA

I kind of found various opinions on this topic, so this is why I decided to ask here. My question is: starting from which compute capability is int64_t supported in CUDA? I am running CUDA 5 on a Quadro 770M, and the following code works without a…
Zahari
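For context on the question above, a minimal sketch (names illustrative) showing 64-bit integer arithmetic in device code; it compiles for any compute capability, with the operations generally emulated through multiple 32-bit instructions on the hardware:

```cuda
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// 64-bit integer addition in device code.
__global__ void addKernel(int64_t *out, int64_t a, int64_t b) {
    *out = a + b;
}

int main() {
    int64_t *d_out, h_out = 0;
    cudaMalloc(&d_out, sizeof(int64_t));
    addKernel<<<1, 1>>>(d_out, 3000000000LL, 4000000000LL);
    cudaMemcpy(&h_out, d_out, sizeof(int64_t), cudaMemcpyDeviceToHost);
    printf("%lld\n", (long long)h_out);  // prints 7000000000, beyond 32-bit range
    cudaFree(d_out);
    return 0;
}
```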
5 votes, 1 answer

Surface reference faster than Surface object

I recently replaced the surface reference in my algorithm with a surface object. Then I noticed that the program runs slower. Here is a comparison for a simple example where I fill a 3D float array [400*400*400] with a constant value. Surface…
Arnaud
5 votes, 1 answer

Strange behavior when detecting global memory

After reading this question: "How to differentiate between pointers to shared and global memory?", I decided to try isspacep.local, isspacep.global and isspacep.shared in a simple test program. The tests for local and shared memory work all the…
BenC
5 votes, 1 answer

How to differentiate between pointers to shared and global memory?

In CUDA, given the value of a pointer, or the address of a variable, is there an intrinsic or another API which will introspect which address space the pointer refers to?
Jared Hoberock
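Current CUDA toolkits expose address-space predicate intrinsics for exactly this; a hedged sketch (the kernel and variable names are made up):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__ int d_globalVar;

__global__ void classifyPointers() {
    __shared__ int sharedVar;
    int localVar;

    // Address-space predicate intrinsics return nonzero when the pointer
    // refers to the corresponding state space.
    printf("globalVar: isGlobal=%u isShared=%u\n",
           __isGlobal(&d_globalVar), __isShared(&d_globalVar));
    printf("sharedVar: isGlobal=%u isShared=%u\n",
           __isGlobal(&sharedVar), __isShared(&sharedVar));
    printf("localVar:  isGlobal=%u isShared=%u\n",
           __isGlobal(&localVar), __isShared(&localVar));
}

int main() {
    classifyPointers<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

__isLocal() and __isConstant() exist as well; on older toolkits the same information could only be reached through the isspacep PTX instructions mentioned in the related question above.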
5 votes, 2 answers

cannot find -lcuda when linking with g++

I'm trying to link these object files with the command: g++ NT_FFT_Decomp.o T_FFT_Decomp.o SNT_FFT_Comp.o ST_FFT_Comp.o VNT_FFT_Comp.o VT_FFT_Comp.o CUDA_FFT_Comp.o Globals.o main.o \ -L/media/wiso/Programs/Setups/CUDA/include -lcuda -lcudart…
mewais
5 votes, 1 answer

Invalid device symbol when copying to CUDA constant memory

I have several files for an app in image processing. As the number of rows and columns of an image does not change while running an image processing algorithm, I was trying to put those values in constant memory. My app looks…
BRabbit27
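A hedged sketch of the usual constant-memory pattern (illustrative names); invalid device symbol typically means something other than the __constant__ variable itself was passed to cudaMemcpyToSymbol, for example a quoted name string, or a symbol living in a different translation unit without device linking:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Constant-memory variables must be at file scope; pass the variable itself
// (not a quoted name) to cudaMemcpyToSymbol.
__constant__ int c_rows;
__constant__ int c_cols;

__global__ void useDims(int *out) {
    out[0] = c_rows * c_cols;
}

int main() {
    int rows = 480, cols = 640;
    cudaError_t err;
    err = cudaMemcpyToSymbol(c_rows, &rows, sizeof(int));
    if (err != cudaSuccess) printf("c_rows: %s\n", cudaGetErrorString(err));
    err = cudaMemcpyToSymbol(c_cols, &cols, sizeof(int));
    if (err != cudaSuccess) printf("c_cols: %s\n", cudaGetErrorString(err));

    int *d_out;
    cudaMalloc(&d_out, sizeof(int));
    useDims<<<1, 1>>>(d_out);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```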
5 votes, 1 answer

CUDA pow function with integer arguments

I'm new to CUDA and cannot understand what I'm doing wrong. I'm trying to calculate the distance of each object (its id is in one array, its x coordinate in another, and its y coordinate in another) in order to find neighbors for each object: __global__ void dist(int *id_d, int *x_d, int…
Alamin
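A hedged sketch related to the question above (all names illustrative): for squared distances, plain integer multiplication avoids pow() with integer arguments altogether; if a floating-point power is genuinely needed, cast the arguments explicitly, e.g. powf((float)dx, 2.0f):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// For a squared distance there is no need for pow(); multiplication is exact
// for integers and avoids the floating-point pow(int, int) overloads.
__global__ void dist2(const int *x, const int *y, int *out, int n, int qx, int qy) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int dx = x[i] - qx;
        int dy = y[i] - qy;
        out[i] = dx * dx + dy * dy;   // instead of pow(dx, 2) + pow(dy, 2)
    }
}

int main() {
    const int n = 4;
    int hx[n] = {0, 1, 2, 3}, hy[n] = {0, 1, 2, 3}, hout[n];
    int *dx, *dy, *dout;
    cudaMalloc(&dx, n * sizeof(int));
    cudaMalloc(&dy, n * sizeof(int));
    cudaMalloc(&dout, n * sizeof(int));
    cudaMemcpy(dx, hx, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, n * sizeof(int), cudaMemcpyHostToDevice);
    dist2<<<1, n>>>(dx, dy, dout, n, 0, 0);
    cudaMemcpy(hout, dout, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) printf("%d ", hout[i]);  // squared distances to (0,0)
    printf("\n");
    cudaFree(dx); cudaFree(dy); cudaFree(dout);
    return 0;
}
```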
5 votes, 3 answers

Remote debugging and profiling of CUDA program running on Linux server

This is my scenario: I write my CUDA application on a Windows machine, and I compile and run it on a remote Linux (Debian) server (without graphical output) using PuTTY. I want to ask what the best way is to debug and profile my…
stuhlo
5 votes, 1 answer

Strange result of SURF_GPU and BruteForceMatcher_GPU with knnMatch

OpenCV 2.4.5, CUDA 5.0. I tried to transfer my SURF matcher from the CPU to the GPU and got a strange result. I use knnMatch and findHomography + perspectiveTransform together with my own function, which checks the corners of the bounding box for…
iGriffer
5 votes, 2 answers

The behavior of __CUDA_ARCH__ macro

In host code, it seems that the __CUDA_ARCH__ macro won't generate different code paths; instead, it generates code for exactly the code path for the current device. However, if __CUDA_ARCH__ is used within device code, it will generate different…
user0002128
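A hedged sketch (names illustrative) of how __CUDA_ARCH__ behaves: it is defined only during the device compilation passes, one per target architecture, and is undefined during the host pass, so host code cannot use it to branch on the GPU that happens to be present at runtime:

```cuda
#include <cstdio>

__host__ __device__ int pathTag() {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 700
    return 700;      // device-code path compiled for sm_70 and newer targets
#elif defined(__CUDA_ARCH__)
    return 1;        // device-code path compiled for older targets
#else
    return 0;        // host compilation pass: __CUDA_ARCH__ is not defined
#endif
}

__global__ void report(int *out) { *out = pathTag(); }

int main() {
    // The branch taken in device code is fixed at compile time for each target
    // architecture; the runtime then selects the embedded code that matches the
    // GPU, if such code was generated.
    printf("host pathTag() = %d\n", pathTag());
    return 0;
}
```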
5 votes, 1 answer

XCode and CUDA integration

Was just wondering if anyone has any experience working with CUDA and XCode? I'm having a nightmare setting it all up... Dawson
Ljdawson
5 votes, 2 answers

How to display pixel arrays in GPU global memory onto screen directly?

I'm writing a path tracer on the GPU, and I have some traced pixel data (an array of float3) in GPU global memory. What I do to display the array on screen is to copy the array to CPU memory and call OpenGL glTexImage2D: glTexImage2D…
Tony
5 votes, 1 answer

CUDA Kernels Randomly Fail, but only when I use certain transcendental functions

I've been working on a CUDA program that randomly crashes with an unspecified launch failure, fairly frequently. Through careful debugging, I localized which kernel was failing, and furthermore that the failure occurred only if certain…
njohn5188