Questions tagged [cuda]

CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model for NVIDIA GPUs (Graphics Processing Units). CUDA provides an interface to NVIDIA GPUs through a variety of programming languages, libraries, and APIs.

Before posting CUDA questions, please read "How to get useful answers to your CUDA questions" below.

CUDA has an online documentation repository, updated with each release, including references for APIs and libraries; user guides for applications; and a detailed CUDA C/C++ Programming Guide.

The CUDA platform enables application development using several languages and associated APIs, including CUDA C/C++ (compiled with nvcc), CUDA Fortran (via the NVIDIA HPC SDK compilers, formerly PGI), and Python (via libraries such as Numba, PyCUDA, and CuPy).

There also exist third-party bindings for using CUDA in other languages and programming environments, such as Managed CUDA for .NET languages (including C#).

You should ask questions about CUDA here on Stack Overflow. If you have bugs to report, discuss them on the NVIDIA CUDA forums or report them via the registered developer portal; you may want to cross-link any such discussion with your question here on SO.

The CUDA execution model is not multithreading in the usual sense, so please do not tag CUDA questions with [multithreading] unless your question involves thread safety of the CUDA APIs, or the use of both normal CPU multithreading and CUDA together.

How to get useful answers to your CUDA questions

Here are a number of suggestions for users new to CUDA. Follow these suggestions before asking your question and you are much more likely to get a satisfactory answer!

  • Always check the result codes returned by CUDA API functions to ensure you are getting cudaSuccess. If you are not, and you don't know why, include the information about the error in your question. This includes checking for errors caused by the most recent kernel launch, which may not be reported until after a subsequent synchronizing call such as cudaDeviceSynchronize() or cudaStreamSynchronize(). More on checking for errors in CUDA in this question.
  • If you are getting unspecified launch failure, it is possible that your code is causing a segmentation fault, meaning the code is accessing memory that has not been allocated for it to use. Try to verify that your indexing is correct, and check whether the CUDA Compute Sanitizer (or the legacy cuda-memcheck tool, which supports older GPUs and was removed in CUDA 12) reports any errors. Note that both tools encompass more than the default Memcheck; the other tools (Racecheck, Initcheck, Synccheck) must be selected explicitly.
  • The debugger for CUDA, cuda-gdb, is also very useful when you are not really sure what is going wrong. You can monitor resources at the warp, thread, block, SM, and grid level, and you can follow your program's execution. If a segmentation fault occurs in your program, cuda-gdb can help you find where the crash occurred and inspect the context. If you prefer a GUI for debugging, there are IDE plugins/editions for/of Visual Studio (Windows), Visual Studio Code (Windows/Mac/Linux, though the GPU being debugged must be on a Linux system), and Eclipse (Linux).
  • If you are finding that you are getting syntax errors on CUDA keywords when compiling device code, make sure you are compiling with nvcc (or clang with CUDA support enabled) and that your source file has the expected .cu extension. If CUDA device functions or feature namespaces you expect to work are not found (atomic functions, warp vote functions, half-precision arithmetic, cooperative groups, etc.), ensure that you are explicitly passing a compilation argument (e.g. nvcc's -arch option) that selects a GPU architecture which supports those features.
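The error-checking advice above can be sketched as a small macro. This is only an illustrative pattern, not a canonical API: the CUDA_CHECK name and the trivial kernel are made up for the example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative error-checking macro: wraps a CUDA runtime call and aborts
// with file/line information if it did not return cudaSuccess.
#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                \
                    cudaGetErrorString(err_), __FILE__, __LINE__);      \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

__global__ void kernel(int *out) { *out = 42; }

int main() {
    int *d_out = nullptr;
    CUDA_CHECK(cudaMalloc(&d_out, sizeof(int)));
    kernel<<<1, 1>>>(d_out);
    CUDA_CHECK(cudaGetLastError());       // catches launch-configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());  // surfaces errors raised during execution
    CUDA_CHECK(cudaFree(d_out));
    return 0;
}
```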

Books

14278 questions
5 votes · 1 answer

3D Convolution with CUDA using shared memory

I'm currently trying to adapt the 2D convolution code from THIS question to 3D and having trouble trying to understand where my error is. My 2D Code looks like this: #include #define MASK_WIDTH 3 #define MASK_RADIUS …
Schnigges · 1,284
5 votes · 2 answers

Displaying CUDA-processed images in WPF

I have a WPF application that acquires images from a camera, processes these images, and displays them. The processing part has become burdensome for the CPU, so I've looked at moving this processing to the GPU and running custom CUDA kernels…
Bryan Greenway · 703
5 votes · 5 answers

Operations on arbitrary value types

This article describes a way, in C#, to allow the addition of arbitrary value types which have a + operator defined for them. In essence it allows the following code: public T Add(T val1, T val2) { return val1 + val2; } This code does not…
Morten Christiansen · 19,002
5 votes · 1 answer

CUDA Dynamic Parallelism MakeFile

This is my first program using Dynamic Parallelism and I am unable to compile the code. I need to be able to run this for my research project at college and any help will be most appreciated: I get the following…
5 votes · 1 answer

CUDA: bank conflicts between different warps?

I just learned (from Why only one of the warps is executed by a SM in cuda?) that Kepler GPUs can actually execute instructions from several (apparently 4) warps at once. Can a shared memory bank also serve four requests at once? If not, that would…
5 votes · 2 answers

thrust reduction result on device memory

Is it possible to leave the return value of a thrust::reduce operation in device-allocated memory? In case it is, is it just as easy as assigning the value to a cudaMalloc'ed area, or should I use a thrust::device_ptr?
Orgrim · 357
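For the thrust::reduce question above: thrust::reduce is defined to return its result to the host. If the result must stay in device memory, one common approach (not the only one) is CUB's DeviceReduce, which ships with the CUDA Toolkit. A sketch, with reduce_on_device as an illustrative wrapper name and error checking omitted for brevity:

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Reduce d_in (n floats) into *d_out, leaving the result in device memory
// instead of copying it back to the host as thrust::reduce would.
void reduce_on_device(const float *d_in, float *d_out, int n) {
    void *d_temp = nullptr;
    size_t temp_bytes = 0;
    // First call only queries the required temporary-storage size.
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n);
    cudaMalloc(&d_temp, temp_bytes);
    // Second call performs the actual reduction.
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n);
    cudaFree(d_temp);
}
```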
5 votes · 2 answers

How to use thrust min_element algorithm without memcpys between device and host

I am optimising a pycuda / thrust program. In it, I use thrust::min_element to identify the index of the minimum element in an array that is on the device. Using Nvidia's visual profiler, it appears that whenever I call thrust::min_element, there…
weemattisnot · 889
5 votes · 1 answer

CUDA same function for CPU and GPU

In order to call the same function from host code and GPU kernel, Do I have to keep the two copies of the same function as below: int sum(int a, int b){ return a+b; } __device int sumGPU(int a, int b){ return a+b; } Or is there any technique to…
Imran · 642
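For the question above about sharing one function between host and device: in CUDA C++ a single definition can carry both the __host__ and __device__ specifiers, so no duplicate copy is needed. A minimal sketch:

```cuda
#include <cstdio>

// One definition, callable from both host code and device code.
__host__ __device__ int sum(int a, int b) {
    return a + b;
}

__global__ void addOnDevice(int *out) {
    *out = sum(2, 3);  // device-side call
}

int main() {
    printf("host sum: %d\n", sum(2, 3));  // host-side call of the same function
    return 0;
}
```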
5 votes · 1 answer

How to integrate CUDA .cu code with C++ app

This post closely resembles my earlier post: How to separate CUDA code into multiple files I am afraid I made such a blunder of what I was actually asking that it will be too confusing to try and correct it there. I am basing this code loosely off…
Mr Bell · 9,228
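A common pattern for the integration question above is to keep kernels in a .cu file compiled by nvcc and expose a plain C/C++ wrapper that host code compiled by g++/cl can call. The function and file names below are illustrative:

```cuda
// kernels.cu -- compiled with nvcc.
#include <cuda_runtime.h>

__global__ void scale(float *data, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= s;
}

// Plain wrapper callable from ordinary C++ translation units.
extern "C" void scale_on_gpu(float *d_data, float s, int n) {
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_data, s, n);
}

// app.cpp (compiled with the host compiler) only needs the declaration:
//   extern "C" void scale_on_gpu(float *d_data, float s, int n);
// Then link the objects together, e.g.:
//   nvcc -c kernels.cu && g++ -c app.cpp && g++ app.o kernels.o -lcudart
```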
5 votes · 1 answer

External calls are not supported - CUDA

Objective is to call a device function available in another file, when i compile the global kernel it shows the following error *External calls are not supported (found non-inlined call to _Z6GoldenSectionCUDA)*. Problematic Code (not the full code…
Itachi · 1,383
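The "External calls are not supported" error above typically means a __device__ function is defined in a different translation unit without separate compilation enabled; since CUDA 5.0 this is supported via relocatable device code (nvcc -rdc=true). An illustrative sketch (the function body is made up):

```cuda
// golden.cu -- defines the device function in its own file.
__device__ double GoldenSectionCUDA(double a, double b) {
    return (a + b) * 0.5;  // illustrative body
}

// main.cu -- declares the function and calls it from a kernel.
extern __device__ double GoldenSectionCUDA(double a, double b);

__global__ void kernel(double *out) {
    *out = GoldenSectionCUDA(0.0, 1.0);
}

// Build with relocatable device code so the cross-file device call links:
//   nvcc -rdc=true golden.cu main.cu -o app
```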
5 votes · 1 answer

CUDA atomic operations and concurrent kernel launch

Currently I develop a GPU-based program that use multiple kernels that are launched concurrently by using multiple streams. In my application, multiple kernels need to access a queue/stack and I have plan to use atomic operations. But I do not know…
5 votes · 1 answer

How does warp work with atomic operation?

The threads in a warp run physically parallel, so if one of them (called, thread X) start an atomic operation, what other will do? Wait? Is it mean, all threads will be waiting while thread X is pushed to the atomic-queue, get the access (mutex) and…
Nexen · 1,663
5 votes · 1 answer

Mixing C++ flavours in the same project

Is it safe to mix C++98 and C++11 in the same project? By "mixing" I mean not only linking object files but also common header files included in the source code compiled with C++98 and C++11. The background for the question is the desire to…
Michael · 5,775
5 votes · 3 answers

Compiling CUDA examples gives build error

I am running Windows 7 64bit, with Visual Studio 2008. I installed the CUDA drivers and SDK. The SDK comes with quite a few examples including compiled executables and source code. The compiled executables run wonderfully. When I open the vc90…
Mr Bell · 9,228
5 votes · 1 answer

Difference between memcpy_htod and to_gpu in Pycuda?

I am learning PyCUDA, and while going through the documentation on pycuda.gpuarray, I am puzzled by the difference between pycuda.driver.memcpy_htod (also _dtoh) and pycuda.gpuarray.to_gpu (also get) functions. According to gpuarray documentation,…
Pippi · 2,451