Questions tagged [cuda]

CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model for NVIDIA GPUs (Graphics Processing Units). CUDA provides an interface to NVIDIA GPUs through a variety of programming languages, libraries, and APIs.

Before posting CUDA questions, please read "How to get useful answers to your CUDA questions" below.

CUDA has an online documentation repository, updated with each release, including references for APIs and libraries; user guides for applications; and a detailed CUDA C/C++ Programming Guide.

The CUDA platform enables application development using several languages and associated APIs, including CUDA C and C++ (compiled with nvcc), CUDA Fortran, and Python (via libraries such as Numba and CuPy).

There also exist third-party bindings for using CUDA in other languages and programming environments, such as Managed CUDA for .NET languages (including C#).
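For reference, here is a minimal CUDA C++ example (a sketch only, not tied to any question below; note the .cu file extension and compilation with nvcc, as discussed in the suggestions further down):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread adds one element; the grid is sized to cover all n elements.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    // Managed (unified) memory keeps the example short;
    // explicit cudaMalloc + cudaMemcpy works equally well.
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaError_t err = cudaDeviceSynchronize();  // also surfaces kernel errors
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("c[0] = %f\n", c[0]);  // 1.0f + 2.0f
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Compile with `nvcc vecadd.cu -o vecadd`; running it requires an NVIDIA GPU and driver.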

You should ask questions about CUDA here on Stack Overflow. If you have bugs to report, discuss them on the CUDA forums or report them via the registered developer portal; you may want to cross-link any such discussion to a question here on SO.

The CUDA execution model is not multithreading in the usual sense, so please do not tag CUDA questions with [multithreading] unless your question involves thread safety of the CUDA APIs, or the use of both normal CPU multithreading and CUDA together.

How to get useful answers to your CUDA questions

Here are a number of suggestions for users new to CUDA. Follow them before asking your question and you are much more likely to get a satisfactory answer!

  • Always check the result codes returned by CUDA API functions to ensure you are getting cudaSuccess. If you are not, and you don't know why, include the information about the error in your question. This includes checking for errors caused by the most recent kernel launch, which may not be reported until you call cudaDeviceSynchronize() or cudaStreamSynchronize(). More on checking for errors in CUDA in this question.
  • If you are getting unspecified launch failure, it is possible that your code is causing a segmentation fault, meaning the code is accessing memory it has not been allocated. Try to verify that your indexing is correct, and check whether the CUDA Compute Sanitizer (or the legacy cuda-memcheck, for older GPUs up to CUDA 12) reports any errors. Note that both tools encompass more than the default Memcheck; the other tools (Racecheck, Initcheck, Synccheck) must be selected explicitly.
  • The debugger for CUDA, cuda-gdb, is also very useful when you are not sure what is going wrong. You can monitor resources at the warp, thread, block, SM and grid level, and follow your program's execution. If a segmentation fault occurs in your program, cuda-gdb can help you find where the crash occurred and inspect the context. If you prefer a GUI for debugging, there are IDE integrations for Visual Studio (Windows), Visual Studio Code (Windows/Mac/Linux, but the GPU being debugged must be on a Linux system) and Eclipse (Linux).
  • If you are finding that you are getting syntax errors on CUDA keywords when compiling device code, make sure you are compiling using nvcc (or clang with CUDA support enabled) and that your source file has the expected .cu extension. If you find that CUDA device functions or feature namespaces you expect to work are not found (atomic functions, warp voting functions, half-precision arithmetic, cooperative groups, etc.), ensure that you are explicitly passing compilation arguments (e.g. nvcc's -arch or -gencode options) that select an architecture supporting those features.
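The error-checking advice above is often condensed into a small helper macro. The following is one common pattern (the macro name is illustrative, not part of the CUDA API):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call; print file/line and abort on any failure.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc(&ptr, bytes));
//   myKernel<<<grid, block>>>(args);
//   CUDA_CHECK(cudaGetLastError());       // launch errors (bad config, etc.)
//   CUDA_CHECK(cudaDeviceSynchronize());  // errors raised during execution
```

Running the binary under `compute-sanitizer ./myapp` (or `cuda-memcheck ./myapp` with older toolkits) will then pinpoint out-of-bounds accesses that would otherwise surface only as an unspecified launch failure.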

14278 questions
5 votes, 1 answer

Miscellaneous and Inter-Thread Communication Instructions in CUDA

I've been playing around with the NVIDIA profiler (nvprof) and there are two particular metrics which I do not understand: inst_inter_thread_communication Number of inter-thread communication instructions executed by non-predicated…
squirem
5 votes, 1 answer

Is CUDA warp scheduling deterministic?

I am wondering if the warp scheduling order of a CUDA application is deterministic. Specifically I am wondering if the ordering of warp execution will stay the same with multiple runs of the same kernel with the same input data on the same device.…
NothingMore
5 votes, 1 answer

Separating even and odd numbers in CUDA

I have an array of numbers as {1,2,3,4,5,6,7,8,9,10} and I want to separate even and odd numbers as: even = {2,4,6,8} and: odd = {1,3,5,7} I am aware of atomic operations in CUDA, and also aware that the output is not expected to suffer from race…
Laxmi Kadariya
5 votes, 1 answer

How to properly link cuda header file with device functions?

I'm trying to decouple my code a bit and something fails. Compilation error: error: calling a __host__ function("DecoupledCallGpu") from a __global__ function("kernel") is not allowed Code excerpt: main.c (has a call to cuda host…
Denys S.
5 votes, 2 answers

How to measure overhead of a kernel launch in CUDA

I want to measure the overhead of a kernel launch in CUDA. I understand that there are various parameters which affect this overhead. I am interested in the following: number of threads created size of data being copied I am doing this mainly to…
pranith
5 votes, 3 answers

GPU programming - transfer bottlenecks

As I would like my GPU to do some of calculation for me, I am interested in the topic of measuring a speed of 'texture' upload and download - because my 'textures' are the data that GPU should crunch. I know that transfer from main memory to GPU…
Daniel Mošmondor
5 votes, 1 answer

Hello World CUDA compilation issues

I'm using the CUDA by Example book and attempting to compile the first real example in the book. I'm on OSX 10.9.2: My source is: @punk ~/Documents/Projects/CUDA$ /Developer/NVIDIA/CUDA-6.0/bin/nvcc hello.c nvcc warning : The 'compute_10' and…
mr-sk
5 votes, 3 answers

Meaning of bandwidth in CUDA and why it is important

The CUDA programming guide states that "Bandwidth is one of the most important gating factors for performance. Almost all changes to code should be made in the context of how they affect bandwidth." It goes on to calculate theoretical bandwidth…
zenna
5 votes, 1 answer

CUDA Double pointer memory copy

I wrote my sample code like this. int ** d_ptr; cudaMalloc( (void**)&d_ptr, sizeof(int*)*N ); int* tmp_ptr[N]; for(int i=0; i
Umbrella
5 votes, 1 answer

How do the warps schedule on CUDA SMs?

As the answer of this question shows, when a SM contains 8 CUDA cores(Compute Capability 1.3), a single warp of 32 threads takes 4 clock cycles to execute a single instruction for the whole warp. That is lane 1 to lane 8 of the warp concurrently…
Danny Zhu
5 votes, 1 answer

Summing the elements with even or odd indices by CUDA Thrust

If I use float sum = thrust::transform_reduce(d_a.begin(), d_a.end(), conditional_operator(), 0.f, thrust::plus()); I get the sum of all elements meeting a condition provided by conditional_operator(), as in Conditional reduction in…
Roshan
5 votes, 1 answer

Conditional reduction in CUDA

I need to sum about 100000 values stored in an array, but with conditions. Is there a way to do that in CUDA to produce fast results? Can anyone post a small code to do that?
Roshan
5 votes, 1 answer

Incompatibility error installing CUDA on Windows

I am on Windows 8.1 Pro and I want to install CUDA 5.5. I have installed Visual Studio 2013 already and I have the latest GPU driver's version 335.23. In the NVIDIA control panel I have also set CUDA - GPUs to GeForce GT 740M. My CPU is Intel Core…
Amir
5 votes, 2 answers

Polymorphism and derived classes in CUDA / CUDA Thrust

This is my first question on Stack Overflow, and it's quite a long question. The tl;dr version is: How do I work with a thrust::device_vector if I want it to store objects of different types DerivedClass1, DerivedClass2, etc,…
user3519303
5 votes, 2 answers

Shared Memory Bank Conflicts in CUDA: How memory is aligned to banks

As far as my understanding goes, shared memory is divided into banks and accesses by multiple threads to a single data element within the same bank will cause a conflict (or broadcast). At the moment I allocate a fairly large array which…
zenna