Questions tagged [cuda]

CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model for NVIDIA GPUs (Graphics Processing Units). CUDA provides an interface to NVIDIA GPUs through a variety of programming languages, libraries, and APIs.

Before posting CUDA questions, please read "How to get useful answers to your CUDA questions" below.

CUDA has an online documentation repository, updated with each release, including references for APIs and libraries; user guides for applications; and a detailed CUDA C/C++ Programming Guide.

The CUDA platform enables application development using several languages and associated APIs, including CUDA C and C++ (compiled with nvcc), CUDA Fortran, and Python (via libraries such as Numba and CuPy).

There also exist third-party bindings for using CUDA in other languages and programming environments, such as Managed CUDA for .NET languages (including C#).
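For reference, here is a minimal CUDA C++ example (a sketch only, not tied to any question below; note the .cu file extension and compilation with nvcc, as discussed in the suggestions further down):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread adds one element; the grid is sized to cover all n elements.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    // Managed (unified) memory keeps the example short;
    // explicit cudaMalloc + cudaMemcpy works equally well.
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaError_t err = cudaDeviceSynchronize();  // also surfaces kernel errors
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("c[0] = %f\n", c[0]);  // 1.0f + 2.0f
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Compile with `nvcc vecadd.cu -o vecadd`; running it requires an NVIDIA GPU and driver.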

You should ask questions about CUDA here on Stack Overflow. If you have bugs to report, discuss them on the CUDA forums or report them via the registered developer portal; you may want to cross-link any such discussion to a question here on SO.

The CUDA execution model is not multithreading in the usual sense, so please do not tag CUDA questions with [multithreading] unless your question involves thread safety of the CUDA APIs, or the use of both normal CPU multithreading and CUDA together.

How to get useful answers to your CUDA questions

Here are a number of suggestions for users new to CUDA. Follow them before asking your question and you are much more likely to get a satisfactory answer!

  • Always check the result codes returned by CUDA API functions to ensure you are getting cudaSuccess. If you are not, and you don't know why, include the information about the error in your question. This includes checking for errors caused by the most recent kernel launch, which may not be reported until you call cudaDeviceSynchronize() or cudaStreamSynchronize(). More on checking for errors in CUDA in this question.
  • If you are getting unspecified launch failure, it is possible that your code is causing a segmentation fault, meaning the code is accessing memory it has not been allocated. Try to verify that your indexing is correct, and check whether the CUDA Compute Sanitizer (or the legacy cuda-memcheck, for older GPUs up to CUDA 12) reports any errors. Note that both tools encompass more than the default Memcheck; the other tools (Racecheck, Initcheck, Synccheck) must be selected explicitly.
  • The debugger for CUDA, cuda-gdb, is also very useful when you are not sure what is going wrong. You can monitor resources at the warp, thread, block, SM and grid level, and follow your program's execution. If a segmentation fault occurs in your program, cuda-gdb can help you find where the crash occurred and inspect the context. If you prefer a GUI for debugging, there are IDE integrations for Visual Studio (Windows), Visual Studio Code (Windows/Mac/Linux, but the GPU being debugged must be on a Linux system) and Eclipse (Linux).
  • If you are finding that you are getting syntax errors on CUDA keywords when compiling device code, make sure you are compiling using nvcc (or clang with CUDA support enabled) and that your source file has the expected .cu extension. If you find that CUDA device functions or feature namespaces you expect to work are not found (atomic functions, warp voting functions, half-precision arithmetic, cooperative groups, etc.), ensure that you are explicitly passing compilation arguments (e.g. nvcc's -arch or -gencode options) that select an architecture supporting those features.
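The error-checking advice above is often condensed into a small helper macro. The following is one common pattern (the macro name is illustrative, not part of the CUDA API):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call; print file/line and abort on any failure.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc(&ptr, bytes));
//   myKernel<<<grid, block>>>(args);
//   CUDA_CHECK(cudaGetLastError());       // launch errors (bad config, etc.)
//   CUDA_CHECK(cudaDeviceSynchronize());  // errors raised during execution
```

Running the binary under `compute-sanitizer ./myapp` (or `cuda-memcheck ./myapp` with older toolkits) will then pinpoint out-of-bounds accesses that would otherwise surface only as an unspecified launch failure.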

14278 questions
5 votes, 1 answer

Miscellaneous and Inter-Thread Communication Instructions in CUDA

I've been playing around with the NVIDIA profiler (nvprof) and there are two particular metrics which I do not understand: inst_inter_thread_communication Number of inter-thread communication instructions executed by non-predicated…
squirem
5 votes, 1 answer

Is CUDA warp scheduling deterministic?

I am wondering if the warp scheduling order of a CUDA application is deterministic. Specifically I am wondering if the ordering of warp execution will stay the same with multiple runs of the same kernel with the same input data on the same device.…
NothingMore
5 votes, 1 answer

Separating even and odd numbers in CUDA

I have an array of numbers as {1,2,3,4,5,6,7,8,9,10} and I want to separate even and odd numbers as: even = {2,4,6,8} and: odd = {1,3,5,7} I am aware of atomic operations in CUDA, and also aware that the output is not expected to suffer from race…
Laxmi Kadariya
5 votes, 1 answer

How to properly link cuda header file with device functions?

I'm trying to decouple my code a bit and something fails. Compilation error: error: calling a __host__ function("DecoupledCallGpu") from a __global__ function("kernel") is not allowed Code excerpt: main.c (has a call to cuda host…
Denys S.
5 votes, 2 answers

How to measure overhead of a kernel launch in CUDA

I want to measure the overhead of a kernel launch in CUDA. I understand that there are various parameters which affect this overhead. I am interested in the following: number of threads created size of data being copied I am doing this mainly to…
pranith
5 votes, 3 answers

GPU programming - transfer bottlenecks

As I would like my GPU to do some of calculation for me, I am interested in the topic of measuring a speed of 'texture' upload and download - because my 'textures' are the data that GPU should crunch. I know that transfer from main memory to GPU…
Daniel Mošmondor
5 votes, 1 answer

Hello World CUDA compilation issues

I'm using the CUDA by Example book and attempting to compile the first real example in the book. I'm on OSX 10.9.2: My source is: @punk ~/Documents/Projects/CUDA$ /Developer/NVIDIA/CUDA-6.0/bin/nvcc hello.c nvcc warning : The 'compute_10' and…
mr-sk
5 votes, 3 answers

Meaning of bandwidth in CUDA and why it is important

The CUDA programming guide states that "Bandwidth is one of the most important gating factors for performance. Almost all changes to code should be made in the context of how they affect bandwidth." It goes on to calculate theoretical bandwidth…
zenna
5 votes, 1 answer

CUDA Double pointer memory copy

I wrote my sample code like this. int ** d_ptr; cudaMalloc( (void**)&d_ptr, sizeof(int*)*N ); int* tmp_ptr[N]; for(int i=0; i
Umbrella
5 votes, 1 answer

How do the warps schedule on CUDA SMs?

As the answer of this question shows, when a SM contains 8 CUDA cores(Compute Capability 1.3), a single warp of 32 threads takes 4 clock cycles to execute a single instruction for the whole warp. That is lane 1 to lane 8 of the warp concurrently…
Danny Zhu
5 votes, 1 answer

Summing the elements with even or odd indices by CUDA Thrust

If I use float sum = thrust::transform_reduce(d_a.begin(), d_a.end(), conditional_operator(), 0.f, thrust::plus()); I get the sum of all elements meeting a condition provided by conditional_operator(), as in Conditional reduction in…
Roshan
5 votes, 1 answer

Conditional reduction in CUDA

I need to sum about 100000 values stored in an array, but with conditions. Is there a way to do that in CUDA to produce fast results? Can anyone post a small code to do that?
Roshan
5 votes, 1 answer

Incompatibility error installing CUDA on Windows

I am on Windows 8.1 Pro and I want to install CUDA 5.5. I have installed Visual Studio 2013 already and I have the latest GPU driver's version 335.23. In the NVIDIA control panel I have also set CUDA - GPUs to GeForce GT 740M. My CPU is Intel Core…
Amir
5 votes, 2 answers

Polymorphism and derived classes in CUDA / CUDA Thrust

This is my first question on Stack Overflow, and it's quite a long question. The tl;dr version is: How do I work with a thrust::device_vector if I want it to store objects of different types DerivedClass1, DerivedClass2, etc,…
user3519303
5 votes, 2 answers

Shared Memory Bank Conflicts in CUDA: How memory is aligned to banks

As far as my understanding goes, shared memory is divided into banks and accesses by multiple threads to a single data element within the same bank will cause a conflict (or broadcast). At the moment I allocate a fairly large array which…
zenna