Questions tagged [cuda]

CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model for NVIDIA GPUs (Graphics Processing Units). CUDA provides an interface to NVIDIA GPUs through a variety of programming languages, libraries, and APIs.

Before posting CUDA questions, please read "How to get useful answers to your CUDA questions" below.

CUDA has an online documentation repository, updated with each release, including references for APIs and libraries; user guides for applications; and a detailed CUDA C/C++ Programming Guide.

The CUDA platform enables application development using several languages and associated APIs, including CUDA C/C++ (compiled with nvcc or clang), CUDA Fortran (via the NVIDIA HPC compilers), and Python (via Numba, CuPy, or PyCUDA), as well as the lower-level driver API and the runtime API.

There also exist third-party bindings for using CUDA in other languages and programming environments, such as Managed CUDA for .NET languages (including C#).

You should ask questions about CUDA here on Stack Overflow, but if you have bugs to report you should discuss them on the CUDA forums or report them via the registered developer portal. You may want to cross-link to any discussion here on SO.

The CUDA execution model is not multithreading in the usual sense, so please do not tag CUDA questions with [multithreading] unless your question involves thread safety of the CUDA APIs, or the use of both normal CPU multithreading and CUDA together.

How to get useful answers to your CUDA questions

Here are a number of suggestions to users new to CUDA. Follow these suggestions before asking your question and you are much more likely to get a satisfactory answer!

  • Always check the result codes returned by CUDA API functions to ensure you are getting cudaSuccess. If you are not, and you don't know why, include the error information in your question. This includes checking for errors caused by the most recent kernel launch, which may not be reported until you have called cudaDeviceSynchronize() or cudaStreamSynchronize(). More on checking for errors in CUDA in this question. A minimal error-checking sketch follows this list.
  • If you are getting "unspecified launch failure", it is possible that your code is causing the GPU equivalent of a segmentation fault, meaning the code is accessing memory that was not allocated for it to use. Try to verify that your indexing is correct, and check whether the CUDA Compute Sanitizer (or the legacy cuda-memcheck, for older GPUs and toolkits prior to CUDA 12) reports any errors. Note that both tools cover more than the default Memcheck; the other tools (Racecheck, Initcheck, Synccheck) must be selected explicitly.
  • The debugger for CUDA, cuda-gdb, is also very useful when you are not really sure what you are doing. You can monitor resources at the warp, thread, block, SM and grid level, and you can follow your program's execution. If a segmentation fault occurs in your program, cuda-gdb can help you find where the crash occurred and see what the context is. If you prefer a GUI for debugging, there are Nsight IDE integrations for Visual Studio (Windows), Visual Studio Code (Windows/Mac/Linux, though the GPU used for debugging must be on a Linux system) and Eclipse (Linux).
  • If you are getting syntax errors on CUDA keywords when compiling device code, make sure you are compiling with nvcc (or clang with CUDA support enabled) and that your source file has the expected .cu extension. If CUDA device functions or features you expect to be available are not found (atomic functions, warp vote functions, half-precision arithmetic, cooperative groups, etc.), ensure that you are explicitly passing compilation arguments that select an architecture (compute capability) which supports those features.
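
As a reference for the first suggestion, here is a minimal error-checking sketch using only standard CUDA runtime API calls (the checkCuda helper name is illustrative, not part of the CUDA API):

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Report any non-cudaSuccess result immediately and stop.
    static void checkCuda(cudaError_t err, const char *what) {
        if (err != cudaSuccess) {
            std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
            std::exit(EXIT_FAILURE);
        }
    }

    __global__ void kernel(float *x) { x[threadIdx.x] = 1.0f; }

    int main() {
        float *d_x = nullptr;
        checkCuda(cudaMalloc(&d_x, 32 * sizeof(float)), "cudaMalloc");
        kernel<<<1, 32>>>(d_x);
        checkCuda(cudaGetLastError(), "kernel launch");          // catches launch-configuration errors
        checkCuda(cudaDeviceSynchronize(), "kernel execution");  // catches errors raised while the kernel ran
        checkCuda(cudaFree(d_x), "cudaFree");
        return 0;
    }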


14278 questions
5 votes, 2 answers

Reducing matrix rows or columns in CUDA

I'm using CUDA with cuBLAS to perform matrix operations. I need to sum the rows (or columns) of a matrix. Currently I'm doing it by multiplying the matrix with a ones vector but this doesn't seem so efficient. Is there any better way? Couldn't find…
Ran
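
One common approach to the question above, kept entirely inside cuBLAS, is a matrix-vector product against a vector of ones; a minimal sketch assuming a single-precision, column-major matrix (the function and parameter names are illustrative):

    #include <cublas_v2.h>

    // Column sums of an m x n column-major matrix d_A (leading dimension m):
    // sums = A^T * ones, so each entry of d_sums is the sum of one column.
    void columnSums(cublasHandle_t handle, const float *d_A, int m, int n,
                    const float *d_ones /* length m, all 1.0f */,
                    float *d_sums /* length n */) {
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemv(handle, CUBLAS_OP_T, m, n, &alpha, d_A, m, d_ones, 1, &beta, d_sums, 1);
        // For row sums, use CUBLAS_OP_N with a ones vector of length n and an output of length m.
    }

A hand-written reduction kernel can be faster, but this keeps all the work inside library calls.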
5 votes, 4 answers

OpenCL examples with benchmarks

I'm looking for some introductory examples to OpenCL which illustrate the types of applications that can experience large (e.g., 50x-1000x) increases in speed. Cuda has lots of nice examples, but I haven't found the same thing for OpenCL. A nice…
Tristan
5 votes, 1 answer

JPEG library in CUDA

I am trying to compress and decompress images in CUDA. So far I've found this library: http://sourceforge.net/projects/cuj2k/?source=navbar But there isn't much documentation available. Does anyone know about any well documented (with example code)…
rootcage
5 votes, 3 answers

CUDA nvcc compiler setup Ubuntu 12.04

I successfully installed the nvidia driver and toolkit for cuda 5 (but not the samples) on a 64 bit Ubuntu 12.04 box. The samples failed to install even though I previously ran $ sudo apt-get install freeglut3-dev build-essential libx11-dev…
andandandand
5 votes, 2 answers

cudaMalloc always gives out of memory

I'm facing a simple problem, where all my calls to cudaMalloc fail, giving me an out of memory error, even if it's just a single byte I'm allocating. The CUDA device is available and there is also a lot of memory available (both checked with the…
Sleeme
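
For problems like the one above, a small diagnostic sketch (not a fix) that checks for a pending error, the selected device, and the reported free memory before attempting an allocation:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        // An earlier unchecked error (e.g. a failed kernel) can make later API calls fail.
        cudaError_t prev = cudaGetLastError();
        if (prev != cudaSuccess)
            printf("pending error: %s\n", cudaGetErrorString(prev));

        int dev = -1;
        printf("cudaGetDevice: %s\n", cudaGetErrorString(cudaGetDevice(&dev)));

        size_t freeB = 0, totalB = 0;
        printf("cudaMemGetInfo: %s\n", cudaGetErrorString(cudaMemGetInfo(&freeB, &totalB)));
        printf("device %d: %zu bytes free of %zu\n", dev, freeB, totalB);

        void *p = nullptr;
        printf("cudaMalloc(1 byte): %s\n", cudaGetErrorString(cudaMalloc(&p, 1)));
        cudaFree(p);
        return 0;
    }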
5 votes, 2 answers

CUDA 5.0 dynamic parallelism error: ptxas fatal: unresolved extern function 'cudaLaunchDevice'

I am using a Tesla K20 with compute capability 3.5 on Linux with CUDA 5. With a simple child kernel call it gives a compile error: unresolved extern function 'cudaLaunchDevice'. My command line looks like: nvcc --compile -G -O0 -g -gencode arch=compute_35…
Zahid
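
This error typically means the device runtime was not linked in: dynamic parallelism needs relocatable device code and cudadevrt in addition to an sm_35+ architecture. A minimal sketch, with the compile command shown as a comment (the file name is illustrative):

    // nvcc -arch=sm_35 -rdc=true child_launch.cu -o child_launch -lcudadevrt
    #include <cstdio>

    __global__ void child() { printf("child block %d\n", blockIdx.x); }

    __global__ void parent() {
        // Launching a kernel from device code requires compute capability 3.5+,
        // -rdc=true (relocatable device code) and linking against cudadevrt.
        child<<<2, 1>>>();
    }

    int main() {
        parent<<<1, 1>>>();
        cudaDeviceSynchronize();
        return 0;
    }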
5 votes, 1 answer

How do I avoid access violation exception calling a CUDA Dll?

I'm new to CUDA and not really familiar with C either. I wrote a DLL to expose CUDA methods (FFT) to my C# program. I first debugged the DLL as a console application to make sure it works properly, and only then built it as a DLL. So my…
Hodossy Szabolcs
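
A frequent cause of access violations across a C#/native boundary is a mismatch between the exported signature or calling convention and the [DllImport] declaration. A sketch of the native side only, with an illustrative function name (RunFft) and no error handling beyond return codes:

    // Exported with C linkage and an explicit calling convention so the
    // C# [DllImport] declaration can match it exactly.
    #include <cuda_runtime.h>

    extern "C" __declspec(dllexport) int __cdecl RunFft(const float *host_in,
                                                        float *host_out,
                                                        int n)
    {
        float *d_buf = nullptr;
        if (cudaMalloc(&d_buf, n * sizeof(float)) != cudaSuccess) return -1;
        cudaMemcpy(d_buf, host_in, n * sizeof(float), cudaMemcpyHostToDevice);
        // ... run cuFFT or a kernel on d_buf ...
        cudaMemcpy(host_out, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_buf);
        return 0;  // return an error code rather than letting anything escape the boundary
    }

On the C# side, the matching [DllImport] would specify CallingConvention.Cdecl and the same parameter types.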
5 votes, 2 answers

How do I retrieve the parameter list information for a CUDA 4.0+ kernel?

According to the NVidia documentation for the cuLaunchKernel function, kernels compiled with CUDA 3.2+ contain information regarding their parameter list. Is there a way to retrieve this information programmatically from a CUfunction handle? I need…
reirab
5 votes, 4 answers

How to transpose a matrix in CUDA/cublas?

Say I have a matrix with a dimension of A*B on GPU, where B (number of columns) is the leading dimension assuming a C style. Is there any method in CUDA (or cublas) to transpose this matrix to FORTRAN style, where A (number of rows) becomes the…
Hailiang Zhang
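
cuBLAS can produce an explicit (out-of-place) transpose with cublas<t>geam; a single-precision sketch assuming column-major storage (the function name is illustrative):

    #include <cublas_v2.h>

    // Writes the transpose of the m x n column-major matrix d_A (lda = m)
    // into the n x m matrix d_At (ldc = n). geam computes C = alpha*op(A) + beta*op(B).
    void transpose(cublasHandle_t handle, const float *d_A, int m, int n, float *d_At) {
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgeam(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                    n, m,              // dimensions of the result C (n x m)
                    &alpha, d_A, m,    // op(A) = A^T
                    &beta,  d_At, n,   // B is ignored because beta = 0
                    d_At, n);
    }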
5 votes, 3 answers

CUDA performance improves when running more threads than there are cores

Why does performance improve when I run more than 32 threads per block? My graphics card has 480 CUDA cores (15 SMs * 32 SPs).
user1885750
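
A short device-query sketch makes the point behind the question above concrete: each SM can keep far more threads resident than it has cores, and those extra threads are what the hardware switches between to hide memory and pipeline latency:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        printf("SMs: %d, warp size: %d, max resident threads per SM: %d\n",
               prop.multiProcessorCount, prop.warpSize, prop.maxThreadsPerMultiProcessor);
        // The last number is usually much larger than the core count per SM.
        return 0;
    }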
5 votes, 1 answer

cuBLAS argmin -- segfault if outputing to device memory?

In cuBLAS, cublasIsamin() gives the argmin for a single-precision array. Here's the full function declaration: cublasStatus_t cublasIsamin(cublasHandle_t handle, int n, const float *x, int incx, int…
solvingPuzzles
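
Regarding the question above: whether the result pointer of cublasIsamin may be a device pointer is governed by the cuBLAS pointer mode; a minimal sketch (the wrapper function name is illustrative):

    #include <cublas_v2.h>

    // Write the (1-based) index of the minimum-magnitude element to device memory.
    void argminToDevice(cublasHandle_t handle, const float *d_x, int n, int *d_result) {
        // Without this, cuBLAS treats the result pointer as a host pointer, and
        // writing through a device address from the host side faults.
        cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);
        cublasIsamin(handle, n, d_x, 1, d_result);
        cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_HOST);  // restore the default
    }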
5 votes, 2 answers

Will 32 threads from 32 blocks be scheduled as a warp?

I understand that in CUDA, 32 adjacent threads in the same block will be scheduled as a warp. But I frequently find tutorial CUDA code that has multiple blocks with 1 thread per block. In this model, will 32 threads from 32 blocks be scheduled…
Hailiang Zhang
5 votes, 1 answer

CUDA result returns garbage using very large array, but reports no error

I am creating a test program that will create a device and a host array of size n and then launch a kernel that creates n threads which assign the constant value 0.95f to each location in the device array. After completion, the device array is…
TVOHM
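
For very large arrays, as in the question above, it is easy to exceed grid-dimension limits or overflow 32-bit index arithmetic without any obvious failure; a grid-stride loop (sketched with an illustrative kernel name) handles arbitrary sizes with a bounded grid and pairs with the error checks described in the tag wiki above:

    #include <cuda_runtime.h>

    __global__ void fill(float *out, size_t n, float value) {
        // Each thread strides over the array, so n may exceed gridDim.x * blockDim.x.
        size_t stride = (size_t)gridDim.x * blockDim.x;
        for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            out[i] = value;
    }

    // Launch example: fill<<<256, 256>>>(d_arr, n, 0.95f);
    // followed by cudaGetLastError() / cudaDeviceSynchronize() checks.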
5 votes, 1 answer

nvcc: command not found

I installed the CUDA SDK 5.0 to /opt and even compiled all the examples, but I can't execute nvcc. Here is some console output: I'm using Linux Mint 13.
user983302
5 votes, 1 answer

C++ and CUDA: why does the code return different results each time?

Update: I found the bug. Since the code I posted before is very complicated, I simplified it and kept only the part where the problem is: if (number >= dim * num_points) return; But actually, I only have num_points, I want to use num_points…