CUDA streams are queues of work (kernel launches, memory transfers, etc.) submitted to a CUDA GPU. Operations within one stream execute in issue order; operations in different streams may execute concurrently.
Questions tagged [cuda-streams]
78 questions
1
vote
2 answers
CUDA 4.0 RC - many host threads per one GPU - cudaStreamQuery and cudaStreamSynchronize behaviour
I wrote code which uses many host (OpenMP) threads per one GPU. Each thread has its own CUDA stream to order its requests. It looks very similar to the code below:
#pragma omp parallel for num_threads(STREAM_NUMBER)
for (int sid = 0; sid <…

kokosing
- 5,251
- 5
- 37
- 50
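The pattern the asker describes can be sketched as follows: one stream per OpenMP thread, with each thread enqueueing and synchronizing only its own stream. This is a minimal illustration, not the asker's full code; `STREAM_NUMBER` and the commented-out kernel launch are placeholders.

```cuda
#include <cuda_runtime.h>
#include <omp.h>

#define STREAM_NUMBER 4

int main() {
    cudaStream_t streams[STREAM_NUMBER];
    for (int i = 0; i < STREAM_NUMBER; ++i)
        cudaStreamCreate(&streams[i]);

    #pragma omp parallel for num_threads(STREAM_NUMBER)
    for (int sid = 0; sid < STREAM_NUMBER; ++sid) {
        // Each host thread enqueues work only on its own stream; within
        // one stream, operations run in the order they were issued.
        // kernel<<<grid, block, 0, streams[sid]>>>(...);
        cudaStreamSynchronize(streams[sid]);  // or poll with cudaStreamQuery
    }

    for (int i = 0; i < STREAM_NUMBER; ++i)
        cudaStreamDestroy(streams[i]);
    return 0;
}
```

`cudaStreamQuery` returns `cudaSuccess` once all work in the stream has completed, and `cudaErrorNotReady` otherwise, so it can be used for non-blocking polling where `cudaStreamSynchronize` would block.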
1
vote
1 answer
How big is a cudaStream_t?
I have inherited some code that basically does stuff like this:
void *stream;
cudaStreamCreate((cudaStream_t *)&stream);
Looking at targets/x86_64-linux/driver_types.h for CUDA 8, I see:
typedef __device_builtin__ struct CUStream_st…

Ken Y-N
- 14,644
- 21
- 71
- 114
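The short answer: `cudaStream_t` is a typedef for a pointer to an opaque struct, so it is pointer-sized, and storing the handle in a `void *` as the inherited code does is size-compatible (if stylistically questionable). A minimal check, assuming a CUDA toolchain:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    printf("sizeof(cudaStream_t) = %zu, sizeof(void*) = %zu\n",
           sizeof(cudaStream_t), sizeof(void *));
    // Both print the platform pointer size (8 on x86_64), since the
    // handle is just a pointer to an opaque driver-side object.
    return 0;
}
```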
1
vote
2 answers
Why do cudaMemcpyAsync and kernel launches block even with an asynchronous stream?
Consider the following program for enqueueing some work on a non-blocking GPU stream:
#include
using clock_value_t = long long;
__device__ void gpu_sleep(clock_value_t sleep_cycles) {
    clock_value_t start = clock64();
    …

einpoklum
- 118,144
- 57
- 340
- 684
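A frequent cause of the behavior in the title is pageable host memory: `cudaMemcpyAsync` only returns immediately when the host buffer is pinned. A sketch using pinned memory with a non-blocking stream (sizes are illustrative):

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 20;
    float *h_buf, *d_buf;
    cudaMallocHost(&h_buf, n * sizeof(float));   // pinned host memory
    cudaMalloc(&d_buf, n * sizeof(float));

    cudaStream_t s;
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);

    // With a pinned source buffer, this enqueues the copy and returns
    // to the host immediately.
    cudaMemcpyAsync(d_buf, h_buf, n * sizeof(float),
                    cudaMemcpyHostToDevice, s);

    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```

Note also that kernel *launches* are asynchronous, but the launch call itself can block briefly if the driver's pending-launch queue is full.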
1
vote
1 answer
CUDA streams are blocking despite Async
I'm working on a real-time video stream that I try to process with a GeForce GTX 960M. (Windows 10, VS 2013, CUDA 8.0)
Each frame has to be captured, lightly blurred, and whenever I can, I need to do some hard-work calculations on the 10 latest…

Charlie Echo
- 87
- 1
- 5
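One pattern for the "whenever I can" part of this question is to keep the heavy work on its own stream and poll it with `cudaStreamQuery` from the capture loop, so the per-frame blur is never blocked behind it. A sketch under those assumptions (kernels are placeholders, not the asker's code):

```cuda
#include <cuda_runtime.h>

int main() {
    cudaStream_t frameStream, heavyStream;
    cudaStreamCreateWithFlags(&frameStream, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&heavyStream, cudaStreamNonBlocking);

    bool heavyInFlight = false;
    for (int frame = 0; frame < 100; ++frame) {
        // blurKernel<<<grid, block, 0, frameStream>>>(...);  // light per-frame work

        // Launch the next heavy batch only once the previous one has drained.
        if (!heavyInFlight || cudaStreamQuery(heavyStream) == cudaSuccess) {
            // heavyKernel<<<grid, block, 0, heavyStream>>>(...);
            heavyInFlight = true;
        }
    }
    cudaStreamSynchronize(frameStream);
    cudaStreamSynchronize(heavyStream);
    cudaStreamDestroy(frameStream);
    cudaStreamDestroy(heavyStream);
    return 0;
}
```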
1
vote
1 answer
CUDA FFT plan reuse across multiple 'overlapped' CUDA Stream launches
I'm trying to improve the performance of my code using asynchronous memory transfers overlapped with GPU computation.
Formerly I had code where I created an FFT plan and then made use of it multiple times. In such a situation the time invested…

Omar Valerio
- 43
- 7
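A cuFFT plan can be reused across launches, and `cufftSetStream` re-associates it with whichever stream the next execution should run in. A sketch (plan size and data are illustrative):

```cuda
#include <cufft.h>
#include <cuda_runtime.h>

int main() {
    cufftHandle plan;
    cufftPlan1d(&plan, 1024, CUFFT_C2C, 1);   // create once, reuse many times

    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);

    cufftComplex *d_data;
    cudaMalloc(&d_data, 1024 * sizeof(cufftComplex));

    for (int i = 0; i < 2; ++i) {
        cufftSetStream(plan, s[i]);           // bind the plan to this launch's stream
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
        cudaStreamSynchronize(s[i]);          // serialized here only for the sketch
    }

    cufftDestroy(plan);
    cudaFree(d_data);
    for (int i = 0; i < 2; ++i) cudaStreamDestroy(s[i]);
    return 0;
}
```

Caveat: a single plan owns a single work area, so two executions of the *same* plan must not actually be in flight concurrently; to overlap transforms in different streams, create one plan per stream.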
1
vote
1 answer
The behavior of stream 0 (default) and other streams
In CUDA, how is stream 0 related to other streams? Does stream 0 (default stream) execute concurrently with other streams in a context or not?
Considering the following example:
cudaMemcpy(Dst, Src, sizeof(float)*datasize,…

user2188453
- 1,105
- 1
- 12
- 26
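As background to this question: the legacy default stream (stream 0) synchronizes with every "blocking" stream, so work issued to it will not overlap with work in streams created by plain `cudaStreamCreate`. Streams created with `cudaStreamNonBlocking` (or code compiled with `--default-stream per-thread`) opt out of that implicit synchronization. A sketch:

```cuda
#include <cuda_runtime.h>

int main() {
    cudaStream_t blocking, nonBlocking;
    cudaStreamCreate(&blocking);                                    // synchronizes with stream 0
    cudaStreamCreateWithFlags(&nonBlocking, cudaStreamNonBlocking); // does not

    // kernelA<<<g, b, 0, 0>>>(...);            // legacy default stream
    // kernelB<<<g, b, 0, blocking>>>(...);     // serialized against kernelA
    // kernelC<<<g, b, 0, nonBlocking>>>(...);  // may overlap kernelA

    cudaDeviceSynchronize();
    cudaStreamDestroy(blocking);
    cudaStreamDestroy(nonBlocking);
    return 0;
}
```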
1
vote
1 answer
Global Memory and CUDA streams
I'm working on CUDA and I have a question about global memory and CUDA streams.
Let:
__device__ float Aux[32];
__global__ void kernel1(...) {
    [...]
    Aux[threadIdx.y] = 0;
    [...]
}
So, if I run this kernel in different GPU streams, is Aux the…

userCUDA
- 11
- 2
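For context: a `__device__` variable like `Aux` exists once in global memory per device, so every launch of the kernel, in every stream, sees the same 32 floats; concurrent kernels writing it race against each other. The usual fix is to give each stream its own buffer, passed as a kernel argument. A sketch under that assumption (`NUM_STREAMS` is illustrative):

```cuda
#include <cuda_runtime.h>

#define NUM_STREAMS 2

__global__ void kernel1(float *aux) {
    aux[threadIdx.y] = 0.0f;   // each stream writes its own buffer
}

int main() {
    cudaStream_t s[NUM_STREAMS];
    float *aux[NUM_STREAMS];
    for (int i = 0; i < NUM_STREAMS; ++i) {
        cudaStreamCreate(&s[i]);
        cudaMalloc(&aux[i], 32 * sizeof(float));
        kernel1<<<1, dim3(1, 32), 0, s[i]>>>(aux[i]);
    }
    for (int i = 0; i < NUM_STREAMS; ++i) {
        cudaStreamSynchronize(s[i]);
        cudaFree(aux[i]);
        cudaStreamDestroy(s[i]);
    }
    return 0;
}
```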
1
vote
1 answer
Stream scheduling order
The way I see it, Process One and Process Two (below) are equivalent in that they take the same amount of time. Am I wrong?
allOfData_A= data_A1 + data_A2
allOfData_B= data_B1 + data_B2
allOfData_C= data_C1 + data_C2
Data_C is the output of the…

Doug
- 2,783
- 6
- 33
- 37
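Whether two issue orders take the same time depends on the hardware's copy queues: on devices with a single copy-engine queue, a depth-first order (copy-in, kernel, copy-out per chunk) is what lets chunk i+1's transfer overlap chunk i's kernel, while issuing all copies first can serialize everything. A sketch of the depth-first pattern (names are illustrative, not the asker's):

```cuda
#include <cuda_runtime.h>

#define CHUNKS 2

__global__ void process(float *d, size_t n) { /* ... */ }

int main() {
    const size_t n = 1 << 20;
    float *h[CHUNKS], *d[CHUNKS];
    cudaStream_t s[CHUNKS];
    for (int i = 0; i < CHUNKS; ++i) {
        cudaMallocHost(&h[i], n * sizeof(float));  // pinned, required for overlap
        cudaMalloc(&d[i], n * sizeof(float));
        cudaStreamCreate(&s[i]);
    }

    // Depth-first: each chunk's copy-in, kernel, and copy-out go into its
    // own stream back to back.
    for (int i = 0; i < CHUNKS; ++i) {
        cudaMemcpyAsync(d[i], h[i], n * sizeof(float), cudaMemcpyHostToDevice, s[i]);
        process<<<256, 256, 0, s[i]>>>(d[i], n);
        cudaMemcpyAsync(h[i], d[i], n * sizeof(float), cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < CHUNKS; ++i) {
        cudaFreeHost(h[i]);
        cudaFree(d[i]);
        cudaStreamDestroy(s[i]);
    }
    return 0;
}
```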
0
votes
0 answers
The cudaMemcpyAsync interaction with pageable host memory
I am beginning to learn CUDA programming. In learning about streams and the async/sync features, I have encountered some problems. As stated in the Nvidia docs and many other sources, cudaMemcpyAsync can be used to overlap data transfer…

CabbageHuge
- 1
- 1
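The key point behind this question: with pageable host memory, `cudaMemcpyAsync` falls back to a staged copy and may behave synchronously with respect to the host, so no overlap is observed; pinning the buffer restores truly asynchronous behavior. A sketch contrasting the two allocations:

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

int main() {
    const size_t bytes = 1 << 24;
    float *pageable = (float *)malloc(bytes);  // overlap NOT guaranteed from here
    float *pinned;
    cudaMallocHost(&pinned, bytes);            // overlap possible from here

    float *d;
    cudaMalloc(&d, bytes);
    cudaStream_t s;
    cudaStreamCreate(&s);

    // Asynchronous with respect to the host only because `pinned` is
    // page-locked; the same call with `pageable` may block.
    cudaMemcpyAsync(d, pinned, bytes, cudaMemcpyHostToDevice, s);

    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
    cudaFree(d);
    cudaFreeHost(pinned);
    free(pageable);
    return 0;
}
```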
0
votes
0 answers
Gstreamer create custom CUDA plugin
I want to implement a custom plugin which processes only GPU frames (memory:CUDAMemory) and also updates the frame (consider creating an overlay on the video).
$./gst-launch-1.0 videotestsrc ! cudaupload ! 'video/x-raw(memory:CUDAMemory)' !…

Pankaj Buddhe
- 1
- 2
0
votes
1 answer
What does CU_MEMPOOL_ATTR_REUSE_ALLOW_OPPORTUNISTIC actually allow?
One of the attributes of CUDA memory pools is CU_MEMPOOL_ATTR_REUSE_ALLOW_OPPORTUNISTIC, described in the doxygen as follows:
Allow reuse of already completed frees when there is no dependency between the free and allocation.
If a free (a…

einpoklum
- 118,144
- 57
- 340
- 684
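For reference, the runtime API exposes the same attribute as `cudaMemPoolReuseAllowOpportunistic`. A sketch of toggling it on the default pool (device 0 assumed; the attribute is an int-valued on/off switch):

```cuda
#include <cuda_runtime.h>

int main() {
    cudaMemPool_t pool;
    cudaDeviceGetDefaultMemPool(&pool, 0);

    int allow = 1;  // enable reuse of freed allocations whose frees have
                    // already completed, even without a stream dependency
    cudaMemPoolSetAttribute(pool, cudaMemPoolReuseAllowOpportunistic, &allow);
    return 0;
}
```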
0
votes
1 answer
Is it possible to execute more than one CUDA graph's host execution node in different streams concurrently?
Investigating possible solutions for this problem, I thought about using CUDA graphs' host execution nodes (cudaGraphAddHostNode). I was hoping to have the option to block and unblock streams on the host side instead of the device side with the wait…

surabax
- 15
- 5
0
votes
1 answer
What is cuEventRecord guaranteed to do if it gets the default-stream's handle?
Suppose I call cuEventRecord(0, my_event_handle).
cuEventRecord() requires the stream and the event to belong to the same context. Now, one can interpret the 0 as "the default stream in the appropriate context" - the requirements are satisfied and…

einpoklum
- 118,144
- 57
- 340
- 684
0
votes
1 answer
How can I make sure two kernels in two streams are sent to the GPU at the same time to run?
I am a beginner in CUDA. I am using an NVIDIA GeForce GTX 1070, CUDA toolkit 11.3, and Ubuntu 18.04.
As shown in the code below, I use two CPU threads to send two kernels in the form of two streams to a GPU. I want exactly these two kernels to be sent…

mehran
- 191
- 10
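Worth noting for this question: no CUDA API guarantees two kernels *start* at the same instant. The closest portable approach is to issue both launches back to back into separate non-blocking streams from a single host thread, which minimizes the gap and lets them overlap on the GPU if resources allow. A minimal sketch:

```cuda
#include <cuda_runtime.h>

__global__ void work(int id) { /* ... */ }

int main() {
    cudaStream_t s0, s1;
    cudaStreamCreateWithFlags(&s0, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);

    // Launches are asynchronous; issuing them back to back from one host
    // thread avoids the scheduling jitter of two CPU threads racing.
    work<<<1, 32, 0, s0>>>(0);
    work<<<1, 32, 0, s1>>>(1);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    return 0;
}
```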
0
votes
1 answer
Reusing cudaEvent to serialize multiple streams
Suppose I have a struct:
typedef enum {ON_CPU,ON_GPU,ON_BOTH} memLocation;
typedef struct foo *foo;
struct foo {
    cudaEvent_t event;
    float *deviceArray;
    float *hostArray;
    memLocation arrayLocation;
};
a function:
void…

Jacob Faib
- 1,062
- 7
- 22
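The idiom this question is about: record the struct's event in the producing stream, then make the consuming stream wait on it with `cudaStreamWaitEvent`. Events can be reused; a new `cudaEventRecord` overwrites the previous recording, so waits enqueued afterwards observe only the newest one. A sketch (stream names are illustrative):

```cuda
#include <cuda_runtime.h>

int main() {
    cudaStream_t producer, consumer;
    cudaStreamCreate(&producer);
    cudaStreamCreate(&consumer);

    cudaEvent_t event;
    cudaEventCreateWithFlags(&event, cudaEventDisableTiming);

    // ... enqueue copies/kernels on `producer` ...
    cudaEventRecord(event, producer);          // capture producer's progress
    cudaStreamWaitEvent(consumer, event, 0);   // consumer waits for it
    // ... enqueue dependent kernels on `consumer` ...

    cudaDeviceSynchronize();
    cudaEventDestroy(event);
    cudaStreamDestroy(producer);
    cudaStreamDestroy(consumer);
    return 0;
}
```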