CUDA streams are the hardware-supported queues on CUDA GPUs through which work (kernel launches, memory transfers, etc.) is scheduled.
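As a minimal sketch of the idea, the program below queues a host-to-device copy, a kernel launch, and a device-to-host copy on a single stream. The kernel `addOne` and the helper `runOnStream` are hypothetical names chosen for illustration; error checking is kept to a minimum for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: adds 1.0f to every element.
__global__ void addOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Queues copy-in, kernel, and copy-out on one stream; returns true if the results check out.
bool runOnStream(int n) {
    float *host = nullptr, *dev = nullptr;
    cudaStream_t stream;
    cudaMallocHost(&host, n * sizeof(float));  // pinned host memory
    cudaMalloc(&dev, n * sizeof(float));
    cudaStreamCreate(&stream);
    for (int i = 0; i < n; ++i) host[i] = float(i);

    // Operations queued on the same stream execute in issue order.
    cudaMemcpyAsync(dev, host, n * sizeof(float), cudaMemcpyHostToDevice, stream);
    addOne<<<(n + 255) / 256, 256, 0, stream>>>(dev, n);
    cudaMemcpyAsync(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost, stream);

    // Block the host until everything queued on the stream has finished.
    bool ok = (cudaStreamSynchronize(stream) == cudaSuccess);
    for (int i = 0; ok && i < n; ++i) ok = (host[i] == float(i) + 1.0f);

    cudaStreamDestroy(stream);
    cudaFree(dev);
    cudaFreeHost(host);
    return ok;
}

int main() {
    std::printf("%s\n", runOnStream(1 << 20) ? "ok" : "failed");
    return 0;
}
```

Compile with `nvcc`; the fourth launch parameter in `<<<grid, block, sharedMem, stream>>>` is what selects the stream.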
Questions tagged [cuda-streams]
78 questions
3 votes, 1 answer
CUDA streams and concurrent kernel execution
I would like to use streams in order to parallelize the execution of kernels that work on separate device data arrays. Data were allocated on the device and filled from previous kernels.
I have written the following program that shows I can't reach…

J. Bailleul
2 votes, 1 answer
Why am I unable to establish a pipeline when using multiple concurrent streams in CUDA programming?
I wish to construct a pipeline using multiple streams. Below is the code I have written:
using namespace std;
__global__ void vecAdd(float *c, const float *a, const float *b);
void initBuffer(float *data, int size);
int main() {
int size = 1…

Aitar
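For readers unfamiliar with the pattern this question is about, here is a hedged sketch of a chunked copy/compute pipeline over several streams. It is not the asker's code: the kernel `scale`, the helper `runPipeline`, and the chunk sizes are assumptions made for illustration. Whether copies and kernels actually overlap depends on the device and on the host buffer being pinned.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: multiplies each element by `factor`.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Processes `numChunks` chunks, assigning them round-robin to `numStreams` streams.
bool runPipeline(int numChunks, int chunkSize, int numStreams) {
    const int total = numChunks * chunkSize;
    const size_t chunkBytes = chunkSize * sizeof(float);
    float *host = nullptr, *dev = nullptr;
    cudaMallocHost(&host, total * sizeof(float));  // pinned memory, needed for asynchronous copies
    cudaMalloc(&dev, total * sizeof(float));
    for (int i = 0; i < total; ++i) host[i] = 1.0f;

    std::vector<cudaStream_t> streams(numStreams);
    for (auto &s : streams) cudaStreamCreate(&s);

    for (int c = 0; c < numChunks; ++c) {
        cudaStream_t s = streams[c % numStreams];
        float *h = host + c * chunkSize;
        float *d = dev + c * chunkSize;
        // Within one stream these three steps stay ordered;
        // steps in different streams are free to overlap.
        cudaMemcpyAsync(d, h, chunkBytes, cudaMemcpyHostToDevice, s);
        scale<<<(chunkSize + 255) / 256, 256, 0, s>>>(d, chunkSize, 2.0f);
        cudaMemcpyAsync(h, d, chunkBytes, cudaMemcpyDeviceToHost, s);
    }

    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    for (int i = 0; ok && i < total; ++i) ok = (host[i] == 2.0f);

    for (auto &s : streams) cudaStreamDestroy(s);
    cudaFree(dev);
    cudaFreeHost(host);
    return ok;
}

int main() {
    std::printf("%s\n", runPipeline(8, 1 << 18, 4) ? "ok" : "failed");
    return 0;
}
```

A timeline view in a profiler such as nsys is the usual way to confirm whether the stages really overlap on a given GPU.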
2 votes, 2 answers
Using multi streams in cuda graph, the execution order is uncontrolled
I am using the CUDA graph stream capture API to implement a small demo with multiple streams. Following the CUDA Programming Guide, I wrote the complete code. To my knowledge, kernelB should execute on stream1, but with nsys I found kernelB is…

poohRui
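The fork/join capture pattern this question refers to can be sketched as follows. This is an illustrative reconstruction, not the asker's code: `addValue`, `runCapturedGraph`, and the stream and event names are hypothetical, and `cudaGraphInstantiateWithFlags` is assumed to be available (CUDA 11.4 or later). As far as I understand, capture records only the dependencies between operations; the streams used during capture need not be the streams the nodes run on when the graph is launched.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: adds `v` to a single device integer.
__global__ void addValue(int *x, int v) { *x += v; }

bool runCapturedGraph() {
    int *a = nullptr, *b = nullptr;
    cudaMalloc(&a, sizeof(int));
    cudaMalloc(&b, sizeof(int));
    cudaMemset(a, 0, sizeof(int));
    cudaMemset(b, 0, sizeof(int));
    cudaDeviceSynchronize();

    cudaStream_t origin, side;
    cudaEvent_t forkEvent, joinEvent;
    cudaStreamCreate(&origin);
    cudaStreamCreate(&side);
    cudaEventCreateWithFlags(&forkEvent, cudaEventDisableTiming);
    cudaEventCreateWithFlags(&joinEvent, cudaEventDisableTiming);

    cudaGraph_t graph;
    cudaGraphExec_t exec;
    cudaStreamBeginCapture(origin, cudaStreamCaptureModeGlobal);
    addValue<<<1, 1, 0, origin>>>(a, 1);      // node A
    cudaEventRecord(forkEvent, origin);
    cudaStreamWaitEvent(side, forkEvent, 0);  // `side` joins the capture, after A
    addValue<<<1, 1, 0, side>>>(b, 10);       // node B, depends on A
    addValue<<<1, 1, 0, origin>>>(a, 100);    // node C, depends on A
    cudaEventRecord(joinEvent, side);
    cudaStreamWaitEvent(origin, joinEvent, 0);  // join `side` back into `origin`
    addValue<<<1, 1, 0, origin>>>(a, 1000);   // node D, depends on B and C
    cudaStreamEndCapture(origin, &graph);

    // Nothing has executed yet; the kernels run only when the graph is launched.
    cudaGraphInstantiateWithFlags(&exec, graph, 0);
    cudaGraphLaunch(exec, origin);
    bool ok = (cudaStreamSynchronize(origin) == cudaSuccess);

    int hostA = 0, hostB = 0;
    cudaMemcpy(&hostA, a, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&hostB, b, sizeof(int), cudaMemcpyDeviceToHost);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaEventDestroy(forkEvent);
    cudaEventDestroy(joinEvent);
    cudaStreamDestroy(origin);
    cudaStreamDestroy(side);
    cudaFree(a);
    cudaFree(b);
    return ok && hostA == 1101 && hostB == 10;
}

int main() {
    std::printf("%s\n", runCapturedGraph() ? "ok" : "failed");
    return 0;
}
```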
2 votes, 1 answer
How can I pause a CUDA stream and then resume it?
Suppose we have two CUDA streams running two CUDA kernels on a GPU at the same time. How can I pause a running CUDA kernel with an instruction that I put in the host code, and then resume it with another instruction in the host code?
I have no idea how to…

mehran
2 votes, 0 answers
Is the profiler wrong, or is the scheduling messed up, or both?
Consider the following program:
#include
#include
#include
using clock_value_t = long long;
__device__ void gpu_sleep(clock_value_t sleep_cycles)
{
clock_value_t start = clock64();
clock_value_t…

einpoklum
2 votes, 1 answer
cudaMemcpyAsync execution is delayed for some reason
I am trying to use streams to run an H2D copy and a kernel in parallel.
To do this, I created 2 streams with cudaStreamNonBlocking flag.
In a loop I do the following pseudocode:
// pseudocode
cudaMemcpy(data[0]);
streamIdx = 0;
while(1)
{
//…

dmitryvolk
2 votes, 1 answer
In CUDA, is it guaranteed that the default stream equals nullptr?
In CUDA's driver_types.h we have:
typedef __device_builtin__ struct CUstream_st *cudaStream_t;
And in cuda_runtime.h we have, in many places, default-initialized stream parameters. For example:
template
static __inline__ __host__…

einpoklum
2 votes, 1 answer
Thrust execution policy issues kernel to default stream
I am currently designing a short tutorial exhibiting various aspects and capabilities of the Thrust template library.
Unfortunately, it seems that there is a problem in code that I have written in order to show how to use copy/compute concurrency…

Tobbey
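For context, Thrust algorithms can be directed to a specific stream by passing the `thrust::cuda::par.on(stream)` execution policy as the first argument. The sketch below shows that call shape only; `negateOnStream` is a hypothetical helper, and it does not reproduce the asker's tutorial code. Note that constructing a `thrust::device_vector` does not take a stream argument, so the initial fill here is not issued on `stream`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/functional.h>
#include <thrust/system/cuda/execution_policy.h>
#include <thrust/transform.h>

// Negates `n` integers with thrust::transform issued on a user-created stream.
bool negateOnStream(int n) {
    thrust::device_vector<int> in(n, 3), out(n, 0);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // The execution policy selects the stream the algorithm's kernels are launched on.
    thrust::transform(thrust::cuda::par.on(stream),
                      in.begin(), in.end(), out.begin(),
                      thrust::negate<int>());
    cudaStreamSynchronize(stream);

    // Element access copies a single value back to the host.
    bool ok = (out[0] == -3) && (out[n - 1] == -3);
    cudaStreamDestroy(stream);
    return ok;
}

int main() {
    std::printf("%s\n", negateOnStream(1 << 16) ? "ok" : "failed");
    return 0;
}
```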
2 votes, 1 answer
Concurrent, unique kernels on the same multiprocessor?
Is it possible, using streams, to have multiple unique kernels on the same streaming multiprocessor in Kepler 3.5 GPUs? I.e. run 30 kernels of size <<<1,1024>>> at the same time on a Kepler GPU with 15 SMs?

Jordan
2 votes, 1 answer
Cuda stream Data Independency
Can somebody explain the data independence requirement for concurrent CUDA streams?
Assume I want to run the following kernel in 8 concurrent streams
Kernel<<>>(float *readOnlyInput, float *output);
can all streams read the same…

thanasisanthopoulos
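The scenario described in this question can be sketched as below: several streams launch the same kernel, all reading one shared read-only input while each writes to its own disjoint output region. The kernel `scaleInto`, the helper `runSharedInput`, and the sizes are hypothetical. The input is uploaded with a synchronous `cudaMemcpy` before any launch, so every stream sees complete data; concurrent reads of the same buffer do not conflict, whereas overlapping writes would.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: reads a shared input, writes a private output.
__global__ void scaleInto(const float *readOnlyInput, float *output, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) output[i] = readOnlyInput[i] * factor;
}

bool runSharedInput(int numStreams, int n) {
    std::vector<float> hostIn(n, 1.0f);
    std::vector<float> hostOut(static_cast<size_t>(numStreams) * n, 0.0f);
    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, hostOut.size() * sizeof(float));
    // Synchronous copy: the input is fully written before any kernel is launched.
    cudaMemcpy(in, hostIn.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    std::vector<cudaStream_t> streams(numStreams);
    for (auto &s : streams) cudaStreamCreate(&s);

    for (int s = 0; s < numStreams; ++s) {
        // Same input pointer for every stream, disjoint output slice per stream.
        scaleInto<<<(n + 255) / 256, 256, 0, streams[s]>>>(
            in, out + static_cast<size_t>(s) * n, n, float(s + 1));
    }

    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    cudaMemcpy(hostOut.data(), out, hostOut.size() * sizeof(float), cudaMemcpyDeviceToHost);
    for (int s = 0; ok && s < numStreams; ++s)
        for (int i = 0; ok && i < n; ++i)
            ok = (hostOut[static_cast<size_t>(s) * n + i] == float(s + 1));

    for (auto &s : streams) cudaStreamDestroy(s);
    cudaFree(in);
    cudaFree(out);
    return ok;
}

int main() {
    std::printf("%s\n", runSharedInput(8, 1 << 16) ? "ok" : "failed");
    return 0;
}
```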
1 vote, 0 answers
Does a CUDA stream "become active" after execution of a scheduled host function concludes?
The CUDA documentation for scheduling the launch of a host function (cuLaunchHostFunc) says:
Completion of the function does not cause a stream to become active except as described above.
I couldn't quite figure out what's "described above". As…

einpoklum
1 vote, 1 answer
Is there a way to block and unblock a CUDA stream arbitrarily?
I need to pause the execution of all calls in a stream from a certain point in one part of the program until another part of the program decides to unpause this stream at an arbitrary time. This is the requirement of the application I'm working on,…

surabax
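One possible way to hold back work queued on a stream, offered here as a hedged sketch rather than as the accepted answer, is to enqueue a host function with `cudaLaunchHostFunc` that blocks until the application opens a gate. The names `waitForGate`, `setFlag`, and `runGatedStream` are hypothetical. Caveats under this approach: it delays only work queued after the host function and does not pause a kernel that is already running, the host function must not call CUDA APIs, and a blocked host function may delay other host functions.

```cuda
#include <cstdio>
#include <future>
#include <cuda_runtime.h>

// Hypothetical kernel: marks that it has run.
__global__ void setFlag(int *flag) { *flag = 1; }

// Host function executed in stream order; it blocks until the gate is opened.
void CUDART_CB waitForGate(void *userData) {
    static_cast<std::shared_future<void> *>(userData)->wait();
}

bool runGatedStream() {
    int *flag = nullptr;
    cudaMalloc(&flag, sizeof(int));
    cudaMemset(flag, 0, sizeof(int));
    cudaDeviceSynchronize();  // make sure the memset is done before using another stream

    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

    std::promise<void> gate;
    std::shared_future<void> gateFuture = gate.get_future().share();

    // "Pause": everything queued after this host function waits for the gate.
    cudaLaunchHostFunc(stream, waitForGate, &gateFuture);
    setFlag<<<1, 1, 0, stream>>>(flag);

    // While the gate is closed the stream still has pending work.
    bool pendingWhileClosed = (cudaStreamQuery(stream) == cudaErrorNotReady);

    // "Resume": open the gate from the host at any later time.
    gate.set_value();
    cudaStreamSynchronize(stream);

    int result = 0;
    cudaMemcpy(&result, flag, sizeof(int), cudaMemcpyDeviceToHost);

    cudaStreamDestroy(stream);
    cudaFree(flag);
    return pendingWhileClosed && result == 1;
}

int main() {
    std::printf("%s\n", runGatedStream() ? "ok" : "failed");
    return 0;
}
```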
1 vote, 0 answers
What are the new unique-id's for CUDA streams and contexts useful for?
CUDA 12 introduces two new API calls, cuStreamGetId() and cuCtxGetId() which return "unique ID"s associated with a stream or a context respectively. I'm struggling to understand why this is useful, or how this would be used. Are the handles for…

einpoklum
1 vote, 1 answer
Getting total execution time of all kernels on a CUDA stream
I know how to time the execution of one CUDA kernel using CUDA events, which is great for simple cases. But in the real world, an algorithm is often made up of a series of kernels (CUB::DeviceRadixSort algorithms, for example, launch many kernels to…

Baxissimo
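A common way to time a whole sequence of kernels on one stream is to record one event before the first launch and another after the last, both on that stream. The sketch below assumes a hypothetical kernel `busyKernel` and helper `timeKernelSequence`. The result is the elapsed time between the two events, so it includes any gaps between kernels and is not the sum of individual kernel durations.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for one step of a multi-kernel algorithm.
__global__ void busyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 1.0001f + 1.0f;
}

// Returns the elapsed milliseconds for `numKernels` launches on one stream, or -1 on failure.
float timeKernelSequence(int numKernels, int n) {
    float *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemset(dev, 0, n * sizeof(float));

    cudaStream_t stream;
    cudaEvent_t start, stop;
    cudaStreamCreate(&stream);
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);  // marks the point before the first kernel
    for (int k = 0; k < numKernels; ++k)
        busyKernel<<<(n + 255) / 256, 256, 0, stream>>>(dev, n);
    cudaEventRecord(stop, stream);   // marks the point after the last kernel

    float ms = -1.0f;
    if (cudaEventSynchronize(stop) == cudaSuccess)
        cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(stream);
    cudaFree(dev);
    return ms;
}

int main() {
    std::printf("elapsed: %f ms\n", timeKernelSequence(10, 1 << 20));
    return 0;
}
```

The same bracketing works when the kernels are launched inside a library call, provided the library is told to use the same stream.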
1 vote, 0 answers
Performance slowdown in Titan V in the presence of Dynamic Parallelism code
I am having the following performance problem with CUDA. When I run a simple sample code on a Titan V and Titan X card, the running times are fine as expected.
Titan X: 0.269299 ms
Titan V: 0.111766 ms
Now, when I add another kernel in the code,…