CUDA streams are the hardware-supported queues on CUDA GPUs through which work (kernel launches, memory transfers, etc.) is scheduled.
Questions tagged [cuda-streams]
78 questions
0 votes, 1 answer
Using __constant__ memory with MPI and streams
If I have a __constant__ value
__constant__ float constVal;
which may or may not be initialized by MPI ranks on non-blocking streams:
cudaMemcpyToSymbolAsync((void*)&constVal,deviceValue,sizeof(float),0,cudaMemcpyDeviceToDevice,stream);
Is…

Jacob Faib
0 votes, 1 answer
CUDA cudaMemcpyAsync using single stream to host
I have a single kernel which is feeding data to two parameters (dev_out_1 and dev_out_2) using a single stream. I want to copy the data back from the device to the host in parallel.
My requirement is to use a single stream and copy back to the host in…

Yona
0 votes, 1 answer
CUDA C++ overlapping SERIAL kernel execution and data transfer
This guide shows the general way to overlap kernel execution and data transfer:
cudaStream_t streams[nStreams];
for (int i = 0; i < nStreams; ++i) {
    cudaStreamCreate(&streams[i]);
    int offset = ...;
    cudaMemcpyAsync(&d_a[offset],…

Duke Le
0 votes, 1 answer
Is it possible to manually set the SMs used for one CUDA stream?
By default, a kernel will use all available SMs of the device (if it has enough blocks). However, I now have 2 streams, one compute-intensive and one memory-intensive, and I want to limit the maximum number of SMs used by each of the 2 streams (after setting…

Subject_No_i
0 votes, 1 answer
Why could OpenCV wait for a stream-ed CUDA operation instead of proceeding asynchronously?
I'm trying to perform some image dilation using OpenCV & CUDA. I make two calls to filter->apply(...), one after the other, each with a different filter object and on a different Mat, and each time specifying a different stream to work with. They DO get…

BIOStheZerg
0 votes, 1 answer
Overlapping transfers and kernel executions in CUDA with two loops
I want to overlap data transfers and kernel executions in a form like this:
int numStreams = 3;
int size = 10;
for(int i = 0; i < size; i++) {
    cuMemcpyHtoDAsync( _bufferIn1,
                       _host_memoryIn1 ),
    …

Eagle06
0 votes, 1 answer
CUDA graph stream capture with thrust::reduce
When I try to capture stream execution to build a CUDA graph, a call to thrust::reduce causes the runtime error cudaErrorStreamCaptureUnsupported: operation not permitted when stream is capturing. I have tried returning the reduction result to both…

Cos_ma
0 votes, 1 answer
CUDA global atomic operations across concurrent kernel executions
My CUDA application performs an associative reduction over a volume. Essentially each thread computes values which are atomically added to overlapping locations of the same output buffer in global memory.
Is it possible to concurrently launch this…

AnimatedRNG
0 votes, 0 answers
Using cv::cuda::stream for asynchronous processing of images in opencv
I am using OpenCV 3.4 with the CUDA libraries to process video images. Each image is grabbed and uploaded to the device using GpuMat::upload(). Afterward, the image is thresholded twice to create 2 different binary images (Th1 and Th2). My first question is…

Ali Nouri
0 votes, 1 answer
Is cuStreamAddCallback as effective as cuStreamSynchronize in having latest bits of data on host?
The CUDA (driver API) documentation says:
The start of execution of a callback has the same effect as
synchronizing an event recorded in the same stream immediately prior
to the callback. It thus synchronizes streams which have been "joined"
…

huseyin tugrul buyukisik
0 votes, 1 answer
Asynchronous behavior of CUDA events within a CUDA stream
This question is about the notion of a CUDA stream (Stream) and an apparent anomaly with CUDA events (Event) recorded on a stream.
Consider the following code demonstrating this anomaly:
cudaEventRecord(eventStart, stream1)
kernel1<<<...,…

kesari
0 votes, 1 answer
Enqueueing an async copy from a CUDA callback - not permitted?
This program:
#include
#include
struct buffers_t {
    void* host_buffer;
    void* device_buffer;
};
void ensure_no_error(std::string message) {
    auto status = cudaGetLastError();
    if (status != cudaSuccess) {
        …

einpoklum
0 votes, 1 answer
CUDA streams performance
I am currently learning CUDA streams through the computation of a dot product between two vectors. The ingredients are a kernel function that takes in vectors x and y and returns a vector result of size equal to the number of blocks, where each…

iNvId
0 votes, 1 answer
Kernel invoking delay on CUDA with Streams
I have implemented the scan algorithm for CUDA from scratch and am trying to use it for small amounts of data (less than 80,000 bytes).
I created two separate instances: one runs the kernels using streams where possible, and the other runs only in the…

BAdhi
0 votes, 1 answer
Multiple kernel calls in CUDA
I'm trying to call the same kernel in CUDA several times (with one input parameter changed each time), but only the first call executes and the other kernel calls do not follow.
Assume the input arrays are
new_value0=[123.814935276; 234; 100; 166;…

adry_b89