Questions tagged [cuda-streams]

CUDA streams are the hardware-supported queues on CUDA GPUs through which work (kernel launches, memory transfers, etc.) is scheduled.
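
As a minimal sketch of what "scheduling work through a stream" can look like with the CUDA runtime API: the kernel `scale` and the helper `scale_on_stream` below are hypothetical names made up for illustration and do not come from any question in this tag. The three operations are queued on one stream and run in the order they were issued.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: multiplies each element by a constant factor.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: queues copy-in, kernel and copy-out on one stream.
// Returns true if every CUDA call succeeded.
bool scale_on_stream(std::vector<float>& host, float factor) {
    const int n = static_cast<int>(host.size());
    if (n == 0) return true;  // nothing to do; avoids a zero-block launch
    const size_t bytes = n * sizeof(float);

    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) return false;

    float* dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) {
        cudaStreamDestroy(stream);
        return false;
    }

    // Operations issued to the same stream execute in issue order.
    // Note: host.data() is pageable memory, so these copies may not be
    // truly asynchronous with respect to the host; pinned memory
    // (cudaMallocHost) would be needed for that.
    cudaMemcpyAsync(dev, host.data(), bytes, cudaMemcpyHostToDevice, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(dev, n, factor);
    cudaError_t launch_status = cudaGetLastError();
    cudaMemcpyAsync(host.data(), dev, bytes, cudaMemcpyDeviceToHost, stream);

    // Block the host until everything queued on the stream has finished.
    cudaError_t sync_status = cudaStreamSynchronize(stream);

    cudaFree(dev);
    cudaStreamDestroy(stream);
    return launch_status == cudaSuccess && sync_status == cudaSuccess;
}
```

Calling `scale_on_stream(v, 2.0f)` on a `std::vector<float> v` should double each element in place, assuming a CUDA-capable device is available. Overlap between transfers and kernels would additionally require several streams and pinned host memory, which this sketch leaves out.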

78 questions
0
votes
1 answer

Using __constant__ memory with MPI and streams

If I have a __constant__ value, __constant__ float constVal; which may or may not be initialized by MPI ranks on non-blocking streams: cudaMemcpyToSymbolAsync((void*)&constVal,deviceValue,sizeof(float),0,cudaMemcpyDeviceToDevice,stream); Is…
Jacob Faib
  • 1,062
  • 7
  • 22
0
votes
1 answer

CUDA cudaMemcpyAsync using single stream to host

I have a single kernel which is filling data into two parameters (dev_out_1 and dev_out_2) using a single stream. I want to copy the data back from the device to the host in parallel. My requirement is to use a single stream and copy back to the host in…
Yona
  • 25
  • 2
  • 6
0
votes
1 answer

CUDA C++ overlapping SERIAL kernel execution and data transfer

So this guide here shows the general way to overlap kernel execution and data transfer. cudaStream_t streams[nStreams]; for (int i = 0; i < nStreams; ++i) { cudaStreamCreate(&streams[i]); int offset = ...; cudaMemcpyAsync(&d_a[offset],…
Duke Le
  • 332
  • 3
  • 14
0
votes
1 answer

Is it possible to manually set the SMs used for one CUDA stream?

By default, a kernel will use all available SMs of the device (if there are enough blocks). However, I now have 2 streams, one compute-intensive and one memory-intensive, and I want to limit the maximum number of SMs used by each of the 2 streams (after setting…
0
votes
1 answer

Why could OpenCV wait for a stream-ed CUDA operation instead of proceeding asynchronously?

I'm trying to perform some image dilation using OpenCV & CUDA. I invoke two calls to filter->apply(...) with a different filter object and on a different Mat, after each other, every time specifying a different stream to work with. They DO get…
BIOStheZerg
  • 396
  • 4
  • 19
0
votes
1 answer

Overlapping transfers and kernel executions in CUDA with two loops

I want to overlap data transfers and kernel executions in a form like this: int numStreams = 3; int size = 10; for(int i = 0; i < size; i++) { cuMemcpyHtoDAsync( _bufferIn1, _host_memoryIn1 ), …
Eagle06
  • 71
  • 1
  • 7
0
votes
1 answer

CUDA graph stream capture with thrust::reduce

When I try to capture stream execution to build a CUDA graph, a call to thrust::reduce causes a runtime error, cudaErrorStreamCaptureUnsupported: operation not permitted when stream is capturing. I have tried returning the reduction result to both…
Cos_ma
  • 75
  • 9
0
votes
1 answer

CUDA global atomic operations across concurrent kernel executions

My CUDA application performs an associative reduction over a volume. Essentially each thread computes values which are atomically added to overlapping locations of the same output buffer in global memory. Is it possible to concurrently launch this…
AnimatedRNG
  • 1,859
  • 3
  • 26
  • 39
0
votes
0 answers

Using cv::cuda::stream for asynchronous processing of images in opencv

I am using OpenCV 3.4 with the CUDA libraries to process video images. An image is grabbed and uploaded to the device using GpuMat::upload(). Afterward, the image is thresholded twice to create 2 different binary images (Th1 and Th2). My first question is…
Ali Nouri
  • 67
  • 7
0
votes
1 answer

Is cuStreamAddCallback as effective as cuStreamSynchronize in having latest bits of data on host?

The CUDA (driver API) documentation says: The start of execution of a callback has the same effect as synchronizing an event recorded in the same stream immediately prior to the callback. It thus synchronizes streams which have been "joined" …
huseyin tugrul buyukisik
  • 11,469
  • 4
  • 45
  • 97
0
votes
1 answer

Asynchronous behavior of CUDA events within a CUDA stream

This question is about the notion of a CUDA stream (Stream) and the apparent anomaly with CUDA events (Event) recorded on a stream. Consider the following code demonstrating this anomaly: cudaEventRecord(eventStart, stream1) kernel1<<<...,…
kesari
  • 536
  • 1
  • 6
  • 16
0
votes
1 answer

Enqueueing an async copy from a CUDA callback - not permitted?

This program: #include #include struct buffers_t { void* host_buffer; void* device_buffer; }; void ensure_no_error(std::string message) { auto status = cudaGetLastError(); if (status != cudaSuccess) { …
einpoklum
  • 118,144
  • 57
  • 340
  • 684
0
votes
1 answer

CUDA streams performance

I am currently learning CUDA streams through the computation of a dot product between two vectors. The ingredients are a kernel function that takes in vectors x and y and returns a vector result of size equal to the number of blocks, where each…
iNvId
  • 1
0
votes
1 answer

Kernel invoking delay on CUDA with Streams

I have written a scan algorithm for CUDA from scratch and am trying to use it for small amounts of data, less than 80,000 bytes. Two separate instances were created, where one runs the kernels using streams where possible and the other runs only in the…
BAdhi
  • 420
  • 7
  • 19
0
votes
1 answer

Multiple kernel calls in CUDA

I'm trying to call the same kernel on CUDA several times (with one different input parameter each time), but it executes only the first call and does not proceed to the other kernel calls. Assume the input arrays are new_value0=[123.814935276; 234; 100; 166;…
adry_b89
  • 43
  • 9