Questions tagged [cuda-streams]

CUDA streams are the hardware-supported queues on CUDA GPUs through which work (kernel launches, memory transfers, etc.) is scheduled.
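As a rough illustration of the tag description, the sketch below queues a host-to-device copy, a kernel launch, and a device-to-host copy on each of two streams. The kernel `scale`, the helper `run_two_stream_demo`, the `CHECK` macro, and the buffer sizes are all hypothetical names and values chosen for this example. Pinned host memory is used because asynchronous copies generally need it to overlap with other work; whether any overlap actually occurs depends on the device and driver.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: doubles each element in place.
__global__ void scale(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical helper macro: report a CUDA error and bail out.
// (For brevity, resources are not released on the error path.)
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s failed: %s\n", #call,             \
                    cudaGetErrorString(err_));                    \
            return false;                                         \
        }                                                         \
    } while (0)

// Queues copy-in, kernel, copy-out on each of two streams and
// verifies the result. Returns true if every element was doubled.
bool run_two_stream_demo() {
    const int kStreams = 2;
    const int n = 1 << 20;                  // elements per stream (assumed size)
    const size_t bytes = n * sizeof(float);

    float* host = nullptr;
    float* dev = nullptr;
    CHECK(cudaMallocHost(&host, kStreams * bytes));  // pinned host memory
    CHECK(cudaMalloc(&dev, kStreams * bytes));
    for (int i = 0; i < kStreams * n; ++i) host[i] = 1.0f;

    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) CHECK(cudaStreamCreate(&streams[s]));

    // Work within one stream runs in issue order; work in different
    // streams has no ordering guarantee and may overlap.
    for (int s = 0; s < kStreams; ++s) {
        float* h = host + s * n;
        float* d = dev + s * n;
        CHECK(cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, streams[s]));
        scale<<<(n + 255) / 256, 256, 0, streams[s]>>>(d, n);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, streams[s]));
    }

    // Wait for all queued work before reading results on the host.
    CHECK(cudaDeviceSynchronize());

    bool ok = true;
    for (int i = 0; i < kStreams * n; ++i) ok = ok && (host[i] == 2.0f);

    for (int s = 0; s < kStreams; ++s) CHECK(cudaStreamDestroy(streams[s]));
    CHECK(cudaFree(dev));
    CHECK(cudaFreeHost(host));
    return ok;
}

int main() {
    bool ok = run_two_stream_demo();
    printf("%s\n", ok ? "results verified" : "verification failed");
    return ok ? 0 : 1;
}
```

Each stream works on its own slice of the buffers, so the two queues have no data dependency on each other. A profiler timeline is the usual way to check whether the copies and kernels actually overlapped on a given GPU.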

78 questions
0
votes
1 answer

Overlapping Data Transfers in Maxwell (GPU Nvidia)

I'm a newbie in the forum and I hope that you will help me with my question. Recently, I've developed an application in which I've used CUDA streams with the aim of overlapping computation and data transfers. I've executed this application on a GPU…
0
votes
1 answer

CUDA - process a single pixel buffer data (array) on multiple simultaneous kernels, is it possible?

Currently I have one pixel buffer and I process the data in it with a single kernel call: dim3 threadsPerBlock(32, 32); dim3 blocks(screenWidth / threadsPerBlock.x, screenHeight / threadsPerBlock.y); kernel<<<blocks, threadsPerBlock>>>(); The pixel…
Geto
  • 191
  • 4
  • 15
0
votes
1 answer

How many cudaMemcpyAsync operations can be done concurrently?

Considering the following case: //thread 0 on device 0: cudaMemcpyAsync(Dst0, Src0, ..., stream0);//stream0 is on Device 0; ... //thread 1 on device 1: cudaMemcpyAsync(Dst1, Src1, ..., stream1);//stream1 is on Device 1; Can the two memcpy…
user2188453
  • 1,105
  • 1
  • 12
  • 26
0
votes
1 answer

Multiple CUDA streams crashing GPU

This is a continuation of this post. It seems as though a special case has been solved by adding volatile, but now something else has broken. If I add anything between the two kernel calls, the system reverts to the old behavior, namely freezing…
jrk0414
  • 144
  • 1
  • 1
  • 11
0
votes
1 answer

Reading updated memory from other CUDA stream

I am trying to set a flag in one kernel function and read it in another. Basically, I'm trying to do the following. #include #include
jrk0414
  • 144
  • 1
  • 1
  • 11
0
votes
2 answers

How can streams offer concurrent execution in CUDA?

In the CUDA documentation, it is mentioned that if we use 2 streams (stream0 and stream1) in this way: we copy data in stream0, then we launch the first kernel in stream0, then we retrieve data from the device in stream0, and then the same…
Sara Dev
  • 1
  • 1
0
votes
1 answer

How does the GK110's Hyper-Q enable concurrency of multiple streams?

If I want to benefit from Kepler GK110's Hyper-Q mechanism, i.e., to make two streams be put into two different hardware work queues to avoid some false dependencies, is it necessary for me to create the two streams with two CPU threads or the…
troore
  • 777
  • 1
  • 6
  • 15
0
votes
3 answers

CUDA Overlap Data is not working

Using streams to overlap data transfer with kernel execution is not working on my system. Hello, I want to overlap computation and data transfers in CUDA, but I can't. The NVIDIA help document says overlapping computation and data transfers is…
0
votes
2 answers

Cuda Stream Processing for multiple kernels Disambiguation

Hi a few questions regarding Cuda stream processing for multiple kernels. Assume s streams and a kernels in a 3.5 capable kepler device, where s <= 32. kernel uses a dev_input array of size n and a dev output array of size s*n. kernel reads data…
0
votes
1 answer

CUDA stream is slower than usual kernel

I am trying to understand CUDA streams and I have written my first program with streams, but it is slower than a regular kernel call... Why is this code slower? cudaMemcpyAsync(pole_dev, pole, size, cudaMemcpyHostToDevice, stream_1); …
Snurka Bill
  • 973
  • 4
  • 12
  • 29
0
votes
1 answer

Concurrent: Short copy, Long kernel

When running concurrent copy & kernel operations: If I have a kernel runTime that is twice as long as a dataCopy operation, will I get 2 copies per kernel run? The stream examples I'm seeing show a 1:1 relationship. (Time of copy = time of kernel…
Doug
  • 2,783
  • 6
  • 33
  • 37
0
votes
1 answer

Cuda, why I cannot use more than one streaming processor?

I implemented a RNS Montgomery exponentiation in Cuda. Everything nice everything fine. It runs on just one SM. BUT, so far I focus on parallelization of just a single exp. What I want to do now is test with several exp on fly. That is, I want that…
elect
  • 6,765
  • 10
  • 53
  • 119
-1
votes
1 answer

Wrong results using CUDA streams and memCpyAsync, become correct adding cudaDeviceSynchronize

I'm working on a CUDA matrix multiplication, but I made some modifications to observe how they affect performance. I want to observe the behavior and performance of a matrix multiplication kernel while making some changes. I'm measuring the changes in…
Maria Chiara
  • 103
  • 8
-1
votes
1 answer

Why operations in two CUDA Streams are not overlapping?

My program is a pipeline that contains multiple kernels and memcpys. Each task goes through the same pipeline with different input data. The host code first chooses a Channel, an encapsulation of scratchpad memory and CUDA objects, when it…
StrikeW
  • 501
  • 1
  • 4
  • 11
-1
votes
1 answer

Why am I not getting I/O-compute overlap with this code?

The following program: #include #include using clock_value_t = long long; __device__ void gpu_sleep(clock_value_t sleep_cycles) { clock_value_t start = clock64(); clock_value_t cycles_elapsed; do { cycles_elapsed =…
einpoklum
  • 118,144
  • 57
  • 340
  • 684