Questions tagged [cuda-streams]

CUDA streams are hardware-supported queues on CUDA GPUs through which work (kernel launches, memory transfers, etc.) is scheduled.

78 questions
1
vote
2 answers

CUDA 4.0 RC - many host threads per one GPU - cudaStreamQuery and cudaStreamSynchronize behaviour

I wrote code which uses many host (OpenMP) threads per one GPU. Each thread has its own CUDA stream to order its requests. It looks very similar to the code below: #pragma omp parallel for num_threads(STREAM_NUMBER) for (int sid = 0; sid <…
kokosing
  • 5,251
  • 5
  • 37
  • 50
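The pattern described in the question above can be sketched as follows. This is a minimal, hypothetical reconstruction: the names `STREAM_NUMBER`, `work_kernel`, and `N` are illustrative, not from the original post.

```cuda
#include <cuda_runtime.h>
#include <omp.h>

#define STREAM_NUMBER 4
#define N (1 << 20)

__global__ void work_kernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    cudaStream_t streams[STREAM_NUMBER];
    float *dev[STREAM_NUMBER];
    for (int s = 0; s < STREAM_NUMBER; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&dev[s], N * sizeof(float));
    }
    // Each OpenMP host thread drives the GPU through its own stream.
    #pragma omp parallel for num_threads(STREAM_NUMBER)
    for (int sid = 0; sid < STREAM_NUMBER; ++sid) {
        work_kernel<<<(N + 255) / 256, 256, 0, streams[sid]>>>(dev[sid], N);
        // cudaStreamQuery returns cudaErrorNotReady while work is pending;
        // cudaStreamSynchronize blocks only the calling host thread.
        cudaStreamSynchronize(streams[sid]);
    }
    for (int s = 0; s < STREAM_NUMBER; ++s) {
        cudaFree(dev[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}
```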
1
vote
1 answer

How big is a cudaStream_t?

I have inherited some code that basically does stuff like this: void *stream; cudaStreamCreate((cudaStream_t *)&stream); Looking at targets/x86_64-linux/driver_types.h for CUDA 8, I see: typedef __device_builtin__ struct CUStream_st…
Ken Y-N
  • 14,644
  • 21
  • 71
  • 114
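The short answer to the question above: `cudaStream_t` is a typedef for `struct CUstream_st *`, an opaque handle, so it is pointer-sized, and the inherited `void *` cast round-trips safely. A minimal check:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // cudaStream_t is `struct CUstream_st *` -- an opaque pointer type,
    // so it is pointer-sized (8 bytes on x86_64).
    printf("sizeof(cudaStream_t) = %zu, sizeof(void*) = %zu\n",
           sizeof(cudaStream_t), sizeof(void *));

    void *stream = nullptr;
    cudaStreamCreate((cudaStream_t *)&stream);   // the pattern from the question
    cudaStreamDestroy((cudaStream_t)stream);
    return 0;
}
```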
1
vote
2 answers

Why do cudaMemcpyAsync and kernel launches block even with an asynchronous stream?

Consider the following program for enqueueing some work on a non-blocking GPU stream: #include using clock_value_t = long long; __device__ void gpu_sleep(clock_value_t sleep_cycles) { clock_value_t start = clock64(); …
einpoklum
  • 118,144
  • 57
  • 340
  • 684
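A key factor behind behavior like that described above: `cudaMemcpyAsync` can only return immediately when the host buffer is pinned (page-locked); with pageable memory the copy may be synchronous with respect to the host regardless of the stream's flags. A minimal sketch of the pinned-memory version:

```cuda
#include <cstring>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 24;
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

    float *h_pinned, *d_buf;
    // Pinned host memory is what lets cudaMemcpyAsync return immediately;
    // with pageable memory the call can block the host anyway.
    cudaMallocHost(&h_pinned, bytes);
    cudaMalloc(&d_buf, bytes);
    memset(h_pinned, 0, bytes);

    cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, stream);
    // The host is free to do other work here while the copy is in flight.
    cudaStreamSynchronize(stream);

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    cudaStreamDestroy(stream);
    return 0;
}
```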
1
vote
1 answer

CUDA streams are blocking despite Async

I'm working on a video stream in real time that I try to process with a GeForce GTX 960M. (Windows 10, VS 2013, CUDA 8.0) Each frame has to be captured, lightly blurred, and whenever I can, I need to do some hard-work calculations on the 10 latest…
Charlie Echo
  • 87
  • 1
  • 5
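One common cause of unexpected serialization in pipelines like the one above is using streams that synchronize with the legacy default stream. A hypothetical sketch (kernel names are illustrative) using `cudaStreamNonBlocking` so the per-frame work and the heavy batch work can overlap:

```cuda
#include <cuda_runtime.h>

__global__ void blur_frame() { /* light per-frame work */ }
__global__ void heavy_analysis() { /* occasional hard-work kernel */ }

int main() {
    // Streams created with cudaStreamNonBlocking do not synchronize with
    // the legacy default stream, so work issued to them is not serialized
    // behind incidental launches on stream 0.
    cudaStream_t frame_stream, batch_stream;
    cudaStreamCreateWithFlags(&frame_stream, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&batch_stream, cudaStreamNonBlocking);

    blur_frame<<<64, 256, 0, frame_stream>>>();
    heavy_analysis<<<64, 256, 0, batch_stream>>>();  // may run concurrently

    cudaDeviceSynchronize();
    cudaStreamDestroy(frame_stream);
    cudaStreamDestroy(batch_stream);
    return 0;
}
```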
1
vote
1 answer

CUDA FFT plan reuse across multiple 'overlapped' CUDA Stream launches

I'm trying to improve the performance of my code using asynchronous memory transfer overlapped with GPU computation. Formerly I had code where I created an FFT plan and then made use of it multiple times. In such a situation the time invested…
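One caveat relevant to the plan-reuse question above: a `cufftHandle` owns a single work area, so sharing one plan across concurrently executing streams is unsafe. A common pattern, sketched here with illustrative sizes, is one plan per stream, each bound with `cufftSetStream`:

```cuda
#include <cuda_runtime.h>
#include <cufft.h>

#define NSTREAMS 2
#define NX 1024
#define BATCH 16

int main() {
    cudaStream_t streams[NSTREAMS];
    cufftHandle plans[NSTREAMS];
    cufftComplex *data[NSTREAMS];

    for (int i = 0; i < NSTREAMS; ++i) {
        cudaStreamCreate(&streams[i]);
        cufftPlan1d(&plans[i], NX, CUFFT_C2C, BATCH);
        cufftSetStream(plans[i], streams[i]);  // execs now enqueue on streams[i]
        cudaMalloc(&data[i], sizeof(cufftComplex) * NX * BATCH);
    }
    // The transforms below can overlap with each other (and with copies
    // in other streams) because each plan targets its own stream.
    for (int i = 0; i < NSTREAMS; ++i)
        cufftExecC2C(plans[i], data[i], data[i], CUFFT_FORWARD);

    cudaDeviceSynchronize();
    for (int i = 0; i < NSTREAMS; ++i) {
        cufftDestroy(plans[i]);
        cudaFree(data[i]);
        cudaStreamDestroy(streams[i]);
    }
    return 0;
}
```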
1
vote
1 answer

The behavior of stream 0 (default) and other streams

In CUDA, how is stream 0 related to other streams? Does stream 0 (default stream) execute concurrently with other streams in a context or not? Considering the following example: cudaMemcpy(Dst, Src, sizeof(float)*datasize,…
user2188453
  • 1,105
  • 1
  • 12
  • 26
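The behavior asked about above can be illustrated briefly. By default, the legacy stream 0 synchronizes with all "blocking" streams (those created without special flags), while `cudaStreamNonBlocking` streams opt out; compiling with `nvcc --default-stream per-thread` changes these semantics. A hypothetical sketch:

```cuda
#include <cuda_runtime.h>

__global__ void k() { }

int main() {
    cudaStream_t blocking, non_blocking;
    cudaStreamCreate(&blocking);                   // default (blocking) flags
    cudaStreamCreateWithFlags(&non_blocking, cudaStreamNonBlocking);

    k<<<1, 1, 0, blocking>>>();      // (1)
    k<<<1, 1>>>();                   // (2) legacy stream 0: waits for (1)
    k<<<1, 1, 0, blocking>>>();      // (3) waits for (2) -- stream 0 acts
                                     //     as a barrier for blocking streams
    k<<<1, 1, 0, non_blocking>>>();  // free to overlap with stream 0 work

    cudaDeviceSynchronize();
    cudaStreamDestroy(blocking);
    cudaStreamDestroy(non_blocking);
    return 0;
}
```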
1
vote
1 answer

Global Memory and CUDA streams

I'm working on CUDA and I have a question about global memory and CUDA streams. Let: __device__ float Aux[32]; __global__ void kernel1(...) { [...] Aux[threadIdx.y] = 0; [...] } So, if I run this kernel on different GPU streams, is Aux the…
userCUDA
  • 11
  • 2
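The answer to the question above hinges on the fact that a `__device__` variable lives in global memory once per device (per CUDA context), not once per stream, so kernels launched in different streams all touch the same `Aux` and can race. A hypothetical sketch of the per-stream-slice fix:

```cuda
#include <cuda_runtime.h>

#define NSTREAMS 4

// One row per stream instead of a single shared Aux[32]; this avoids
// cross-stream races on the device global.
__device__ float Aux[NSTREAMS][32];

__global__ void kernel1(int slot) {
    Aux[slot][threadIdx.y] = 0.0f;   // each stream writes only its own row
}

int main() {
    cudaStream_t streams[NSTREAMS];
    for (int i = 0; i < NSTREAMS; ++i) {
        cudaStreamCreate(&streams[i]);
        kernel1<<<1, dim3(1, 32), 0, streams[i]>>>(i);
    }
    cudaDeviceSynchronize();
    for (int i = 0; i < NSTREAMS; ++i)
        cudaStreamDestroy(streams[i]);
    return 0;
}
```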
1
vote
1 answer

Stream scheduling order

The way I see it, Process One & Process Two (below) are equivalent in that they take the same amount of time. Am I wrong? allOfData_A= data_A1 + data_A2 allOfData_B= data_B1 + data_B2 allOFData_C= data_C1 + data_C2 Data_C is the output of the…
Doug
  • 2,783
  • 6
  • 33
  • 37
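Issue order can matter for questions like the one above: on devices with a single copy queue per direction, breadth-first issue (all uploads, then all kernels, then all downloads) often pipelines better than depth-first issue (copy, kernel, copy per stream before moving on). A hypothetical sketch of the breadth-first pattern, with illustrative names:

```cuda
#include <cuda_runtime.h>

__global__ void stage(float *d) { /* per-chunk computation */ }

int main() {
    const size_t bytes = 1 << 20;
    cudaStream_t s[2];
    float *h[2], *d[2];
    for (int i = 0; i < 2; ++i) {
        cudaStreamCreate(&s[i]);
        cudaMallocHost(&h[i], bytes);   // pinned, so copies can be async
        cudaMalloc(&d[i], bytes);
    }
    for (int i = 0; i < 2; ++i)   // breadth-first: all uploads first,
        cudaMemcpyAsync(d[i], h[i], bytes, cudaMemcpyHostToDevice, s[i]);
    for (int i = 0; i < 2; ++i)   // then all kernels,
        stage<<<256, 256, 0, s[i]>>>(d[i]);
    for (int i = 0; i < 2; ++i)   // then all downloads
        cudaMemcpyAsync(h[i], d[i], bytes, cudaMemcpyDeviceToHost, s[i]);
    cudaDeviceSynchronize();
    for (int i = 0; i < 2; ++i) {
        cudaFreeHost(h[i]);
        cudaFree(d[i]);
        cudaStreamDestroy(s[i]);
    }
    return 0;
}
```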
0
votes
0 answers

The cudaMemcpyAsync interaction with pageable host memory

I am beginning to learn CUDA programming. While learning about streams and the async/sync features, I have encountered some problems. As stated in the NVIDIA docs and many sources, cudaMemcpyAsync can be used to overlap data transfer…
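With a pageable source buffer, `cudaMemcpyAsync` stages the transfer through an internal pinned buffer and behaves synchronously with respect to the host. When the allocation already exists and cannot be replaced with `cudaMallocHost`, `cudaHostRegister` can pin it in place. A minimal sketch:

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 24;
    float *h_buf = (float *)malloc(bytes);   // pageable allocation
    float *d_buf;
    cudaMalloc(&d_buf, bytes);
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Pin the existing pageable buffer in place; after this, the
    // cudaMemcpyAsync below can be truly asynchronous w.r.t. the host.
    cudaHostRegister(h_buf, bytes, cudaHostRegisterDefault);
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaHostUnregister(h_buf);
    cudaFree(d_buf);
    free(h_buf);
    cudaStreamDestroy(stream);
    return 0;
}
```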
0
votes
0 answers

Gstreamer create custom CUDA plugin

I want to implement a custom plugin which processes only GPU frames (memory:CUDAMemory) and also updates the frame (consider creating an overlay on the video). $./gst-launch-1.0 videotestsrc ! cudaupload ! 'video/x-raw(memory:CUDAMemory)' !…
0
votes
1 answer

What does CU_MEMPOOL_ATTR_REUSE_ALLOW_OPPORTUNISTIC actually allow?

One of the attributes of CUDA memory pools is CU_MEMPOOL_ATTR_REUSE_ALLOW_OPPORTUNISTIC, described in the doxygen as follows: Allow reuse of already completed frees when there is no dependency between the free and allocation. If a free (a…
einpoklum
  • 118,144
  • 57
  • 340
  • 684
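For the attribute question above, a sketch using the runtime-API equivalent (`cudaMemPoolReuseAllowOpportunistic`) may help. With the attribute enabled, a stream-ordered allocation may reuse memory freed in another stream once the driver observes the free has actually completed, even without an event dependency linking the two streams:

```cuda
#include <cuda_runtime.h>

int main() {
    int device = 0;
    cudaMemPool_t pool;
    cudaDeviceGetDefaultMemPool(&pool, device);

    // Enable opportunistic reuse of completed frees across streams.
    int allow = 1;
    cudaMemPoolSetAttribute(pool, cudaMemPoolReuseAllowOpportunistic, &allow);

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    void *p1, *p2;
    cudaMallocAsync(&p1, 1 << 20, s1);
    cudaFreeAsync(p1, s1);
    // May opportunistically reuse p1's memory if the free in s1 has
    // already completed by the time this allocation is serviced.
    cudaMallocAsync(&p2, 1 << 20, s2);
    cudaFreeAsync(p2, s2);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}
```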
0
votes
1 answer

Is it possible to execute more than one CUDA graph's host execution node in different streams concurrently?

Investigating possible solutions for this problem, I thought about using CUDA graphs' host execution nodes (cudaGraphAddHostNode). I was hoping to have the option to block and unblock streams on the host side instead of the device side with the wait…
surabax
  • 15
  • 5
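For context on the graph host nodes mentioned above, here is a minimal, hedged sketch of `cudaGraphAddHostNode`. Note that host nodes run on a driver thread and, per the programming guide, must not call CUDA APIs themselves; the 5-argument `cudaGraphInstantiate` shown here is the pre-CUDA-12 signature.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Host function executed in stream order when the graph runs.
static void CUDART_CB host_fn(void *userData) {
    printf("host node ran: %s\n", (const char *)userData);
}

int main() {
    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);

    cudaHostNodeParams params = {};
    params.fn = host_fn;
    params.userData = (void *)"hello";

    cudaGraphNode_t node;
    cudaGraphAddHostNode(&node, graph, nullptr, 0, &params);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaGraphLaunch(exec, stream);    // host_fn executes in stream order
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    return 0;
}
```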
0
votes
1 answer

What is cuEventRecord guaranteed to do if it gets the default-stream's handle?

Suppose I call cuEventRecord(my_event_handle, 0). cuEventRecord() requires the stream and the event to belong to the same context. Now, one can interpret the 0 as "the default stream in the appropriate context" - the requirements are satisfied and…
einpoklum
  • 118,144
  • 57
  • 340
  • 684
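The driver-API call in question can be sketched as follows. Passing 0 as the `CUstream` argument means the default stream of the current context, so the event captures the work preceding it there:

```cuda
#include <cuda.h>

int main() {
    cuInit(0);
    CUdevice dev;
    CUcontext ctx;
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    CUevent my_event_handle;
    cuEventCreate(&my_event_handle, CU_EVENT_DEFAULT);
    // Signature is cuEventRecord(CUevent, CUstream); 0 here denotes the
    // default stream of the current (i.e. the event's) context.
    cuEventRecord(my_event_handle, 0);
    cuEventSynchronize(my_event_handle);

    cuEventDestroy(my_event_handle);
    cuCtxDestroy(ctx);
    return 0;
}
```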
0
votes
1 answer

How can I make sure two kernels in two streams are sent to the GPU at the same time to run?

I am a beginner in CUDA. I am using an NVIDIA GeForce GTX 1070, CUDA toolkit 11.3 and Ubuntu 18.04. As shown in the code below, I use two CPU threads to send two kernels in the form of two streams to a GPU. I want exactly these two kernels to be sent…
mehran
  • 191
  • 10
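One point worth noting for the question above: host-side launches are always sequential, so "at the same time" can only mean both kernels are eligible or resident on the GPU concurrently. Issuing both back-to-back from one thread into two non-blocking streams avoids OS thread-scheduling jitter between the launches. A hypothetical sketch:

```cuda
#include <cuda_runtime.h>

// Busy-waits on the device so the two launches have time to overlap.
__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main() {
    cudaStream_t s1, s2;
    cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&s2, cudaStreamNonBlocking);

    spin<<<1, 32, 0, s1>>>(100000000LL);
    spin<<<1, 32, 0, s2>>>(100000000LL);  // can overlap with the first
                                          // if SM resources allow

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}
```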
0
votes
1 answer

Reusing cudaEvent to serialize multiple streams

Suppose I have a struct: typedef enum {ON_CPU,ON_GPU,ON_BOTH} memLocation; typedef struct foo *foo; struct foo { cudaEvent_t event; float *deviceArray; float *hostArray; memLocation arrayLocation; }; a function: void…
Jacob Faib
  • 1,062
  • 7
  • 22
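The serialization idiom underlying the question above is record-then-wait: record an event in the producer stream, then make the consumer stream wait on it. A `cudaEvent_t` can be re-recorded, and each `cudaStreamWaitEvent` call snapshots the most recent record at the time the wait is issued, so record/wait pairs can be reused safely. A minimal sketch with illustrative kernel names:

```cuda
#include <cuda_runtime.h>

__global__ void produce(float *d) { /* fill buffer */ }
__global__ void consume(const float *d) { /* read buffer */ }

int main() {
    float *buf;
    cudaMalloc(&buf, 1024 * sizeof(float));
    cudaStream_t producer, consumer;
    cudaStreamCreate(&producer);
    cudaStreamCreate(&consumer);
    cudaEvent_t ready;
    cudaEventCreateWithFlags(&ready, cudaEventDisableTiming);

    for (int iter = 0; iter < 3; ++iter) {        // event reused each iteration
        produce<<<4, 256, 0, producer>>>(buf);
        cudaEventRecord(ready, producer);
        cudaStreamWaitEvent(consumer, ready, 0);  // consumer waits for producer
        consume<<<4, 256, 0, consumer>>>(buf);
    }
    cudaDeviceSynchronize();
    cudaEventDestroy(ready);
    cudaFree(buf);
    cudaStreamDestroy(producer);
    cudaStreamDestroy(consumer);
    return 0;
}
```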