
As I see it, Process One and Process Two (below) are equivalent in that they take the same amount of time. Am I wrong?

allOfData_A= data_A1 + data_A2
allOfData_B= data_B1 + data_B2
allOfData_C= data_C1 + data_C2
data_C is the output of a kernel that operates on both data_A and data_B (like C=A+B).
The HW supports one DeviceOverlap (concurrent) operation.

Process One:

MemcpyAsync data_A1 stream1 H->D
MemcpyAsync data_A2 stream2 H->D
MemcpyAsync data_B1 stream1 H->D
MemcpyAsync data_B2 stream2 H->D
sameKernel stream1
sameKernel stream2
MemcpyAsync result_C1 stream1 D->H
MemcpyAsync result_C2 stream2 D->H

Process Two: (Same operation, different order)

MemcpyAsync data_A1 stream1 H->D
MemcpyAsync data_B1 stream1 H->D
sameKernel stream1
MemcpyAsync data_A2 stream2 H->D
MemcpyAsync data_B2 stream2 H->D
sameKernel stream2
MemcpyAsync result_C1 stream1 D->H
MemcpyAsync result_C2 stream2 D->H
einpoklum
Doug
  • The order in which they are issued is very important: arrange them so that while the first kernel is executing, the second chunk of data has already begun copying. – Soroosh Bateni Feb 12 '13 at 16:58
  • There are different answers to this question depending on what hardware you are using and what the nature of the kernels is. Please be more specific. – talonmies Feb 12 '13 at 17:10
  • It was intended to be a general question. Let's say it is Fermi hardware with one copy engine. (Is Kepler different?) What other specifics are you looking for? – Doug Feb 12 '13 at 17:33

1 Answer


Using CUDA streams allows the programmer to express work dependencies by putting dependent operations in the same stream. Work in different streams is independent and can be executed concurrently.

On GPUs without HyperQ (compute capability 1.0 to 3.0) you can get false dependencies because the work for a DMA engine or for computation gets put into a single hardware pipe. Compute capability 3.5 brings HyperQ which allows for multiple hardware pipes and there you shouldn't get the false dependencies. The simpleHyperQ example illustrates this, and the documentation shows diagrams to explain what is going on more clearly.
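To see which case applies to a given card, one option is to query the device properties at run time. The sketch below is only a heuristic built on the compute capability threshold mentioned above: `likely_has_hyperq` is a hypothetical helper, not a CUDA API, and the runtime does not report Hyper-Q directly. It also prints `asyncEngineCount`, the number of copy engines.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not part of CUDA): true when the compute capability
// is 3.5 or higher, the threshold given above for Hyper-Q.
static bool likely_has_hyperq(int major, int minor) {
    return major > 3 || (major == 3 && minor >= 5);
}

int main() {
    const int device = 0;  // assumption: inspect the first device
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }
    std::printf("%s: compute capability %d.%d\n",
                prop.name, prop.major, prop.minor);
    // Number of copy (DMA) engines; 1 matches the Fermi case in the comments.
    std::printf("copy engines: %d\n", prop.asyncEngineCount);
    std::printf("Hyper-Q likely (by compute capability): %s\n",
                likely_has_hyperq(prop.major, prop.minor) ? "yes" : "no");
    return 0;
}
```

If the last line prints "no", the issue order of the stream operations is likely to matter for overlap.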

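Putting it simply, on devices without HyperQ you would need to do a breadth-first launch of your work to get maximum concurrency, whereas for devices with HyperQ you can do a depth-first launch. In the question's terms, Process One is the breadth-first ordering and Process Two is closer to depth-first, so on hardware without HyperQ the two are not guaranteed to take the same time. Avoiding the false dependencies is pretty easy, but not having to worry about it is easier!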
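As a rough illustration of a breadth-first launch, here is a minimal sketch that mirrors Process One from the question with two streams. The names `addKernel` and `runBreadthFirst` are made up for this example, the kernel is a plain C = A + B standing in for `sameKernel`, and error handling is reduced to an early return. Whether the copies and kernels actually overlap depends on the hardware.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define CHECK(call) do { if ((call) != cudaSuccess) return false; } while (0)

// Element-wise C = A + B, standing in for "sameKernel" in the question.
__global__ void addKernel(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Hypothetical helper: splits the data into two chunks, one per stream, and
// issues each stage for both streams before moving on to the next stage.
// Returns true if every element of C matches A + B.
bool runBreadthFirst(int nPerStream) {
    const int kStreams = 2;
    const int total = kStreams * nPerStream;
    const size_t chunkBytes = nPerStream * sizeof(float);
    float *hA, *hB, *hC, *dA, *dB, *dC;

    // Pinned host memory, so the async copies can overlap with kernels.
    CHECK(cudaMallocHost((void**)&hA, kStreams * chunkBytes));
    CHECK(cudaMallocHost((void**)&hB, kStreams * chunkBytes));
    CHECK(cudaMallocHost((void**)&hC, kStreams * chunkBytes));
    CHECK(cudaMalloc((void**)&dA, kStreams * chunkBytes));
    CHECK(cudaMalloc((void**)&dB, kStreams * chunkBytes));
    CHECK(cudaMalloc((void**)&dC, kStreams * chunkBytes));
    for (int i = 0; i < total; ++i) { hA[i] = 1.0f * i; hB[i] = 2.0f * i; }

    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) CHECK(cudaStreamCreate(&streams[s]));

    // Stage 1: H->D copies of A for both streams, then of B for both streams.
    for (int s = 0; s < kStreams; ++s)
        CHECK(cudaMemcpyAsync(dA + s * nPerStream, hA + s * nPerStream,
                              chunkBytes, cudaMemcpyHostToDevice, streams[s]));
    for (int s = 0; s < kStreams; ++s)
        CHECK(cudaMemcpyAsync(dB + s * nPerStream, hB + s * nPerStream,
                              chunkBytes, cudaMemcpyHostToDevice, streams[s]));

    // Stage 2: one kernel launch per stream.
    const int threads = 256;
    const int blocks = (nPerStream + threads - 1) / threads;
    for (int s = 0; s < kStreams; ++s) {
        const int off = s * nPerStream;
        addKernel<<<blocks, threads, 0, streams[s]>>>(dA + off, dB + off,
                                                      dC + off, nPerStream);
    }
    CHECK(cudaGetLastError());

    // Stage 3: D->H copies of the results for both streams.
    for (int s = 0; s < kStreams; ++s)
        CHECK(cudaMemcpyAsync(hC + s * nPerStream, dC + s * nPerStream,
                              chunkBytes, cudaMemcpyDeviceToHost, streams[s]));
    CHECK(cudaDeviceSynchronize());

    bool ok = true;
    for (int i = 0; i < total; ++i)
        if (hC[i] != hA[i] + hB[i]) ok = false;

    for (int s = 0; s < kStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    cudaFreeHost(hA); cudaFreeHost(hB); cudaFreeHost(hC);
    return ok;
}

int main() {
    std::printf("%s\n", runBreadthFirst(1 << 20) ? "results match" : "failed");
    return 0;
}
```

A depth-first version would instead run all three stages for stream 0 inside one loop iteration before touching stream 1. Timing both variants with a profiler on the target card is the most reliable way to see whether the order matters there.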

Tom