
I need to calculate how many FLOPs per transferred value a code has to perform so that running it on the GPU is worthwhile, i.e. actually improves performance.

Here are the FLOP rates and assumptions:

1. A PCIe x16 v3.0 bus can transfer data from CPU to GPU at 15.75 GB/s.

2. The GPU can perform 8 single-precision TFLOP/s.

3. The CPU can perform 400 single-precision GFLOP/s.

4. A single-precision floating-point number is 4 bytes.

5. Calculation can overlap with data transfers.

6. The data initially resides in CPU memory.

How would a problem like this be solved step by step?

Mert Şeker
  • In practical terms, maximum observed bandwidth for PCIe gen3 transfers is 11-12 GB/sec. Maximum rate is typically achieved for blocks of 16 MB or more; it will be lower for smaller transfer sizes due to fixed overhead. The good news is that PCIe is a full-duplex interconnect, so if you use a GPU with two DMA engines, you can do upstream and downstream transfers at that rate simultaneously. Are you sure you do not need to consider GPU memory bandwidth in your calculations? – njuffa Feb 25 '16 at 22:39
  • Well, GPU memory bandwidth was not provided, so I'm sure it's not necessary to consider it for the solution. The PCIe bus bandwidth being higher than the maximum observed bandwidth shouldn't be a problem either, since this is only a theoretical question and will not be used for anything practical. That being said, I'm still unsure how to solve it with the given data. – Mert Şeker Feb 25 '16 at 22:49
  • So it seems the question assumes a compute-bound task. There seems to be a missing metric: how many bytes/FLOP does the application consume? Once you have that, you can compute the total data volume, assuming CPU and GPU run at the maximum FLOP rates stated. You can then compute the amount of time needed for the PCIe transfers of that data, and for the computation by itself, on both CPU and GPU. BTW, I consider this question to be right on the border of being off-topic. – njuffa Feb 25 '16 at 22:54
  • You are right about the missing metric, and that is where I'm stuck. 15.75 GB/s is equivalent to 16 911 433 728 bytes/s, which is equivalent to 4 227 858 432 single-precision floating-point values/s. I think we assume that CPU and GPU run at the maximum FLOP rate, as you mentioned. But still, without knowing the bytes/FLOP, I can't see how this can be solved. – Mert Şeker Feb 25 '16 at 23:05
  • Stupid me. I noticed belatedly that the question *asks* for the bytes/FLOP ratio. So set up your equations, using the bytes/FLOP ratio as a *variable*, then *solve* for the value of that variable at which combined CPU/GPU performance equals the CPU performance, which is the cut-over point. So we have a math problem here, given in text form. I don't see how this is on-topic here. This isn't homework, by any chance? – njuffa Feb 25 '16 at 23:14
  • Yes, it was not on-topic, but I did not realise that at first glance. I apologize for being off-topic. Still, your explanations have been helpful and I was able to solve it with their help. Thank you nevertheless. – Mert Şeker Feb 26 '16 at 00:14
  • Don't forget about latency. If you need to branch on the result, the whole CPU->GPU->CPU pipeline won't be saturated. You say "Calculation can overlap with data transfers", so I guess you have considered this, and found it's not an issue for your case. – Peter Cordes Feb 26 '16 at 03:08

1 Answer


Interpreting assumption 5 to mean the CPU isn't slowed down in any way by transferring data to the GPU, there is obviously no reason not to use the GPU; you can only gain.

By not taking assumption 5 into account, the question gets more interesting. Assuming the CPU cannot compute while transferring data to the GPU, we arrive at this: I think you are looking for the computational intensity (=: ci), in FLOP/byte, at which it becomes beneficial to let the CPU halt its calculation to transfer data so the GPU can participate. Say you have d bytes of data to process with an algorithm of computational intensity ci. You split the data into d_cpu and d_gpu with d_cpu + d_gpu = d. It takes t_1 = d_gpu / (15.75 GB/s) to transfer the data. Then you let both compute for t_2, meaning t_2 = ci * d_gpu / (8 TFLOP/s) = ci * d_cpu / (400 GFLOP/s). The total time is t_3 = t_1 + t_2.

If the CPU does it all alone it needs t_4 = ci * d / (400 GFLOP/s).
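For concreteness, here is a minimal Python sketch of this timing model (the constants come from the question's assumptions; the function and variable names are my own):

    # Timing model sketch for the reasoning above.
    # ci is the computational intensity in FLOP/byte, d is the data size in bytes.

    BUS = 15.75e9   # PCIe CPU->GPU transfer rate, bytes/s
    GPU = 8.0e12    # GPU throughput, FLOP/s
    CPU = 400.0e9   # CPU throughput, FLOP/s

    def hybrid_time(d, ci):
        """t_3: transfer d_gpu to the GPU, then let CPU and GPU compute in parallel."""
        # Split the data so both devices finish at the same time (the t_2 condition):
        # ci * d_gpu / GPU == ci * d_cpu / CPU
        d_gpu = d * GPU / (GPU + CPU)
        d_cpu = d - d_gpu
        t_1 = d_gpu / BUS          # transfer time
        t_2 = ci * d_gpu / GPU     # compute time (equal on both devices by construction)
        return t_1 + t_2           # t_3

    def cpu_only_time(d, ci):
        """t_4: the CPU processes everything by itself."""
        return ci * d / CPU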

So the point where both options take the same time is at

t_3 = t_4
t_1 + t_2 = t_4
d_gpu / (15.75 GB/s) + ci * d_gpu / (8 TFLOP/s) = ci * (d_cpu + d_gpu) / (400 GFLOP/s)

with

d_gpu / (8 TFLOP/s) = d_cpu / (400 GFLOP/s)

Using the second relation, the ci * d_gpu / (8 TFLOP/s) term cancels on both sides, resulting in

ci = (400 GFLOP/s) / (15.75 GB/s) ~= 25.4 FLOP/byte

i.e. roughly 100 FLOP per transferred 4-byte single-precision value.
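A quick numerical check of that break-even point (a sketch; d drops out of the equation, so only the rates from the question appear):

    # Break-even computational intensity: where hybrid (CPU+GPU) time equals CPU-only time.
    # Substituting the work-split condition cancels the GPU compute term, leaving ci = CPU / BUS.

    BUS = 15.75e9   # PCIe CPU->GPU transfer rate, bytes/s
    CPU = 400.0e9   # CPU throughput, FLOP/s

    ci = CPU / BUS
    print(f"{ci:.1f} FLOP/byte")        # ~25.4 FLOP per transferred byte
    print(f"{4 * ci:.0f} FLOP/value")   # ~102 FLOP per 4-byte single-precision value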
Dominic Hofer
  • The question states that calculation *can* overlap with transfers. In real life, DMA from the GPU would compete for main memory bandwidth, so this only works if the CPU can work on some already-cached data while transfers happen. – Peter Cordes Jun 01 '16 at 15:21
  • True. But if the transfer can overlap with calculations, there would not be any reason not to use the GPU. – Dominic Hofer Jun 02 '16 at 06:40
  • Latency is one of the biggest reasons. And it does take some CPU time to set up the transfer and the calculations. Of course, I think the way this question is posed, the answer is that it's always worth it to use the GPU in parallel with the CPU under this oversimplified set of assumptions. – Peter Cordes Jun 02 '16 at 06:41