
Motivation: I have been tasked with measuring the Karp-Flatt metric and parallel efficiency of my CUDA C code, which requires computation of speedup. In particular, I need to plot all these metrics as a function of the number of processors p.

Definition: Speedup refers to how much faster a parallel algorithm is than the corresponding sequential algorithm, and is defined as:

Sp = T1 / Tp

where T1 is the runtime of the sequential algorithm and Tp is the runtime of the parallel algorithm on p processors.

Issue: I have implemented my algorithm in CUDA C and have timed it to get Tp. However, some issues remain in determining Sp:

  • How can I observe T1 without completely rewriting my code from scratch?
    • Can I execute CUDA code serially?
  • What is p when I run different kernels with different numbers of threads?
    • Does it refer to the number of threads or the number of processors used throughout the run?
    • Since both of these quantities vary throughout runtime, is it the maximum or the average that counts?
    • How do I even restrict my code to run on a subset of processors, or with fewer threads?

Many thanks.

mchen
  • 9,808
  • 17
  • 72
  • 125
  • If I remember correctly (coming from an OpenCL background), if you set the number of kernels to one, wouldn't that be the same as running the program in serial? – Chase Walden Jan 15 '13 at 20:46
  • @ChaseWalden You still use several cores on the GPU assuming block dim > 1; the only way is to launch one kernel with a single thread, but that comparison is fairly meaningless since GPU and CPU architectures are too different. It would make more sense to implement a CPU-bound algorithm and compare. – 1-----1 Jan 15 '13 at 20:50
  • @ks6g10 so if I am understanding you correctly, you want to calculate the speedup from a program run in serial on the CPU to the program run on multiple kernels on the GPU? – Chase Walden Jan 15 '13 at 20:55
  • @ChaseWalden For me that seems to be the comparison you want to try: if the CPU is faster by a notable amount (e.g. 2x), why then do it on the GPU? At least that is what I do for my research. – 1-----1 Jan 15 '13 at 21:02
  • @ks6g10 I'm not fully sure what you are wanting to do. – Chase Walden Jan 15 '13 at 21:06
  • @ChaseWalden You said he should just run one kernel; I specified it should be one kernel of one thread if he did that. But then I suggested that he implement a CPU-bound algorithm instead to measure performance, due to the platforms being so different. – 1-----1 Jan 15 '13 at 21:12

2 Answers


To get a reasonable measure of speedup, you need the actual sequential program. If you don't have one, you need to write the best sequential version you can, because comparing a highly tuned parallel code to a junk serial implementation is unreasonable.

Nor can you reasonably compare a 1-processor version of your parallel program to the N-processor version to get a true measure of speedup. Such a comparison tells you the speedup of going from P=1 to P=N for the same program, but the point of the speedup curves is to show why building a parallel program (which is usually harder and requires more complicated hardware [GPU] and tools [OpenCL]) makes sense compared to coding the best sequential version using more widely available hardware and tools.

In other words, no cheating.

Ira Baxter
  • 93,541
  • 22
  • 172
  • 341
  • 1
    Good answer, but do you think that say, ranking the best CPU vs the best GPU makes sense? Or should it be so that you would also consider the hardware cost? – 1-----1 Jan 15 '13 at 21:16
  • You say "such a comparison tells you speedup from going from P=1 to P=N for the same program" as if this is missing the main point of computing speedup - isn't this the *entire* point of measuring speedup? – mchen Jan 15 '13 at 21:18
  • Moreover, with respect to GPGPU, does the *`p`* used in the definition of speedup refer to no. of threads or no. of processors? Indeed, since both of these quantities will also vary throughout runtime, is it the maximum or the average used? – mchen Jan 15 '13 at 21:22
  • 1
    @MiloChen I think he is suggesting that you have the capability of running multi threaded code which in most cases would be beneficial(and could be faster than the GPU), and being lazy and not evaluating the possibility is wrong. – 1-----1 Jan 15 '13 at 21:24
  • @ks6g10: Well, a "multithreaded" code is another kind of parallel program. To get a true measure of parallel speedup *over a sequential program*, he has to compare to a sequential program. He can do another comparison to a multithreaded, non-GPU application to indicate the payoff of going that way. But ultimately speedup is judged against what is straightforward to write with standard resources and effort, versus what you can get with the extra resources and effort. If it is trivial to code a multithreaded app (not always so for C and C++), then that might make an interesting comparison. – Ira Baxter Jan 15 '13 at 21:55
  • This answer is completely correct in that you cannot simply run your parallel algorithm on one thread. You must write an optimized serial algorithm and compare against that. – Joe Jan 19 '13 at 22:43

When measuring speedup you must in most cases completely write both the serial and the parallel algorithms from scratch. There is no particular reason that the best parallel algorithm with P=1 has anything in common with the optimal serial algorithm. In most cases the parallel algorithm will have to do lots of extra work and is quite inefficient compared to an optimal serial algorithm.

Joe
  • 2,008
  • 1
  • 17
  • 25