Motivation: I have been tasked with measuring the Karp-Flatt metric and parallel efficiency of my CUDA C code, which requires computation of speedup. In particular, I need to plot all these metrics as a function of the number of processors $p$.
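Definition: Speedup measures how much faster a parallel algorithm is than a corresponding sequential algorithm, and is defined as:

$$S_p = \frac{T_1}{T_p}$$

where $T_1$ is the runtime of the sequential algorithm and $T_p$ is the runtime of the parallel algorithm on $p$ processors.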
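For reference, here is a minimal sketch of how I would compute the three metrics once $T_1$, $T_p$ and $p$ are known. The function names are placeholders of my own, and the efficiency ($E_p = S_p / p$) and Karp-Flatt ($e = (1/S_p - 1/p) / (1 - 1/p)$) formulas are the usual textbook definitions, not anything specific to CUDA.

```c
#include <assert.h>

/* Speedup S_p = T_1 / T_p (both times in the same unit). */
double speedup(double t1, double tp) {
    assert(t1 > 0.0 && tp > 0.0);
    return t1 / tp;
}

/* Parallel efficiency E_p = S_p / p. */
double efficiency(double t1, double tp, int p) {
    assert(p >= 1);
    return speedup(t1, tp) / (double)p;
}

/* Karp-Flatt metric e = (1/S_p - 1/p) / (1 - 1/p); only defined for p > 1. */
double karp_flatt(double t1, double tp, int p) {
    assert(p > 1);
    double inv_s = 1.0 / speedup(t1, tp);
    double inv_p = 1.0 / (double)p;
    return (inv_s - inv_p) / (1.0 - inv_p);
}
```

With made-up timings of $T_1 = 10$ s and $T_p = 2.5$ s at $p = 8$, these give a speedup of 4, an efficiency of 0.5 and a Karp-Flatt value of about 0.143. The arithmetic is the easy part; the inputs $T_1$ and $p$ are what I cannot pin down.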
Issue: I have implemented my algorithm in CUDA C, and have timed it to get $T_p$. However, there remain some issues in determining $S_p$:
- How to observe $T_1$ without completely rewriting my code from scratch?
  - Can I execute CUDA code in serial?
- What is $p$ when I run different kernels with different numbers of threads?
  - Does it refer to no. of threads or no. of processors used throughout runtime?
  - Since both of these quantities will also vary throughout runtime, is it the maximum or the average used?
  - How do I even restrict my code to run on a subset of processors or with fewer threads?
Many thanks.