
See the picture below, taken from Advanced Computer Architecture by Hwang, which discusses the scalability of performance in parallel processing.

[figure omitted: panels (a) workload growth curves, (b) efficiency curves, and (c) speedup models]

The questions are

1- Regarding figure (a), what are examples of theta (exponential) and alpha (constant)? Which workloads grow exponentially as the number of machines increases? Also, I haven't seen a constant workload when working with multiple cores/computers.

2- Regarding figure (b), why is the efficiency of exponential workloads the highest? I cannot understand that!

3- Regarding figure (c), what does the fixed-memory model mean? A fixed-memory workload sounds like alpha, which is labeled as the fixed-load model.

4- Regarding figure (c), what does the fixed-time model mean? The term "fixed" is misleading, I think. I interpret it as "constant". The text says that the fixed-time model is actually the linear curve in (a), i.e., gamma.

5- Regarding figure (c), why doesn't the exponential (memory-bound) model hit the communication bound?

The book's text describing the figure is shown below.

[screenshots of the book's text omitted]

I also have to say that I don't understand the last line: "Sometimes, even if minimum time is achieved with mere processors, the system utilization (or efficiency) may be very poor!!"

Can someone shed some light on that with examples?

  • What exactly do you mean by "workload" here? What's the vertical axis in those graphs? Is that how many tasks you run in parallel? Or is it threads for a single task that might not parallelize perfectly? (Are we optimizing for latency of one task, or throughput of lots of similar tasks?) – Peter Cordes Oct 13 '18 at 01:03
  • @peter-cordes: I think it is a single application that we are trying to run on multiple machines. For example, a 3D simulation where we increase the number of points may require exponentially more memory. However, I am not sure about that. – mahmood Oct 13 '18 at 05:51

1 Answer


Workload refers to the input size or problem size, which is basically the amount of data to be processed. Machine size is the number of processors. Efficiency is defined as speedup divided by the machine size. The efficiency metric is more meaningful than speedup (see footnote 1). To see this, consider for example a program that achieves a speedup of 2X on a parallel computer. This may sound impressive. But if I also told you that the parallel computer has 1000 processors, a 2X speedup is really terrible. Efficiency, on the other hand, captures both the speedup and the context in which it was achieved (the number of processors used). In this example, efficiency is equal to 2/1000 = 0.002. Note that efficiency ranges between 1 (best) and 1/N (worst). If I just tell you that the efficiency is 0.002, you'd immediately realize that it's terrible, even if I don't tell you the number of processors.
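As a quick illustrative sketch (the 2X-on-1000-processors numbers come from the example above; the second call is a made-up contrast case), the calculation looks like this:

    # Efficiency = speedup / machine size (number of processors).
    def efficiency(speedup, num_processors):
        return speedup / num_processors

    print(efficiency(2, 1000))    # 0.002 -- "2X" sounds fine until you see it took 1000 processors
    print(efficiency(900, 1000))  # 0.9   -- hypothetical near-ideal case for comparison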

Figure (a) shows different kinds of applications whose workloads can change in different ways to utilize a specific number of processors. That is, the applications scale differently. Generally, the reason you add more processors is to be able to exploit the increasing amount of parallelism available in larger workloads. The alpha line represents an application with a fixed-size workload, i.e., the amount of parallelism is fixed, so adding more processors will not give any additional speedup. If the speedup is fixed but N gets larger, then the efficiency decreases and its curve would look like that of 1/N. Such an application has zero scalability.

The other three curves represent applications that can maintain high efficiency with an increasing number of processors (i.e., they are scalable) by increasing the workload in some pattern. The gamma curve represents the ideal workload growth. This is defined as the growth that maintains high efficiency but in a realistic way. That is, it does not put too much pressure on other parts of the system such as memory, disk, inter-processor communication, or I/O. So scalability is achievable. Figure (b) shows the efficiency curve of gamma. The efficiency slightly deteriorates due to the overhead of higher parallelism and due to the serial part of the application, whose execution time does not change. This represents a perfectly scalable application: we can realistically make use of more processors by increasing the workload. The beta curve represents an application that is somewhat scalable, i.e., good speedups can be attained by increasing the workload, but the efficiency deteriorates a little faster.

The theta curve represents an application where very high efficiency can be achieved because there is so much data that can be processed in parallel. But that efficiency can only be achieved theoretically. That's because the workload has to grow exponentially, and realistically, all of that data cannot be efficiently handled by the memory system. So such an application is considered to be poorly scalable despite the theoretically very high efficiency.

Typically, applications with sub-linear workload growth end up being communication-bound when increasing the number of processors, while applications with super-linear workload growth end up being memory-bound. This is intuitive. Applications that process very large amounts of data (the theta curve) spend most of their time processing the data independently, with little communication. On the other hand, applications that process moderate amounts of data (the beta curve) tend to have more communication between the processors, where each processor uses a small amount of data to calculate something and then shares it with others for further processing. The alpha application is also communication-bound because if you use too many processors to process the fixed amount of data, the communication overhead will be too high, since each processor will operate on a tiny data set. The fixed-time model is called so because it scales very well (it takes about the same amount of time to process more data when more processors are available).

I also have to say that I don't understand the last line: "Sometimes, even if minimum time is achieved with mere processors, the system utilization (or efficiency) may be very poor!!"

How do you reach the minimum execution time? Increase the number of processors as long as the speedup is increasing. Once the speedup reaches a fixed value, you've reached the number of processors that achieves the minimum execution time. However, efficiency might be very poor if the speedup is small. This follows naturally from the efficiency formula. For example, suppose that an algorithm achieves a speedup of 3X on a 100-processor system and that increasing the number of processors further will not increase the speedup. Therefore, the minimum execution time is achieved with 100 processors. But the efficiency is merely 3/100 = 0.03.
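As a rough numerical sketch of that last example (the speedup curve below is an Amdahl-style model with a serial fraction of 1/3; that is my own illustrative assumption, not something from the book):

    # Toy speedup model: a serial fraction of 1/3 caps the speedup at 3X no matter
    # how many processors are used (illustrative assumption only).
    def speedup(p, serial_fraction=1/3):
        return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)

    best_p, best_s = 1, speedup(1)
    for p in range(2, 101):
        s = speedup(p)
        if s > best_s:  # keep adding processors only while the speedup still improves
            best_p, best_s = p, s

    print(best_p, round(best_s, 2), round(best_s / best_p, 3))
    # -> 100 2.94 0.029: the minimum time in this sweep needs all 100 processors,
    #    yet the efficiency at that point is only about 0.03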

Example: Parallel Binary Search

A serial binary search has an execution time equal to log2(N), where N is the number of elements in the array to be searched. This can be parallelized by partitioning the array into P partitions, where P is the number of processors. Each processor will then perform a serial binary search on its partition. At the end, all partial results can be combined in a serial fashion. So the execution time of the parallel search is (log2(N)/P) + (C*P). The latter term represents the overhead and the serial part that combines the partial results. It's linear in P, and C is just some constant. So the speedup is:

log2(N)/((log2(N)/P) + (C*P))

and the efficiency is just that divided by P. By how much should the workload (the size of the array) increase to maintain maximum efficiency (i.e., to make the speedup as close to P as possible)? Consider for example what happens when we increase the input size linearly with respect to P. That is:

N = K*P, where K is some constant. The speedup is then:

log2(K*P)/((log2(K*P)/P) + (C*P))

How does the speedup (or efficiency) change as P approaches infinity? Note that the numerator has a logarithmic term, but the denominator has a logarithmic term plus a polynomial of degree 1. The polynomial grows asymptotically much faster than the logarithm. In other words, the denominator grows much faster than the numerator, so the speedup (and hence the efficiency) approaches zero rapidly. It's clear that we can do better by increasing the workload at a faster rate. In particular, we have to increase it exponentially. Assume that the input size is of the form:

N = K^P, where K is some constant. The speedup is then:

log2(K^P)/((log2(K^P)/P) + (C*P))

= P*log2(K)/((P*log2(K)/P) + (C*P))

= P*log2(K)/(log2(K) + (C*P))

This is a little better now. Both the numerator and the denominator grow linearly, so the speedup is basically a constant. This is still bad because the efficiency would be that constant divided by P, which drops steeply as P increases (it would look like the alpha curve in Figure (b)). It should be clear now that the input size should be of the form:

N = K^(P^2), where K is some constant. The speedup is then:

log2(K^(P^2))/((log2(K^(P^2))/P) + (C*P))

= P^2*log2(K)/((P^2*log2(K)/P) + (C*P))

= P^2*log2(K)/((P*log2(K)) + (C*P))

= P^2*log2(K)/(P*(C + log2(K)))

= P*log2(K)/(C + log2(K))

Ideally, the term log2(K)/(C+log2(K)) should be one, but that's impossible since C is not zero. However, we can make it arbitrarily close to one by making K arbitrarily large. So K has to be very large compared to C. This makes the input size even larger, but does not change it asymptotically. Note that both of these constants have to be determined experimentally and they are specific to a particular implementation and platform. This is an example of the theta curve.
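To see the three growth strategies side by side, here is a small numerical sketch of the speedup/efficiency model derived above. The constants C = 0.1 and K = 1024 are arbitrary choices for illustration; in practice both would be measured:

    from math import log2

    C, K = 0.1, 1024  # arbitrary illustrative constants; in practice they are measured

    def efficiency(log2_n, p):
        # Speedup model from above: log2(N) / (log2(N)/P + C*P); efficiency = speedup / P.
        speedup = log2_n / (log2_n / p + C * p)
        return speedup / p

    print("P      N=K*P    N=K^P    N=K^(P^2)")
    for p in (10, 100, 1000):
        linear = efficiency(log2(K) + log2(p), p)  # log2(K*P)
        expo   = efficiency(p * log2(K), p)        # log2(K^P)
        quad   = efficiency(p * p * log2(K), p)    # log2(K^(P^2))
        print(f"{p:<6} {linear:.3f}    {expo:.3f}    {quad:.3f}")

With these constants, the efficiency of the linear growth collapses quickly, the K^P growth only delays the collapse (its speedup becomes roughly constant, so its efficiency eventually falls like 1/P), while the K^(P^2) growth settles at log2(K)/(C + log2(K)) ≈ 0.99, which is exactly the constant discussed above.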


(1) Recall that speedup = (execution time on a uniprocessor)/(execution time on N processors). The minimum speedup is 1 and the maximum speedup is N.

Hadi Brais
  • Thanks for the explanation. How do you classify matrix addition/multiplication, quick sort, binary search, graph coloring, and other classic applications? – mahmood Oct 13 '18 at 16:41
  • @mahmood: why the QuickSort algorithm specifically, as opposed to sorting in general? Merge sort is more obviously parallelizable, except for the final merge step. (many papers have been written about parallel sorting, since it's such a classic problem). Binary search doesn't have any obvious parallelism at all, unless you have multiple queries. Matrix addition parallelizes trivially: each chunk of the result depends only on the corresponding chunks of the input. You should expect a perfect speedup (efficiency = 1) for any problem size, except for dispatch overhead. – Peter Cordes Oct 13 '18 at 18:12
  • Parallel matmul can usually also scale efficiently to large numbers of processors, and runs well on GPUs, and even on clusters of GPUs in separate machines, using cache-blocking techniques to produce memory locality. – Peter Cordes Oct 13 '18 at 18:16
  • @mahmood You can determine this yourself. Just ask yourself, for a specific algorithm, how much should the workload increase to get maximum efficiency when the number of available processors increases from N to N+1? First, you have to define the algorithm precisely. – Hadi Brais Oct 13 '18 at 19:24
  • So, I was asking to map some examples to the alpha, theta, ... lines. I know that binary search won't be parallelized. Does that mean it is fixed load (the alpha line)? Also, GPUs are good for matrix operations. Do you classify them as the beta line?... – mahmood Oct 14 '18 at 09:24
  • @mahmood I've updated the answer to discuss binary search as an example. Matrix multiplication, quick sort, graph coloring and anything else can be analyzed similarly. – Hadi Brais Oct 14 '18 at 14:27
  • @HadiBrais: Thank you very much. I mark it as answer with vote. – mahmood Oct 14 '18 at 19:13
  • @HadiBrais: Do you have any note about this question https://stackoverflow.com/questions/52823403/quantitative-metrics-for-parallelism :) – mahmood Oct 19 '18 at 12:26