
This is how I solved the following question; I want to check whether my solution is correct.

A multiprocessor consists of 100 processors, each capable of a peak execution rate of 2 Gflops. What is the performance of the system as measured in Gflops when 2% of the code is sequential and 98% is parallelizable?

Solution: I think I'm right in thinking that 2% of the program runs at 2 GFLOPS and 98% runs at 200 GFLOPS, and that I can average these speeds to find the performance of the multiprocessor in GFLOPS:

(2/100)*2 + (98/100)*200 = 196.04 Gflops
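For what it is worth, here is the same arithmetic as a tiny Python check (the variable names are only illustrative; the 2 GFLOPS and 200 GFLOPS figures are the ones assumed above):

    # The weighting used above: fractions of the *code*, not of the execution time.
    serial_fraction = 0.02      # 2% of the code, assumed to run at 2 GFLOPS on one processor
    parallel_fraction = 0.98    # 98% of the code, assumed to run at 100 * 2 = 200 GFLOPS

    naive_average = serial_fraction * 2 + parallel_fraction * 200
    print(naive_average)        # 196.04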

Is my solution correct?

Mehmet

3 Answers


From my understanding, it is 2% of the program that is sequential, not 2% of the execution time. This means the sequential code takes a significant portion of the time: since there are a lot of processors, the parallel part is drastically accelerated.

With your method, a program with 50% sequential code and 1000 processors would run at (50/100)*2 + (50/100)*2000 = 1001 Gflops. This would mean that, on average, all processors are used at ~50% of their maximum capacity during the whole execution of the program, which is too good to be possible. Indeed, the parallel part of the program should be so fast that it takes only a tiny fraction of the execution time (<5%), while the sequential part takes almost all of it (>95%). Since the largest part of the execution time runs at 2 Gflops, the processors cannot be used at ~50% of their capacity!

Based on Amdahl's law, you can compute the actual speedup of this code:

S_latency = 1 / ((1 - p) + p/s), where S_latency is the speedup of the whole program, p is the portion of parallel code (0.98), and s is the number of processors (100). This gives S_latency ≈ 33.6. Since one processor runs at 2 Gflops and the whole program runs 33.6 times faster using many processors, the overall program runs at 33.6 * 2 = 67.2 Gflops.
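A minimal sketch of that computation in Python, assuming perfect scaling of the parallel part exactly as the formula does (the function name amdahl_speedup is only illustrative):

    def amdahl_speedup(p, s):
        """Amdahl's law: overall speedup for a parallel fraction p run on s processors."""
        return 1.0 / ((1.0 - p) + p / s)

    # The question's numbers: 98% parallel, 100 processors, 2 Gflops per processor.
    speedup = amdahl_speedup(p=0.98, s=100)
    print(speedup)        # ~33.6
    print(speedup * 2)    # ~67.2 Gflops for the whole program

    # The 50%-sequential / 1000-processor example above:
    print(amdahl_speedup(p=0.50, s=1000) * 2)   # ~4.0 Gflops, nowhere near 1001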

What Amdahl's law shows is that even a tiny sequential fraction of the execution time strongly impacts the scalability, and thus the performance, of parallel programs.

Jérôme Richard
  • Given that the question asked something else, could you explain what the reason was to introduce a theoretical (and wrong) result? You decided to use the old, overhead-naive Amdahl's Law formula, i.e. one that does NOT account for all the CPU instructions added to prepare and distribute parts of the work onto a set of CPUs, and that does not reflect the add-on costs added to the pure-[SERIAL] baseline code-execution time against which the speedup is benchmarked - with the effect of comparing apples (the instruction mix of serial code) to oranges (the very different and larger mix present in [CONCURRENT] or [PARALLEL] code)? Thx. – user3666197 Apr 10 '22 at 08:29
  • I do not understand your point. The OP question is a theoretical one. In practice, code is never truly parallel nor truly sequential, synchronizations matter a lot, and the concept of a serial part makes no sense on modern CPUs. However, this is typically the type of simplified question asked in parallelism courses so students can understand the basics of parallel computing (and in my country this is what we decided to teach to students). You seem to say that some overheads are not taken into account, but the question does not give enough information to consider them. Please clarify your point. – Jérôme Richard Apr 10 '22 at 10:41
  • My fault for not expressing the point better; comments are nano-scaled spaces for doing so. Perhaps there are other platforms to go into the needed details, as I agree that education is at the core of doing this right (or wrong, if actual comprehension is damaged by unsupported over-simplifications et al). My greatest surprise was the mixing of a synthetic hardware ceiling [FLOPS] with a real process's Amdahl speedup (the principal limit on an actual process's improvement when adding even infinitely many free processors/resources, while (why?) ignoring the add-on costs of doing so: https://www.desmos.com/calculator/zfrrlfeiji ) – user3666197 Apr 10 '22 at 12:28
  • Last but not least, the artificial "boosting" of marketing-mania FLOPS advertising skews the subject. Processors with AVX-512 uops (and the same "trick" misused for GPU/TPU SMXs) all over-promise how many TFLOPS & PFLOPS these toys can "produce" when used in HPC supercomputers, yet since these are often "benchmarked" on 8-bit "arithmetics" (which take less time due to optimised in-silicon tricks, or are MUXed, doing many FMULs in 512-bit AVX registers in a hardwired-ILP parallel fashion), one will never get "those same GFLOPS" for any real-world computing with 64/128-bit data. So not all GFLOPS are equal, are they? – user3666197 Apr 10 '22 at 12:39
  • The assumption of *(cit.) "...part is drastically accelerated."* builds on a wish, definitely not supported by any evidence or facts in the as-is problem definition above, that there are no other SYSTEM resources that will get traffic-jammed or blocked during that "drastically" increased concurrent use. For roughly the last 20 years, all CPUs have been so-called MEMORY-STARVED (so fast that most of the time they are just waiting, hungry for data to be fetched from memory), as the physical memory-I/O channels are the bottleneck even for serial code (even one and only one CPU gets starved), so what about x100? – user3666197 Apr 10 '22 at 12:57

Forgive me for starting light & anecdotally,
citing a meme from my beloved math professor;
later we will see why & how well it helps us here:

2 + 7 = 15 . . . , particularly so for higher values of 7

Let me start off by stating some definitions:

a) GFLOPS
is a unit that measures how many operations in FLO-ating point arithmetic, with no particular kind thereof specified (see remark 1), were performed P-er S-econd ~ FLOPS, here expressed for convenience in multiples of a billion ( G-iga ), i.e. the said GFLOPS

b) processor, multi-processor
is a device (or some composition of multiple such identical devices, expressed as a multi-p.), used to perform some kind of useful work - a processing

This pair of definitions is necessary before we can judge the question we were asked to solve.

The term (a) is a property of (b), irrespective of all other factors, provided we assume such a "device" is not some kind of polymorphic, self-modifying FPGA or an evolutionary, reflective, self-evolving amoeboid - which both processors & multi-processors prefer not to be, at least in our part of the Universe as we know it in 2022-Q2.

Once manufactured, each kind of processor(b) (be it a monolithic or a multi-processor device) has certain observable, repeatably measurable qualities of processing (doing work).

"A multiprocessor consists of 100 processors, each capable of a peak execution rate of 2Gflops. What is the performance of the system as measured in Gflops when 2% of the code is sequential and 98% is parallelizable"

A multiprocessor . . . (device)
        consists . . .          has a property of being composed of
             100 . . . (quantitative factor) ~ 100
      processors,. . . (device)
            each . . .          declaration of equality
         capable . . . having a property of
       of a peak . . .          peak (not having any higher)
       execution . . .          execution of work (process/code)
            rate . . .          being measured in time [1/s]
      of 2Gflops . . . (quantitative factor) ~ 2E+9 FLOPS

         What is . . . Questioning
 the PERFORMANCE . . . (property)   a term (not defined yet)
   of the SYSTEM . . . (system)     a term (not defined yet)
  as measured in . . . using some measure to evaluate a property of (system) in
          Gflops . . . (units of measure) to express such property in
            when . . . (proposition)
              2% . . . (quantitative factor) ~ 0.02 fraction of
     of the code . . . (subject-being-processed)
              is . . .          has a property of being
      sequential . . .          sequential, i.e. steps follow one-after-another
             and
             98% . . . (quantitative factor) ~ 0.98 fraction of (code)
( the same code)
              is . . .          has a property of being
  parallelizable . . .          possible to re-factor
                                            into some other form,
                                                      from a (sequential)
                                                      original form

( emphasis added )

Fact #1 )
the processor(b) ( a (device) ), of which the introduced multiprocessor ( a macro-(device) ) is internally composed, has a declared (granted) property of not being able to process more FLOPS than the said 2 GFLOPS.

This property does not say how many actual { INTOPS | FLOPS } it will perform at any particular moment in time.

This property does say that any device that was measured and labeled to indeed have X {M|G|P|E}FLOPS has that very same "glass ceiling" of never being able to perform a single instruction per second more - and the label still holds even when the device is doing nothing at all (chopping NOP-s), or even when it is switched off and powered down.

This property is a static supremum, an artificial (in relation to real-world workloads' instruction mixes), temperature-dependent constant (and it often degrades in-vivo, not only due to thermal throttling but for many other reasons inside real-world { processor + !processor }-composed SYSTEM ecosystems).

Fact #2 )
the problem, as visible to us here, gives no particular definition of what is or is not part of the said "SYSTEM". Is it just the (multi)processor? If so, why introduce a new, not yet defined term SYSTEM, when it would be a pure identity with the already defined & used term (multi)processor per se? Is it both the (multi)processor and memory or other peripherals? If so, why do we know literally nothing about such an important neighbourhood (a complement) of the said (multi)processor, without which a SYSTEM would not be The SYSTEM, but a mere part of it - the (multi)processor - which is NOT a SYSTEM without its (SYSTEM-defining and completing) neighbourhood?

Fact #3 )
the original Amdahl's Law, often dubbed The Law of Diminishing Returns (of extending the SYSTEM with more and more resources), speaks about a SYSTEM and its re-organised forms. It compares the same amount and composition of work, as performed on the original SYSTEM (with a pure-[SERIAL] flow of operations, one step after another after another), with an improved SYSTEM' (created by re-organising and extending the original SYSTEM with more resources of some kind, so that the new SYSTEM' operates more parts of the original work-to-be-done in an improved organisation of work, where more resources can & do perform parts of the work-to-be-done independently of one another ~ in a concurrent, some parts even in a truly parallel, fashion, using all degrees of parallelism the SYSTEM' resources can provide & sustain).

Given that no particular piece of information was presented about a SYSTEM, let alone about a SYSTEM', we have no right to use The Law of Diminishing Returns to address the problem as defined above. Having no facts does not give us the right to guesstimate, let alone to turn to feelings-based evidencing, if we strive to remain serious with ourselves, do we?

Given (a) and (b) above, the only claim we can fairly make, and which indeed holds true, is to say:

"From what has been defined so far,
we know that such a multiprocessor
will never work on more than 100 x 2 GFLOP per second of time."

There is zero other knowledge with which to claim a single bit more (and even then we still have to silently assume that the above-claimed peak FLOP-s have no side-effects and remain sustainable for at least one whole second (see remark 2) - otherwise even this claim becomes skewed).

An extended, stronger version :

"No matter what kind of code is actually being run,
for this, above specified multiprocessor, we cannot say more
than that such a multiprocessor will never work on more than 100 x 2 GFLOPS in any moment of time."


Remarks :

  1. see how often this is misused in the promotion of "Exaflops performance" by marketing people, when FMUL f8,f8 is claimed and "sold" to the public as if it "looked" equal to FMUL f512,f512 - which it by far does not, since the two are not measured with the same yardstick, are they?

  2. a similarly skewed argument (if not straight misinformation) has been repeated countless times in the (false) claim that the world's "largest" femtosecond laser could emit a light pulse carrying more power than XY Suns (a WOW moment!), without adding how long it took to pump up the energy for a single such femtosecond-long ( 1 [fs] ~ 1E-15 [s] ) "packet of a few photons" ... careful readers have already rejected this WOW-moment artificial stupidity, since it is not possible to hold such an astronomical amount of energy, the energy of XY Suns, on a tiny, energy-poor planet, let alone to carry it "over" a wire to that "superpower" laser.

halfer
user3666197
  • So, if I understand your answer correctly, your answer to the precise OP's question "*What is the performance of the system as measured in Gflops*" is basically that we cannot answer it (and we can only provide an upper bound of 100x2 Gflops), right? – Jérôme Richard Apr 10 '22 at 11:36
  • If we stay serious & keep respecting facts (& given that so many *(well, how euphemistic | actually all)* SYSTEM-related details are not specified), the cited theoretical upper bound of 2 GFLOPS, which a processor (device) can't ever overcome, is the only hard fact to use. The 2%:98% relative fractions are expressed, yet neither of those parts may ever meet a single actual FLOP instruction in the code throughout the whole experiment; attempting to "derive" multiples of those fractions times an in-silicon-defined upper bound on FLOPS (which need not ever take place) is anything but science, let alone computer science, is it? – user3666197 Apr 10 '22 at 12:10
  • Last but not least, if speaking of performance, details matter. RISC processors were originally introduced as processors defined such that each instruction takes the same amount of time (greatly simplifying the hardware microarchitecture and the compiler options for final machine-code micro-scheduling). Not so CISC or VLIW processors, where many sophisticated in-silicon details help superscale many-pipelined (beforehand fetched-decoded-(speculatively)-rearranged) micro-sequences of CPU uops, at the cost of paying some extra latency if a speculative "prediction" fails and such a sequence is never used. – user3666197 Apr 10 '22 at 12:48

If 2% is the run-time percentage of the serial part, then you cannot surpass a 50x speedup. This means you cannot surpass 50x the GFLOPS of the serial version.

If the unoptimized program ran fully serial at 2 GFLOPS, then the optimized version with perfect scaling compresses 98% of the runtime into 0.98%.

2% plus 0.98% gives ~3% as the new total run time. This means the program spends 2/3 of the time in the serial part and only 1/3 in the parallelized part. If the parallel part runs at 200 GFLOPS, then you have to average it over the whole 3/3 of the time: 200 GFLOPS for 1 microsecond and 2 GFLOPS for 2 microseconds.

This averages to roughly 67 GFLOPS. If there is a single-core turbo to boost the serial part, then a 20% turbo boost over the 2/3 of the time spent in the serial part shaves roughly 11% off the total run time (2/3 x (1 - 1/1.2)), hence a roughly 12% higher average GFLOPS. Turbo core frequency is important even if it only boosts a single core.
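A small Python sketch of that time-weighted average, without any turbo boost (variable names are illustrative; the 2 GFLOPS and 200 GFLOPS figures come from the question):

    work = 1.0                          # total floating-point work, normalised to 1 Gflop
    serial_time = 0.02 * work / 2       # 2% of the work done at 2 GFLOPS
    parallel_time = 0.98 * work / 200   # 98% of the work done at 200 GFLOPS
    total_time = serial_time + parallel_time

    print(serial_time / total_time)     # ~0.67 -> the serial part dominates the run time
    print(work / total_time)            # ~67   -> average GFLOPS over the whole run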

huseyin tugrul buyukisik