
I came across a problem that asks for the number of CPU cycles needed to execute a piece of code. The code is below:

for (int i = 0; i < 64; i++) {
    add(2, 3); // 1 cycle needed
    sub(5, 2); // 2 cycles needed
    mul(3, 4); // 4 cycles needed
}

If we execute this code on a uniprocessor, would the total simply be 64 x (1 + 2 + 4) = 448 cycles? How does that differ from executing it on a SIMD machine with 64 PEs, and how do we calculate the total execution time there?

  • I apologize for my ignorance (not familiar with C), but wouldn't calling a method require a lot more than 1-4 cycles? – Neil Mar 01 '16 at 11:38
  • It is not that easy. First, it depends on the CPU and architecture: if it has enough integer execution units, it might be able to start all three or even more instructions at once. Then you'll pay more initially for the instruction decoding until the micro op cache kicks in. Plus the branch in there might be predicted incorrectly initially. Long story short, how many cycles it will take depends heavily on a lot of factors. In any case though, there is no instruction that will finish in just one cycle on any modern CPU. – JustSid Mar 01 '16 at 11:39
  • Keep in mind that the CPU pipeline is quite long (and its length again depends on some factors, i.e. whether the micro op cache can help out, or what kind of instruction it is). You might get a throughput of up to 4 instructions per cycle, but the instructions themselves will still take, let's say, 14 cycles. – JustSid Mar 01 '16 at 11:42
  • And lastly, Agner Fog has some good free reference material about how long instructions take to execute depending on various factors. I'd suggest checking it out, including the manual on CPU architecture, as it covers the pipelines and execution units and how instructions can be paired to achieve higher throughput. – JustSid Mar 01 '16 at 11:44
  • @Neil: Calling a _method_ would possibly take longer than a _function_. **Iff** both would actually be called. Anyway, C does not support _methods_, so the point is even more irrelevant. As the term "CPU cycles" is in no way related to the C language, one can easily define one _function_ call to be one cycle. – too honest for this site Mar 01 '16 at 11:54
  • @JustSid: "there is no instruction that will finish in just one cycle on any modern CPU" is very wrong. Every instruction **finishes** in one cycle. It just does not execute in a single cycle, because high clock rates require pipelining. – too honest for this site Mar 01 '16 at 11:56
  • The issue here is not how many cycles are needed; the cycle counts are just arbitrary values. The real issue is the technique used for the calculation. Is the technique right? – Fahad Zulfiqar Mar 01 '16 at 11:56
  • @Olaf I was the first to admit that I was unfamiliar with C, so frankly I don't see the need to jab at my terminology. It is a legitimate question, and one that I still don't have the answer to. – Neil Mar 01 '16 at 11:59
  • @Olaf in that case finishes doesn't mean the final step in the retirement station, but rather every step from fetching to retirement. I probably should have used a better wording there, although I firmly believe your "finishes" should be called "retired" (since we are at the hair splitting of words) – JustSid Mar 01 '16 at 12:00
  • @FahadZulfiqar as I mentioned, just summing up numbers will not hold on any modern CPU (read: anything built in the past 2 to 3 decades). The result depends heavily on the context the instructions are executed in. The only way to measure is to profile the actual number of cycles used by the CPU; for that you can use the performance counters (although they too have some problems). – JustSid Mar 01 '16 at 12:02
  • @JustSid: The translations of "retired" to my native language which I know don't seem to fit. "finish" OTOH does. So, I'd stick with "finish". But maybe you can agree about "complete" as an alternative. – too honest for this site Mar 01 '16 at 12:04
  • @Neil I would assume in this case the author means inlined function calls that don't have any overhead. But otherwise, yes, a function call in C has a non-zero overhead (how much depends on the CPU. On x86 it's quite expensive since it's a branch and writing to the stack, but there are architectures where it's cheaper) – JustSid Mar 01 '16 at 12:06
  • @Olaf the final step of making the side effects visible, and in effect completing the instruction, is however called retiring and happens in the retirement unit, at least in Intel speak. See [this](http://stackoverflow.com/questions/22368835/what-does-intel-mean-by-retired) SO question and answer. – JustSid Mar 01 '16 at 12:09
  • @JustSid: I'm not used to Intel speak. Sorry, I might have missed the point at which it became a must-know. Note that for decades there have been far more non-x86 CPUs than x86 compatibles, so x86 is not the major architecture (I'm quite confident that is still the bad ol' 8051 and its derivatives). – too honest for this site Mar 01 '16 at 12:22

2 Answers


Without knowing more about the specific architecture it is impossible to answer this question.

Assuming that this is an academic kind of problem, I would answer:

"uni processor": 64 x (1 + 2 + 4 + loop_overhead) + loop_init = total_cycles
   loop_init: int i = 0 -> probably 1 cycle
   loop_overhead: i++; i < 64 -> probably 2 cycles

"SIMD": 1 + 2 + 4 = total_cycles
   (assuming each of the 64 PEs executes one iteration in parallel, so no loop is needed)
qRayner

The number of cycles will vary with CPU. So will the means of calculating them.

"CPU cycle" is rather loosely defined as "the time needed for one simple processor operation", with addition often considered as representative of the "simple processor operation". Sometimes "CPU cycle" is specified as the reciprocal of clock rate. The two definitions might often be close, but are not necessarily equivalent.

Even if you get over the ambiguity of what "CPU cycle" means, no code will simply do an addition (or a subtraction or a multiplication) in isolation. There will also be things like evaluating or fetching the values of operands, which may or may not be counted, and the actual duration of each operation (or instruction) varies with CPU.

And then there are CPU features like pipelining (so one operation may be commenced while a preceding one is only partially completed), which make per-instruction cycle counts meaningless on quite a few modern CPUs.

Peter