I am going to analyse and optimize some C code, and to do that I first have to check whether the functions I want to optimize are memory-bound or CPU-bound. In general I know how to do this, but I have some questions about counting floating point operations and about working out how much data is transferred. Consider the following loop nest, which I want to analyse. The values of the array are doubles (i.e. 8 bytes each):

for (int j = 0; j < N; j++) {
    for (int i = 1; i < Nt; i++) {
        matrix[j*Nt + i] = matrix[j*Nt + i - 1] * mu + matrix[j*Nt + i] * sigma;
    }
}

1) How many floating point operations do you count? I came up with 3*(Nt-1)*N... but do I also have to count the operations inside the array index (j*Nt+i), which would be 2 more operations per access?
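To make my counting explicit (this is just my own reasoning, assuming every `*` and `+` on a double is one FLOP and the index arithmetic is pure integer work), I would write it like this:

/* 2 multiplications (* mu, * sigma) + 1 addition per inner iteration = 3 FLOPs;
   the inner loop runs (Nt - 1) times and the outer loop runs N times. */
long long flops = 3LL * (long long)(Nt - 1) * N;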

2) How much data is transferred? 2*(Nt-1)*N*8 bytes or 3*(Nt-1)*N*8 bytes? I mean, every entry of the matrix has to be loaded, and after the calculation the new value is stored back to the same index of the array (that is 1 load and 1 store). But this value is also used in the next iteration. Is another load needed for that, or is this value (matrix[j*Nt+i-1]) already available without a load operation?
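Written out, the two candidate counts I am asking about (again my own numbers, assuming 8 bytes per double) are:

/* if matrix[j*Nt+i-1] needs its own load: 2 loads + 1 store per iteration */
long long bytes_3 = 3LL * (long long)(Nt - 1) * N * sizeof(double);
/* if the previous result stays in a register: 1 load + 1 store per iteration */
long long bytes_2 = 2LL * (long long)(Nt - 1) * N * sizeof(double);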

Thx a lot!!!

knacker123
  • If your first question asks whether or not to include the index calculations in the total of flops performed, the answer is no: index calculations are integer operations. – High Performance Mark May 09 '13 at 07:37
  • Thanks, you are right. I hadn't realized the difference between integer operations and FLOPs. But what about 2)? I thought about it last week: usually 2 loads and 1 store are needed per iteration. But in my opinion the stored value should still be in on-chip memory when it is used in the next loop iteration, so that no transfer from off-chip memory is needed. Am I right? – knacker123 May 14 '13 at 16:14

1 Answer


With this type of code, the direct sort of analysis you are proposing to do can be almost completely misleading. The only meaningful information about the performance of the code comes from actually measuring how fast it runs in practice (benchmarking).
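For example, a minimal timing harness along these lines (only a sketch; the problem size, the fill values and the use of POSIX clock_gettime are my own choices) tells you far more than any paper count:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    const int N = 1000, Nt = 1000;
    const double mu = 0.5, sigma = 0.5;
    double *matrix = malloc((size_t)N * Nt * sizeof *matrix);
    if (!matrix) return 1;
    for (int k = 0; k < N * Nt; k++) matrix[k] = 1.0;   /* touch the memory first */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int j = 0; j < N; j++)
        for (int i = 1; i < Nt; i++)
            matrix[j*Nt + i] = matrix[j*Nt + i - 1] * mu + matrix[j*Nt + i] * sigma;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double flops = 3.0 * (Nt - 1) * N;                  /* the "paper" FLOP count */
    printf("check: %g\n", matrix[(size_t)N * Nt - 1]);  /* keep the result live   */
    printf("time: %.6f s  ->  %.3f GFLOP/s\n", secs, flops / secs / 1e9);

    free(matrix);
    return 0;
}

Comparing the measured rate with the machine's peak FLOP rate and its memory bandwidth then tells you far more about whether the loop is compute- or memory-limited than the counts alone.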

This is because modern compilers and processors are very clever about optimizing code like this, and it will end up executing in a way which is nothing like your straightforward analysis. The compiler will optimize the code, rearranging the individual operations. The processor itself will try to execute the individual sub-operations in parallel and/or in a pipelined fashion, so that, for example, computation occurs while data is being fetched from memory.

It's useful to think about algorithmic complexity, to distinguish between O(n) and O(n²) and so on, but constant factors (like the 2*... versus 3*... you ask about) are largely moot because in practice they depend on lots of details.

Kevin Reid
  • I would disagree with the first paragraph. The fact that the roofline model (that the question seems to be about) is employed in many areas of HPC shows that it is still possible to get meaningful (though not exact) performance predictions from really simple theoretical considerations. – Hristo Iliev Sep 13 '14 at 15:40
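To illustrate the kind of roofline-style estimate mentioned in the comment above (my own back-of-the-envelope numbers, using 3 FLOPs per iteration and, in the best case, 1 load plus 1 store of 8 bytes each):

/* arithmetic intensity of the loop body, best case (previous value kept in a register) */
double flops_per_iter = 3.0;                           /* 2 multiplications + 1 addition */
double bytes_per_iter = 2.0 * 8.0;                     /* 1 load + 1 store of a double   */
double intensity = flops_per_iter / bytes_per_iter;    /* ~0.19 FLOP/byte                */
/* On most current CPUs the ridge point of the roofline (peak FLOP/s divided by memory
   bandwidth) is well above 1 FLOP/byte, so a loop like this would be predicted to be
   memory-bound rather than CPU-bound. */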