
I just found that an indirection costs around 3 times as much as a float multiplication!
Is that to be expected? Or is my test wrong?

Background

After I read How much does pointer indirection affect efficiency?, I became worried about the cost of indirection.

Going through a pointer indirection can be much slower because of how a modern CPU works.

Before I prematurely optimize my real code, I want to make sure that it really costs as much as I fear.

I used a trick to find the rough number (3x), as below:

Step 1

  • Test1 : No indirection -> calculate something
  • Test2 : Indirection -> calculate something (same)

I found that Test2 takes more time than Test1.
No surprise there.

Step 2

  • Test1 : No indirection -> calculate something expensive
  • Test2 : Indirection -> calculate something cheap

I changed the code in calculate something expensive little by little, making it more expensive until both tests cost about the same.

Result

Finally, I found that one possible pair of functions that makes both tests take the same amount of time (i.e. break even) is:

  • Test1 : No indirection -> return float*float*... 3 times
  • Test2 : Indirection -> simply return a float

Here is my test case (ideone demo) :-

#include <cstdlib>   // rand, RAND_MAX

class C{
    public: float hello;  
    public: float hello2s[10];  
    public: C(){
        hello=((double) rand() / (RAND_MAX))*10;
        for(int n=0;n<10;n++){
            hello2s[n]= ((double) rand() / (RAND_MAX))*10;
        }
    }
    public: float calculateCheap(){
        return hello;
    }
    public: float calculateExpensive(){
        float result=1;
        result=hello2s[0]*hello2s[1]*hello2s[2]*hello2s[3]*hello2s[4];
        return result;
    }
};

Here is the main :-

#include <chrono>
#include <iostream>

int main(){
    const int numTest=10000;
    C  d[numTest];
    C* e[numTest];
    for(int n=0;n<numTest;n++){
        d[n]=C();
        e[n]=new C();
    }
    float accu=0;
    auto t1= std::chrono::system_clock::now();
    for(int n=0;n<numTest;n++){
        accu+=d[n].calculateExpensive();  //direct call
    }
    auto t2= std::chrono::system_clock::now();
    for(int n=0;n<numTest;n++){
        accu+=e[n]->calculateCheap();     //indirect call
    }
    auto t3= std::chrono::system_clock::now();
    std::cout<<"direct call time ="<<(t2-t1).count()<<std::endl;
    std::cout<<"indirect call time ="<<(t3-t2).count()<<std::endl;
    std::cout<<"print to disable compiler cheat="<<accu<<std::endl;
}

The direct call time and indirect call time were tuned to be similar, as mentioned above (by editing calculateExpensive).

Conclusion

Indirection cost = 3x float multiplication.
On my desktop (Visual Studio 2015 with -O2), it is 7x.

Question

Is an indirection expected to cost around 3x as much as a float multiplication?
If not, how is my test wrong?

(Thanks to enhzflep for suggesting an improvement; the code has been edited accordingly.)

javaLover
  • 2
    One `public:` access specifier is enough. You are also leaking memory with that `new` operator. If you want to use C++ I suggest you forget everything you know about Java. –  May 25 '17 at 03:33
  • @Raw N Yes, sir! But I feel like (java) home this way. XD – javaLover May 25 '17 at 03:34
  • @javaLover - without seeing the code that the compiler generated, I wouldn't be so sure that you're actually comparing the cost of 13 multiplies with that of indirection. Most modern compilers are smart enough to avoid generating code that performs 13 multiply operations from the source you've provided. A better test would be to multiply together different variables, thus forcing the compiler to generate as many multiplies as your source-code contains. E.g: `a*b*c*d*e*f*g*h*i*j*k*l*m*n` – enhzflep May 25 '17 at 03:50
  • @enhzflep Oh, my bad. After I improved it, it becomes 3x-6x. (edited-add) Thank. In your opinion, is it a reasonable number now? – javaLover May 25 '17 at 04:01
  • 1
    It should also be noted that a good part of your cost will not be purely due to indirection but more likely partly due to memory fragmentation. Notice that you call `new C()` 100'000 times. This will create 100'000 instances of C scatter all over you memory. Allocating as an array (new C[numTest]) will likely produce entirely different results. Minor addition: initializing like this `C d[numTest] = {};` will call the constructor on every single element – HeroicKatora May 25 '17 at 04:04
  • Possible duplicate of [DRAM cache miss](https://stackoverflow.com/questions/29451066/dram-cache-miss) – autistic May 25 '17 at 04:04
  • It seems to me that you might believe that memory is all in the one hierarchy; you might think that what is in *the stack* is the same memory as that which is in *the heap*... That's commonly not the case. The answer to this question has nothing to do with *indirection*, and mostly everything to do with that hierarchy; ***stack* access is commonly cheaper than *heap* access because the *stack* is more likely to be in the cache!** If I'm right, you should lose that homogeneous-hierarchy belief, and the terms *stack* and *heap* (they're not necessary). – autistic May 25 '17 at 04:09
  • @javaLover I promoted my comment to an answer. Also fragmentation necessarily means more cache-miss since it makes the actual access pattern less linear and less predictable. Another note, multiple very small memory allocations will yield a higher memory usage than one large allocation. This is because there might be additional data stored at the beginning of each block, making access patterns and data locality even worse [see here](https://stackoverflow.com/a/1518718/3750062) – HeroicKatora May 25 '17 at 04:26
  • Your question is unclear. Which of the above functions does the indirect call? Which does the direct call?? What is the base case? I am unable to understand the output numbers in your links either – WhiZTiM May 25 '17 at 04:36
  • @WhiZTiM I finally make it clearer and more concise. If it is still unclear, please tell me. Thank. – javaLover May 25 '17 at 04:49
  • @javaLover When I run this on my own laptop and not in ideone, I get unstable results. Sometimes the indirect is faster. I then added a bunch of repetitions of the tests (http://ideone.com/2y5xhB) and indirection is *always* faster on my laptop. And on ideone I get the reverse. I think ideone may not be a good benchmarking platform. – Shalom Craimer May 25 '17 at 06:17
  • @scraimer That is valuable information. Thank. In my desktop, I get 7x result (7 multiplication). It is a jump-scare for me. – javaLover May 25 '17 at 06:24

3 Answers


Plainly put, your test is very non-representative and does not actually measure exactly what you might think it does.

Notice that you call new C() 100'000 times. This will create 100'000 instances of C scattered all over your memory, every single one of them very small. Modern hardware is very good at predicting your memory accesses if they are regular. Since every allocation, every call to new, happens independently of the others, the memory addresses will not be grouped well together, making this prediction harder. This leads to so-called cache misses.

Allocating as an array (new C[numTest]) will likely produce entirely different results, since the addresses are very predictable again in this case. Grouping your memory together as closely as possible and accessing it in a linear, predictable fashion will generally give much better performance. This is because most caches and address prefetchers expect exactly this pattern to occur in common programs.

Minor addition: initializing like this C d[numTest] = {}; will call the constructor on every single element.

HeroicKatora
  • 1
    I just test the assumption about memory fragmentation (avoid new()). The result indicates that performance lose may not due to `new()` 100000 times or memory fragmentation. (but may from cache-miss) http://ideone.com/p7QpcY (result=3-4x) ..... Yes, create it with `new C[numTest]` and access it *sequentially* does make the cost = 1 multiplication. – javaLover May 25 '17 at 04:26
  • @javaLover please see my answer on the other comment thread, this interpretation is not proof of any of those statements. Also be aware that the cost of new() might be **highly** system dependent. – HeroicKatora May 25 '17 at 04:29

There is not a simple answer to your question. It depends on the capabilities and features of your hardware (CPU, RAM, bus speeds, etc).

Back in the old days, floating point multiplies could take dozens if not hundreds of cycles. Memory accesses ran at speeds similar to the CPU frequency (think megahertz here), and a floating point multiply would take longer than an indirection.

Things have changed greatly since then. Modern hardware can perform floating point multiplies in just a cycle or two, while indirection (memory access) can take just a few cycles to hundreds, depending on where the data to be read is located. There can be several levels of cache. In extreme cases, the memory accessed via indirection has been swapped to disk and needs to be read back in. This would have a latency of thousands of cycles.

Generally, the overhead in fetching operands for floating point multiplies and decoding the instruction can take longer than the actual multiply.

1201ProgramAlarm

The cost of indirection is dominated by cache misses. Honestly, cache misses are so much more expensive than anything else discussed here that everything else ends up being rounding error.

Cache misses and indirection can be far more expensive than your test indicates.

This is mostly because you have a mere 100,000 elements, and a CPU cache can cache every one of those floats. The sequential heap allocations will tend to clump.

You'll get a pile of cache misses, but not one for every element.

Both of your cases are indirect. The "indirect" case has to follow two pointers, and the "direct" case has to do one instance of pointer arithmetic. The "expensive" case may be suitable for some SIMD, especially if you have relaxed floating point accuracy (allowing multiplication reordering).

As seen here, or in this image (not inline, I lack the rights), the number of main-memory references is going to dominate almost anything else in the above code. A 2 GHz CPU has a 0.5 ns cycle time, and a main-memory reference is 100 ns, or 200 cycles of latency.

Meanwhile, a desktop CPU can hit 8+ floating point operations per cycle if you can pull off vectorized code. That makes a floating point operation potentially 1600x cheaper than a single cache miss.

Indirection can cost you the ability to use vectorized instructions (an 8x slowdown), and even if everything is in cache it can still require L2 cache references (a 14x slowdown) more often than the alternative. But these slowdowns are small compared to the 200-cycle main-memory reference delay.

Note that not all CPUs have the same level of vectorization, that some effort is being put into speeding up CPU/main memory delay, that FPUs have different characteristics, and a myriad of other complications.

Yakk - Adam Nevraumont
  • Thanks, `The "indirect" case has to follow two pointers` enlightened me. Hmm, it is interesting that every good information table like this always lacks the cost of arithmetic operations. – javaLover May 26 '17 at 12:50
  • @java 1 cycle or less; with vectorization you can do many per cycle. How many exactly is tricky. Hyperthreading makes it tricky. TDP resulting in downclocking also makes it tricky. But "1 cycle" is a good spot to start. – Yakk - Adam Nevraumont May 26 '17 at 14:32