13

Moving a member variable to a local variable reduces the number of writes in this loop despite the presence of the __restrict keyword. This is using GCC -O3. Clang and MSVC optimise the writes in both cases. [Note that since this question was posted we observed that adding __restrict to the calling function caused GCC to also move the store out of the loop. See the godbolt link below and the comments]

class X
{
public:
    void process(float * __restrict d, int size)
    {
        for (int i = 0; i < size; ++i)
        {
            d[i] = v * c + d[i];
            v = d[i];
        }
    }

    void processFaster(float * __restrict d, int size)
    {
        float lv = v;
        for (int i = 0; i < size; ++i)
        {
            d[i] = lv * c + d[i];
            lv = d[i];
        }
        v = lv;
    }

    float c{0.0f};
    float v{0.0f};
};

With gcc -O3 the first one has an inner loop that looks like:

.L3:
  mulss xmm0, xmm1
  add rdi, 4
  addss xmm0, DWORD PTR [rdi-4]
  movss DWORD PTR [rdi-4], xmm0
  cmp rax, rdi
  movss DWORD PTR x[rip+4], xmm0        ;<<< the extra store
  jne .L3
.L1:
  rep ret

The second here:

.L8:
  mulss xmm0, xmm1
  add rdi, 4
  addss xmm0, DWORD PTR [rdi-4]
  movss DWORD PTR [rdi-4], xmm0
  cmp rdi, rax
  jne .L8
.L7:
  movss DWORD PTR x[rip+4], xmm0
  ret

See https://godbolt.org/g/a9nCP2 for the complete code.

Why does the compiler not perform the lv optimisation here?

I'm assuming the 3 memory accesses per loop are worse than the 2 (assuming size is not a small number), though I've not measured this yet.

Am I right to make that assumption?

I think the observable behaviour should be the same in both cases.

JCx
  • 2,689
  • 22
  • 32
  • Probably related to aliasing. Making the members (data and functions) of X global seems to help. – Marc Glisse Oct 23 '17 at 11:03
  • i.e. taking them out of the class? Adding the __restrict removed one of the memory reads but didn't remove this write..which is confusing. – JCx Oct 23 '17 at 11:05
  • Adding restrict on the argument of f_original also seems to help, maybe it gets lost during inlining. – Marc Glisse Oct 23 '17 at 11:09
  • Ok - mystery solved them. Also, clang seems to get the right answer: https://godbolt.org/g/79aEy8 with loop unrolling disabled for clarity. – JCx Oct 23 '17 at 11:12
  • Isn't `movss DWORD PTR x[rip+4], xmm0` redundant in anycase? I would assume it would be removed from both versions. – kabanus Oct 23 '17 at 11:12
  • Well it's needed at the end in case the function is called a second time... – JCx Oct 23 '17 at 11:18
  • As a side note, Clang (trunk) both lifts the store out of the loop and performs factor four loop unrolling. – Arne Vogel Oct 23 '17 at 14:12
  • Yes - I think Clang is on the money. And it looks like MSVC sorts it out okay too - maybe this is just a gcc problem. – JCx Oct 23 '17 at 14:31

3 Answers3

3

This seems to be caused by the missing __restrict qualifier on the f_original function. __restrict is a GCC extension; it is not quite clear how it is expected to behave in C++. Maybe it is a compiler bug (missed optimization) that it appears to disappear after inlining.

Florian Weimer
  • 32,022
  • 3
  • 48
  • 92
0

The two methods are not identical. In the first, the value of v is updated multiple times during the execution. That may be or may not be what you want, but it is not the same as the second method, so it is not something the compiler can decide for itself as a possible optimization.

vordhosbn
  • 333
  • 4
  • 16
  • 5
    But the only one reading the updated value of `v` in the meantime is the `process` function itself, so the compiler is free to put it into a register. Note that multithreading does not come into play, because in the absence of any synchronization primitives, there's no guarantee when updates to `v` become visible to other threads. – Thomas Oct 23 '17 at 10:54
  • And the compiler has already decided it doesn't need to re-read the variable every loop... it just writes it every loop which is what's even more mysterious to me – JCx Oct 23 '17 at 10:56
  • @Thomas: I'd put it like this: Due to the absence of ``volatile`` and thread synchronization, the two methods *are* equivalent according to the as-if rule. – Arne Vogel Oct 23 '17 at 14:11
0

The restrict keyword says there is no aliasing with anything else, in effect same as if the value had been local (and no local had any references to it).

In the second case there is no external visible effect of v so it doesn't need to store it.

In the first case there is a potential that some external might see it, the compiler doesn't at this time know that there will be no threads that could change it, but it knows that it doesn't have to read it as its neither atomic nor volatile. And the change of d[] another externally visible variable make the storing necessary.

If the compiler writers reasoning, well neither d nor v are volatile nor atomic so we can just do it all using 'as-if', then the compiler has to be sure no one can touch v at all. I'm pretty sure this will come in one of the new version as there is no synchronisation before the return and this will be the case in 99+% of all cases anyway. Programmers will then have to put either volatile or atomic on variables that are changed, which I think I could live with.

Surt
  • 15,501
  • 3
  • 23
  • 39