
I recently discovered that using placement new is faster than doing 16 assignments.
Consider the following piece of code (C++11):

#include <new> // for placement new

class Matrix
{
public:
    double data[16];

    Matrix() : data{ 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1 }
    {
    }

    void Identity1()
    {
        // Re-runs the constructor on top of *this via placement new.
        new (this) Matrix();
    }

    void Identity2()
    {
        data[0]  = 1.0; data[1]  = 0.0; data[2]  = 0.0; data[3]  = 0.0;
        data[4]  = 0.0; data[5]  = 1.0; data[6]  = 0.0; data[7]  = 0.0;
        data[8]  = 0.0; data[9]  = 0.0; data[10] = 1.0; data[11] = 0.0;
        data[12] = 0.0; data[13] = 0.0; data[14] = 0.0; data[15] = 1.0;
    }
};

Usage:

Matrix m;
//modify m.data

m.Identity1(); //~25 times faster
m.Identity2();

On my machine, Identity1() is about 25 times faster than the second function, and now I'm curious why there is such a big difference.

I also tried a third one:

void Identity3()
{
    // memset requires <cstring>
    memset(data, 0, sizeof(double) * 16);
    data[0]  = 1.0;
    data[5]  = 1.0;
    data[10] = 1.0;
    data[15] = 1.0;
}

But this is even slower than Identity2(), and I can't imagine why.
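A fourth variant worth timing, suggested by Jonathan Wakely in the comments, is plain assignment from a temporary; a minimal sketch (the name IdentityAssign is made up here):

void IdentityAssign()
{
    // Copy-assign from a default-constructed temporary, relying on
    // the implicitly-defined copy assignment operator.
    *this = Matrix();
}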


Profiling information

I have done several tests to rule out a profiling-related issue: the usual 'for loop' timing test, but also external profiling:

Profiling method 1 (the well-known for-loop test):

// requires <time.h>, <stdio.h> and <stdint.h>
struct timespec ts1;
struct timespec ts2;

clock_gettime(CLOCK_MONOTONIC, &ts1);

for (volatile int i = 0; i < 10000000; i++)
    m.Identity(); //use 1 or 2 here

clock_gettime(CLOCK_MONOTONIC, &ts2);

int64_t start = (int64_t)ts1.tv_sec * 1000000000 + (int64_t)ts1.tv_nsec;
int64_t elapsed = ((int64_t)ts2.tv_sec * 1000000000 + (int64_t)ts2.tv_nsec) - start;

if (elapsed < 0) // note: CLOCK_MONOTONIC never runs backwards, so this shouldn't fire
    elapsed += (int64_t)0x100000 * 1000000000;

printf("elapsed nanos: %ld\n", elapsed);

Method 2:

$ valgrind --tool=callgrind ./testcase

$ # for better overview:
$ python2 gprof2dot.py -f callgrind.out.22028 -e 0.0 -n 0.0 | dot -Tpng -o tree.png

Assembly information

As user T.C. noted in the comments, this might be helpful:

http://goo.gl/LC0RdG


Compilation and machine info

Compiled with: g++ --std=c++11 -O3 -g -pg -Wall

-pg is not the issue. I got the same time difference with measurement method 1 without using this flag.

Machine info (lscpu):

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 58
Model name:            Intel(R) Core(TM) i7-3612QM CPU @ 2.10GHz
Stepping:              9
CPU MHz:               2889.878
CPU max MHz:           3100.0000
CPU min MHz:           1200.0000
BogoMIPS:              4192.97
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              6144K
NUMA node0 CPU(s):     0-7
bricklore
  • I'm suspicious of your benchmarks! – Lightness Races in Orbit Aug 26 '15 at 09:57
  • You *are* profiling an optimised build, right? And how are you doing the measurements? – Angew is no longer proud of SO Aug 26 '15 at 09:58
  • edited post, it's -O3 - but that's making it even more interesting – bricklore Aug 26 '15 at 09:59
  • Did you try `*this = Matrix();` instead of placement new? i.e. relying on the built-in assignment – Jonathan Wakely Aug 26 '15 at 10:00
  • Can you post the code of the benchmark, please? – Paolo M Aug 26 '15 at 10:01
  • @JonathanWakely Maybe you could ask details and wait say an hour before trying to close? I don't understand the close frenzy of questions that aren't FUBAR. – curiousguy Aug 26 '15 at 10:08
  • You should compare the asm output. – curiousguy Aug 26 '15 at 10:09
  • @curiousguy: No offence, but 1 hour is simply unrealistic. The OP already did benchmarks, how long does it take to copy-paste the code? – Karoly Horvath Aug 26 '15 at 10:10
  • @curiousguy, it still doesn't meet the requirements of the minimum code necessary to reproduce the "problem" so a close vote is entirely appropriate. There are 10 million questions on this site, poor ones should be closed. – Jonathan Wakely Aug 26 '15 at 10:10
  • @JonathanWakely I agree that the question gives insufficient context information. **Any performance question should include compiler, compiling options, and target machine.** It isn't a dead question, the poster is responding, he has added information, so I feel it is way too early to close as the question is not inept, just incomplete. I agree there are many inept questions that show no research effort whatsoever and *when the poster isn't trying to fix his question the question should be closed*. – curiousguy Aug 26 '15 at 10:15
  • Target machine info added, see edit; profiling info will follow soon – bricklore Aug 26 '15 at 10:16
  • The asm output for the second one takes almost 5 times more instructions. Or do you also want to know why that is? – SChepurin Aug 26 '15 at 10:19
  • @curiousguy, questions can be reopened too if they are closed and subsequently improved. My close vote was within the guidelines of the site, get over it. – Jonathan Wakely Aug 26 '15 at 10:19
  • @JonathanWakely And guidelines can be interpreted. Get over it. – curiousguy Aug 26 '15 at 10:22
  • @KarolyHorvath "_No offence, but 1 hour is simply unrealistic_" No offence, but 10 minutes is simply realistic. – curiousguy Aug 26 '15 at 10:24
  • @curiousguy, take it to chat or meta, not here. "I don't like how you interpret the guidelines, interpret them the way I do" is off-topic and quite silly. And it **still** doesn't provide the code half an hour later. – Jonathan Wakely Aug 26 '15 at 10:25
  • I don't know how much harm the debugging and profiling info does, but it's worth a try to remove the flags. – Karoly Horvath Aug 26 '15 at 10:28
  • @KarolyHorvath, `-g` does absolutely no harm to run-time performance. `-pg` does. – Jonathan Wakely Aug 26 '15 at 10:28
  • The latter one obviously, you have to count the number of calls. Thx. – Karoly Horvath Aug 26 '15 at 10:29
  • no. I added `-pg` just to do the real profiling. That was after i encountered the issue. When using no external profiling application, (measuring time with repeatedly executing the same piece of code) the same difference can be seen. – bricklore Aug 26 '15 at 10:31
  • There does seem to be [some difference in codegen](http://goo.gl/JucPm5). – T.C. Aug 26 '15 at 10:41
  • @curiousguy profiling information updated. – bricklore Aug 26 '15 at 10:41
  • @JonathanWakely So 1 hour might not be **that** unrealistic, I'm sorry. But getting all those things together is no 10 minute thing, especially since I didn't expect so much feedback so quickly! But please see the info that I have added. – bricklore Aug 26 '15 at 10:48
  • In my experiment the difference between Identity1 and Identity2 is negligible; Identity3 is about twice as slow. – n. m. could be an AI Aug 26 '15 at 11:15
  • So it might be a machine-dependent thing; on my computer (info in the question) `Identity2` is ~25 times slower. Of course it's in a very small time range, but it is a constant factor, not execution-time variance. – bricklore Aug 26 '15 at 11:32
  • One interesting thing is that the assembly of the two methods actually differs that much. Is there a reason for that? – skyking Aug 26 '15 at 11:41
  • @skyking The initialiser is translated and optimised as a whole. The assignments are translated one by one and only slightly optimised. – curiousguy Aug 26 '15 at 11:47
  • gcc trunk generates the same code for Identity1 and Identity2. – Marc Glisse Aug 26 '15 at 11:49
  • @curiousguy I had the impression that GCC was better at optimizing than that. I might have guessed that there was a "fine-print" difference in the semantics of the functions that required the resulting code to be different. – skyking Aug 26 '15 at 11:50
  • Since you are not actually reading the values with Identity1, the entire initializer may have been optimized out. In the case of Identity3, the optimizer may refrain from such eliminations because the memory locations are overlapping. I suspect the out-of-order addressing may cause slight performance differences compared to Identity2, since you're busting the cache line size. – StarShine Aug 26 '15 at 23:38
  • On my machine, using gcc 4.9.2 and your 'profiling method 1' code, Identity2 is only slightly slower (less than 10% -- less than the variance from run to run) – Chris Dodd Aug 27 '15 at 02:48
  • @ChrisDodd So as I said, it seems to be highly machine-dependent. And StarShine, if you have a look at the assembly code you can see that it is not optimized out (in my case). – bricklore Aug 27 '15 at 06:13
  • Can you post the generated machine code for all 3 variants? Let's look under the hood. Also, the compiler can delete the whole loop. Make sure it can't do that (or at least make sure it didn't, which we will find out from the disassembly). – usr Aug 27 '15 at 09:03
  • @bricklore: did you ever sort this out? Was it just a case of the first test being slower because of cold caches or CPU frequency scaling? As I said in my answer, my test showed that your test loop was just testing an empty loop with a `volatile` loop counter. `Identity1` and `Identity2` compiled to somewhat different (sub-optimal) code, but they were never called. – Peter Cordes Feb 16 '16 at 22:23

3 Answers


Whatever your 25x time difference was measuring, it's not actually the difference between the two Identity() implementations.

With your timing code, both versions compile to exactly the same asm: an empty loop. The code you posted never uses m, so it gets optimized away. All that happens is loads/stores of the loop counter. (This happens because you used volatile int, telling gcc that the variable may be in memory-mapped I/O space, so every read/write of it appearing in the source must actually appear in the asm. MSVC gives the volatile keyword a different meaning, which goes beyond what the standard says.)

Have a look at the asm on godbolt. Here is your code, and the asm it turns into:

for (volatile int i = 0; i < 10000000; i++)
    m.Identity1();
// same output for gcc 4.8.2 through gcc 5.2.0, with -O3

# some setup before this loop:  mov $0, 8(%rsp)  then test if it reads back as 0
.L16:
    movl    8(%rsp), %eax
    addl    $1, %eax
    movl    %eax, 8(%rsp)
    movl    8(%rsp), %eax
    cmpl    $9999999, %eax
    jle .L16

  for (volatile int i = 0; i < 10000000; i++)
    m.Identity2();

# some setup before this loop:  mov $0, 12(%rsp)  then test if it reads back as 0
.L15:
    movl    12(%rsp), %eax
    addl    $1, %eax
    movl    %eax, 12(%rsp)
    movl    12(%rsp), %eax
    cmpl    $9999999, %eax
    jle .L15

As you can see, neither one calls either version of the Identity() function.

It's interesting to see in the asm for Identity1 that it uses integer movq for assignment of zeros, while Identity2 only uses scalar FP moves. This may have something to do with using 0.0 vs. 0, or it may be due to in-place new vs. simple assignment.

Either way, I see that gcc 5.2.0 doesn't vectorize the Identity functions unless you use -march=native. (In that case it uses AVX 32B loads/stores to copy from 4x 32B of data. Nothing clever like byte-shifting the registers to move the 1.0 to a different spot :/)

If gcc were smarter, it would do a single 16B store of two zeros instead of two movsd stores. Maybe it's assuming the data is unaligned, where the downside of a cache-line or page split on an unaligned store is a lot worse than the upside of saving a store insn when it is aligned.
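To illustrate what that could look like, here is a sketch using SSE2 intrinsics (identity_sse2 is a hypothetical helper written for this answer, not code gcc emits; it uses unaligned stores since 16-byte alignment of the member array isn't guaranteed):

#include <emmintrin.h> // SSE2 intrinsics

// Reset a row-major 4x4 matrix to identity with 16-byte stores,
// covering each pair of adjacent doubles with a single store.
void identity_sse2(double* data)
{
    const __m128d zz = _mm_setzero_pd();     // { 0.0, 0.0 }
    const __m128d oz = _mm_set_pd(0.0, 1.0); // { 1.0, 0.0 } (low, high)
    const __m128d zo = _mm_set_pd(1.0, 0.0); // { 0.0, 1.0 }
    _mm_storeu_pd(data + 0,  oz); // row 0: 1 0 0 0
    _mm_storeu_pd(data + 2,  zz);
    _mm_storeu_pd(data + 4,  zo); // row 1: 0 1 0 0
    _mm_storeu_pd(data + 6,  zz);
    _mm_storeu_pd(data + 8,  zz); // row 2: 0 0 1 0
    _mm_storeu_pd(data + 10, oz);
    _mm_storeu_pd(data + 12, zz); // row 3: 0 0 0 1
    _mm_storeu_pd(data + 14, zo);
}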


So whatever you timed with that code, it wasn't your functions (unless one build actually did the Identity and the other didn't). Either way, lose the volatile from your loop counter; that's totally counterproductive. Just look at the extra loads/stores it causes in the empty loops.
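For reference, a minimal sketch of a timing loop the optimizer can't empty out, assuming GCC/Clang (the empty asm statement acts purely as an optimization barrier, similar to the DoNotOptimize helpers in benchmark libraries):

Matrix m;
for (int i = 0; i < 10000000; i++) {
    m.Identity1(); // or Identity2()
    // Tell the compiler m's memory may be read and modified here,
    // so the call above can't be optimized away (GCC/Clang extension).
    asm volatile("" : : "r"(&m) : "memory");
}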

Peter Cordes
  • movsd doesn't do zero-extension, and carries the additional dependency on possible out-of-order register writes, making it slightly slower than movd (or movq). So it seems gcc indeed is unable to factor out a certain type of dependencies when optimizing. – StarShine Aug 28 '15 at 10:46
  • It seems that this would be a good test case for the GCC optimizer devs. Maybe it would make sense to report this as a perf issue. – usr Aug 28 '15 at 13:05
  • @StarShine: `movsd %xmm, (mem)` does a 64bit store, exactly the same as `movq`. Only the `movsd r/m, %xmm` form has a false dependency (which `movq` avoids by zeroing the top half, as you say). Interestingly, gcc uses `movq $0, offset(mem)` in the integer version, so there is no register involved. – Peter Cordes Aug 28 '15 at 15:18
  • Correction, `movsd %xmm, %xmm` has a dependency on the upper half of the dest register. The load form `movsd (mem), %xmm` zeros the upper half, avoiding any dependency. Use `movlps` or `movlpd (mem), %xmm` if you want a merge-load. (NB, the store forms of `movlpd` and `movsd` are identical in behaviour, even though they have different opcodes.) Also note that the `movq` is `mov $imm32, m64` with an AT&T-syntax q suffix, not `movq %xmm, (mem)` from a vector reg. – Peter Cordes Feb 16 '16 at 22:09

I bet you get the same performance if you copy from a constexpr array manually:

#include <algorithm> // std::copy
#include <iterator>  // std::begin, std::end

static constexpr double identity_data[16] = { 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1 };

void Identity3()
{
    std::copy(std::begin(identity_data), std::end(identity_data), data);
}
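Since the data is trivially copyable, a plain memcpy should compile to much the same thing; a minimal sketch (hypothetical name, assuming <cstring> is included):

void IdentityMemcpy()
{
    std::memcpy(data, identity_data, sizeof(identity_data));
}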
Thomas B.

Intrigued by the question, I found a very nice blog post on SSE instructions, discussing the performance of movq and movsd here:

http://www.gamedev.net/blog/615/entry-2250281-demystifying-sse-move-instructions/

Since the second set of instructions [movsd/movsq] don't do zero extension, you might think that they would be slightly faster than ones that have to do the extra filling of zeroes [movd/movq]. However, these instructions can introduce a false dependence on previous instructions, since the processor doesn't know whether you intended to use the extra data you didn't end up erasing. During out-of-order execution, this can cause stalls in the pipeline while the move instruction waits for any previous instructions that have to write to that register. If you didn't actually need this dependence, you've unnecessarily introduced a slowdown into your application.

So it is not the decoding itself that costs time; that is probably just as fast. The problem is in the pipeline: because of the partial register write, later instructions have to assume a dependency on the register's previous contents.

Trying out a few things on the assembly page, I was also amazed by how badly a simple memset translates into assembly, when all I expected was a simple rep stosq or an unrolled version of that.

StarShine
  • `movsd` as a load *does* zero the upper half of the dest register. Also, the `movq` discussed in that blog is the `movq xmm, r/m64` instruction (or `movq xmm, xmm/m64`). Not the scalar integer `mov r/m64, imm32` with an AT&T-syntax q suffix that `Identity1()` compiles to. None of the instructions in either asm output have any false dependencies on vector registers. I think you're right that a `rep stosq` to zero everything and then some scalar stores of the `1.0` values would do pretty well, but the startup overhead for such a short `rep stos` might lose vs. SSE. – Peter Cordes Feb 16 '16 at 22:15
  • And there is no `movsq`. Maybe you meant to write `movss` / `movsd`? Because the reg-reg forms of those insns do have false-dep problems if the dest register is part of a long dep chain. It's not a problem if the dest register is already "ready". Anyway, you can and should just use `movaps` for reg-reg moves if you don't care about the upper half of the dest register. That allows recent CPUs to handle the move at the register-rename stage (so-called mov-elimination), with zero latency and no execution unit needed. – Peter Cordes Feb 16 '16 at 22:18
  • Oops yes, you are right, apparently I totally misquoted that. Thanks for your clear eye and insight! Do you have a good resource where mov-elimination characteristics are explained? I understand the idea, but what are the side-effects (if any)? – StarShine Feb 17 '16 at 09:27
  • http://agner.org/optimize/ to understand the implications. It's pretty simple: no execution unit, and zero latency. It still takes an issue / retirement slot to track it through the pipeline, though (like an xor-zeroing), so it's still not free if you're anywhere near front-end limits. And of course code-size and uop-cache size aren't affected. IIRC, Agner Fog's guide says something about reg-reg moves not *always* being eliminated. I'm not sure how to tell when the front-end wouldn't be able to do it at reg-rename time. – Peter Cordes Feb 17 '16 at 10:21
  • Also, don't feel bad: it took me months to realize that `movsd` as a load was different from `movsd xmm, xmm`. For a long time I thought `movq xmm, m64` was "better". I wasn't actually working on any FP code, though, so I could just use `movq` without worrying about asm that looked confusing. – Peter Cordes Feb 17 '16 at 10:24