10

It is commonly said that a static variable initialization is wrapped in an if to prevent it from being initialized multiple times.

For this, and other one-off conditions, it would be more efficient to have the code remove the conditional after the first pass via self modification.

Are C++ compilers allowed to generate such code and if not, why? I heard that it might have a negative impact on the cache, but I don’t know the details.

janekb04
  • As code and data may exist in different address spaces, there is no guaranteed C++ solution. Allowed to? Sure, UB. – chux - Reinstate Monica Aug 24 '20 at 21:37
  • Even if compilers were allowed to do this, from a practical implementation standpoint HOW do you think they would be able to do it? The initialization of a `static` has to be thread-safe, which means the generated code would have to somehow make sure that another thread doesn't try to access the `static` while the initialization is being modified out. – Remy Lebeau Aug 24 '20 at 21:38
  • While it should be possible, one thing you have to deal with that makes it complicated is threads. You can have N threads call a function that has a static variable and that would be very hard to modify the code they are using. – NathanOliver Aug 24 '20 at 21:39
  • If the code is hot, or at least in the branch predictor, then after a few calls it will know to skip the initialization check as the condition will never change after it has been initialized. – NathanOliver Aug 24 '20 at 21:41
  • @RemyLebeau: That's not a hard problem to solve. Start the function with `jmp rel32` or whatever to a "cold" section of code that does mutual exclusion to run the non-constant static initializer in one thread. Once construction is fully done, use an 8-byte atomic CAS or store to replace that 5-byte instruction with different instruction bytes. Possibly just a NOP, or possibly something useful that was done at the top of the "cold" code. Or on non-x86, just a single word store can replace one jump instruction. Of course the large problem is doing this on a system with W^X memory protection. – Peter Cordes Aug 24 '20 at 22:32
  • @PeterCordes And what is to stop *another thread* from executing that same `jmp` instruction before it can be overwritten? – Remy Lebeau Aug 24 '20 at 22:57
  • @RemyLebeau: Nothing. That's why you put the same mutual exclusion code as normal in the code you jump to, like I said. Its size barely matters for long-term execution speed because it only runs during startup, and it can be evicted from RAM if nothing else in the page is hot (that's why you group "cold" init functions together in a section.) – Peter Cordes Aug 24 '20 at 23:02
  • Please note that most operating systems do not allow self-modifying code. And if the code is running from the Flash memory of a microcontroller, even the hardware makes self-modifying code impossible. If the C++ or C compiler generated self-modifying code, it would not be possible to use this compiler for these operating systems. – Martin Rosenau Aug 25 '20 at 18:49

3 Answers

9

There's nothing preventing a compiler from implementing what you suggest but it's a rather heavyweight solution to a very minor performance problem.

To implement the self-modifying code, the compiler, for a typical C++ implementation running on Windows or Linux, would have to insert code that changes the permissions on the code page(s), modifies the code, and then restores the permissions. These operations could easily cost far more cycles than the implied "if" would take over the lifetime of the program.
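To make that permission dance concrete, here is a minimal sketch of what such inserted code would have to do. It assumes Linux/x86-64 and POSIX `mmap`/`mprotect`; the machine-code bytes and the `demo` function are invented for illustration, and the two `mprotect` calls are exactly the system-call overhead being discussed.

```cpp
#include <sys/mman.h>
#include <cassert>
#include <cstring>

// Patch one immediate byte in a tiny hand-assembled function, flipping
// the page permissions around the write the way compiler-inserted
// self-modification on a W^X system would have to.
int demo() {
    // x86-64 machine code: mov eax, 1 ; ret
    unsigned char code[] = {0xB8, 0x01, 0x00, 0x00, 0x00, 0xC3};
    const size_t pagesz = 4096;
    void* page = mmap(nullptr, pagesz, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    assert(page != MAP_FAILED);
    std::memcpy(page, code, sizeof code);
    mprotect(page, pagesz, PROT_READ | PROT_EXEC);   // make it executable
    auto fn = reinterpret_cast<int (*)()>(page);
    int before = fn();                               // returns 1
    mprotect(page, pagesz, PROT_READ | PROT_WRITE);  // syscall #1: drop exec
    static_cast<unsigned char*>(page)[1] = 2;        // patch the mov immediate
    mprotect(page, pagesz, PROT_READ | PROT_EXEC);   // syscall #2: restore
    int after = fn();                                // returns 2
    munmap(page, pagesz);
    return before * 10 + after;                      // 12
}
```

Each call to `mprotect` is a full kernel round-trip plus TLB shootdowns, which is why doing this to remove a single predicted-not-taken branch is a net loss for almost any program.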

This would also have the consequence of preventing the modified code pages from being shared between processes. That may seem inconsequential, but compilers often pessimize their code (pretty badly in the case of i386) in order to implement position-independent code that can be loaded at different addresses at runtime without being modified, precisely so that code pages can stay shared.

As Remy Lebeau and Nathan Oliver mention in the comments, there are also thread-safety issues to consider, but they can probably be dealt with, as there are various existing solutions for hot-patching executables like this.

Ross Ridge
7

Yes, that would be legal. ISO C++ makes zero guarantees about being able to access data (machine code) through function pointers cast to unsigned char*. On most real implementations it's well defined, except on pure-Harvard machines where code and data have separate address spaces.

Hot-patching (usually by external tools) is a thing, and is very doable if compilers generate code to make that easy, i.e. the function starts with a long-enough instruction that can be atomically replaced.

As Ross points out, a major obstacle to self-modification on most C++ implementations is that they make programs for OSes that normally map executable pages read-only. W^X is an important security feature to avoid code injection. Only for very long-running programs with very hot code paths would it be overall worth it to make the necessary system calls to make the page read+write+exec temporarily, atomically modify an instruction, then flip it back.

And impossible on systems like OpenBSD that truly enforce W^X, not letting a process mprotect a page with both PROT_WRITE and PROT_EXEC. Making a page temporarily non-executable doesn't work if other threads can call the function at any moment.

It is commonly said that a static variable initialization is wrapped in an if to prevent it from being initialized multiple times.

Only for non-constant initializers, and of course only for static locals. A local like static int foo = 1; will compile the same as at global scope, to a .long 1 (GCC for x86, GAS syntax) with a label on it.

But yes, with a non-constant initializer, compilers will invent a guard variable they can test. They arrange things so the guard variable is read-only, not like a readers/writers lock, but that does still cost a couple extra instructions on the fast path.

e.g.

int init();

int foo() {
    static int counter = init();
    return ++counter;
}

compiled with GCC10.2 -O3 for x86-64

foo():             # with demangled symbol names
        movzx   eax, BYTE PTR guard variable for foo()::counter[rip]
        test    al, al
        je      .L16
        mov     eax, DWORD PTR foo()::counter[rip]
        add     eax, 1
        mov     DWORD PTR foo()::counter[rip], eax
        ret

.L16:  # slow path
   acquire lock, one thread does the init while the others wait

So the fast path check costs 2 uops on mainstream CPUs: one zero-extending byte load, one macro-fused test-and-branch (test + je) that's not-taken. But yes, it has non-zero code-size for both L1i cache and decoded-uop cache, and non-zero cost to issue through the front-end. And an extra byte of static data that has to stay hot in cache for good performance.

Normally inlining makes this negligible. If you're actually calling a function with this at the start often enough to matter, the rest of the call/ret overhead is a larger problem.
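The guard logic above can be sketched in portable C++. This is only an approximation of what the compiler emits (the real Itanium C++ ABI calls `__cxa_guard_acquire`/`__cxa_guard_release` rather than using a plain mutex), and `init` is a stand-in for the non-constant initializer; the point is the shape of the fast path: one acquire load and one branch, matching the `movzx`/`test`/`je` sequence.

```cpp
#include <atomic>
#include <mutex>

int init() { return 41; }               // stand-in for the non-constant initializer

static std::atomic<bool> guard{false};  // the compiler-invented guard variable
static std::mutex guard_mutex;          // real ABI: __cxa_guard_acquire/release
static int counter;                     // foo()::counter, not yet initialized

int foo() {
    // Fast path: one acquire load plus a not-taken branch.
    if (!guard.load(std::memory_order_acquire)) {
        // Slow path: one thread runs the initializer while the others wait.
        std::lock_guard<std::mutex> lk(guard_mutex);
        if (!guard.load(std::memory_order_relaxed)) {  // recheck under the lock
            counter = init();
            guard.store(true, std::memory_order_release);
        }
    }
    return ++counter;   // the increment itself is unsynchronized, as in the original
}
```

The first call runs `init()` and returns 42; subsequent calls only pay for the load-and-branch before returning 43, 44, and so on.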

But things aren't so nice on ISAs without cheap acquire loads (e.g. ARM before ARMv8). Instead of somehow arranging to barrier() all threads once after initializing the static variable, every check of the guard variable is an acquire load. But on ARMv7 and earlier, that's done with a full memory barrier dmb ish (data memory barrier: inner shareable) that includes draining the store buffer, exactly the same as for atomic_thread_fence(mo_seq_cst). (ARMv8 has ldar (word) / ldarb (byte) to do acquire loads, making them nice and cheap.)

Godbolt with ARMv7 clang

# ARM 32-bit clang 10.0 -O3 -mcpu=cortex-a15
# GCC output is even more verbose because of Cortex-A15 tuning choices.
foo():
        push    {r4, r5, r11, lr}
        add     r11, sp, #8
        ldr     r5, .LCPI0_0           @ load a PC-relative offset to the guard var
.LPC0_0:
        add     r5, pc, r5
        ldrb    r0, [r5, #4]           @ load the guard var
        dmb     ish                    @ full barrier, making it an acquire load
        tst     r0, #1
        beq     .LBB0_2                @ go to slow path if low bit of guard var == 0
.LBB0_1:
        ldr     r0, .LCPI0_1           @ PC-relative load of a PC-relative offset
.LPC0_1:
        ldr     r0, [pc, r0]           @ load counter
        add     r0, r0, #1             @ ++counter leaving value in return value reg
        str     r0, [r5]               @ store back to memory, IDK why a different addressing mode than the load.  Probably a missed optimization.
        pop     {r4, r5, r11, pc}      @ return by popping saved LR into PC

But just for fun, let's look at exactly how your idea could be implemented.

Assuming you can PROT_WRITE|PROT_EXEC (to use POSIX terminology) a page containing the code, it's not a hard problem to solve for most ISAs, such as x86.

Start the function with jmp rel32 or whatever to a "cold" section of code that does mutual exclusion to run the non-constant static initializer in one thread. (So if you do have multiple threads start to run it before one finishes and modifies the code, it all works the way it does now.)

Once construction is fully done, use an 8-byte atomic CAS or store to replace that 5-byte instruction with different instruction bytes. Possibly just a NOP, or possibly something useful that was done at the top of the "cold" code.

Or on non-x86 ISAs with fixed-width instructions of a width the machine can store atomically, a single word store can replace one jump instruction.
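For fun, here is the recipe above as a single-threaded sketch, again assuming Linux/x86-64 and an OS that permits a writable+executable mapping (so no OpenBSD). The function starts with a `jmp rel32` to a "cold" path; after the "initialization" we overwrite that 5-byte jump with NOPs so later calls fall straight through to the fast path. A real implementation would do the patch with an 8-byte atomic store or CAS and handle the mutual exclusion; a plain `memset` stands in for that here.

```cpp
#include <sys/mman.h>
#include <cassert>
#include <cstring>

// Hand-assembled x86-64 layout: entry jmp -> cold path, then a fast path.
int demo() {
    unsigned char code[] = {
        0xE9, 0x06, 0x00, 0x00, 0x00,  // 0:  jmp +6 -> cold path at offset 11
        0xB8, 0x02, 0x00, 0x00, 0x00,  // 5:  mov eax, 2   (fast path)
        0xC3,                          // 10: ret
        0xB8, 0x01, 0x00, 0x00, 0x00,  // 11: mov eax, 1   (cold/init path)
        0xC3,                          // 16: ret
    };
    const size_t pagesz = 4096;
    void* page = mmap(nullptr, pagesz, PROT_READ | PROT_WRITE | PROT_EXEC,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    assert(page != MAP_FAILED);
    std::memcpy(page, code, sizeof code);
    auto fn = reinterpret_cast<int (*)()>(page);
    int first = fn();             // takes the jmp: cold path returns 1
    std::memset(page, 0x90, 5);   // patch: 5 one-byte NOPs over the jmp
    int later = fn();             // falls through: fast path returns 2
    munmap(page, pagesz);
    return first * 10 + later;    // 12
}
```

On x86 this works without an explicit instruction-cache flush because the architecture guarantees coherence for self-modifying code; on most other ISAs you would also need a cache-maintenance step (e.g. `__builtin___clear_cache`).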

Peter Cordes
  • Hmm... I guess on OpenBSD you'd need copy the page to be modified to a new page, modify the new page and then map the new page over the old. – Ross Ridge Aug 25 '20 at 00:05
  • @RossRidge: Interesting idea, yeah probably better than suspending all other threads while you modify a page with this one. In practice raising the cost of doing this means that not self-modifying in the first place is more attractive. – Peter Cordes Aug 25 '20 at 00:09
6

Back in the olden days, the 8086 processor didn’t know anything about floating-point math. You could add a math coprocessor, the 8087, and write code that used it. Floating-point code consisted of “trap” instructions that transferred control to the 8087 to execute a floating-point operation.

Borland’s compiler could be set to generate floating-point code that detected at runtime whether there was a coprocessor installed. The first time each fp instruction was executed, it would jump to an internal routine that would backpatch the instruction, with an 8087 trap instruction (followed by a couple of NOPs) if there was a coprocessor, and a call to an appropriate library routine if there wasn’t. Then the internal routine would jump back to the patched instruction.

So, yes, it can be done. Sort of. As various comments have pointed out, modern architectures make this kind of thing hard or impossible.

Earlier versions of Windows had a system call that re-mapped memory segment selectors between data and code. If you called PrestoChangoSelector (yes, that was its name) with a data segment selector it would give you back a code segment selector that pointed at the same physical memory, and vice versa.

Pete Becker
  • IIRC, another example was `geninterrupt(n)`, which was supposed to generate software interrupt vector `n`. Since the `INT` instruction on 8086 only takes the vector as an immediate, this was implemented with self-modifying code. – Nate Eldredge Aug 25 '20 at 00:35
  • @NateEldredge -- yup. `geninterrupt(n)` created an opcode of 0xCD, followed by `n`. And that had to be executable code. That's where I learned about `PrestoChangoSelector`. – Pete Becker Aug 25 '20 at 12:32
  • @NateEldredge -- but, on reflection, `geninterrupt(n)` wasn't self-modifying code, just code generation on the fly. It built the INT instruction in the data segment, then made that segment executable. – Pete Becker Aug 25 '20 at 12:39
  • I'm thinking back to 8086 itself (or real mode 386), where there was no memory protection, so everything was always executable. I do seem to recall it was done by modifying code in place; if the INT instruction was built somewhere else, there'd be an unnecessary jump. – Nate Eldredge Aug 25 '20 at 15:01
  • @NateEldredge -- could be that I'm misremembering. – Pete Becker Aug 25 '20 at 15:59