Does SEH make stores/writes more expensive?

Question

I'm comparing two different methods of buffer checking.

The first method is to check on every iteration if the end of the buffer has been reached, and the second method is to use a guard page to detect the end.

While the guard page method should in theory be faster, this does not appear to be the case.

The disparity between the two is even worse for stores, where the guard page method takes 5x longer than the buffer check method.

What's causing this to happen?

Benchmarks on my machine (QueryPerformanceCounter ticks as printed by the code below, averages over 10 trials):

branch + load:  58947659.3
branch + store: 15234306.6
seh + load:     84706608.6
seh + store:    84822314.3

My code:

#include <Windows.h>
#include <stdio.h>

#define BUFFER_SIZE (16ull * 1024ull * 1024ull * 1024ull)

//remove this to do stores
#define LOAD

//remove this to use seh
#define USE_BRANCH

int main()
{
    HANDLE consoleHandle = GetStdHandle(STD_OUTPUT_HANDLE);

    char* memory = VirtualAlloc(NULL, BUFFER_SIZE, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    if (memory == NULL)
        return 0;

    unsigned long long total = 0;
    char* memoryStart = memory;
#ifdef USE_BRANCH
    
    LARGE_INTEGER perfcountBefore;
    QueryPerformanceCounter(&perfcountBefore);

    while (memory < memoryStart + BUFFER_SIZE)
    {
#ifdef LOAD
        total += *memory;
#else
        (*memory)++;
#endif
        memory++;
    }
    
    LARGE_INTEGER perfcountAfter;
    QueryPerformanceCounter(&perfcountAfter);

    char buffer[30];
    // QuadPart is a 64-bit LONGLONG, so it needs %lld rather than %i
    int stringlength = _snprintf_s(buffer, 30, _TRUNCATE, "operation took %lld\n", perfcountAfter.QuadPart - perfcountBefore.QuadPart);
    WriteConsoleA(consoleHandle, buffer, stringlength, NULL, NULL);
#else
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    DWORD garbage;
    VirtualProtect(memory + BUFFER_SIZE - si.dwPageSize, si.dwPageSize, PAGE_READWRITE | PAGE_GUARD, &garbage);

    LARGE_INTEGER perfcountBefore;
    QueryPerformanceCounter(&perfcountBefore);
    __try
    {
        while (1)
        {
#ifdef LOAD
            total += *memory;
#else
            (*memory)++;
#endif
            memory++;
        }
    }
    __except (EXCEPTION_EXECUTE_HANDLER)
    {
        while (memory < memoryStart + BUFFER_SIZE)
        {
#ifdef LOAD
            total += *memory;
#else
            (*memory)++;
#endif
            memory++;
        }
        LARGE_INTEGER perfcountAfter;
        QueryPerformanceCounter(&perfcountAfter);

        char buffer[30];
        // QuadPart is a 64-bit LONGLONG, so it needs %lld rather than %i
        int stringlength = _snprintf_s(buffer, 30, _TRUNCATE, "operation took %lld\n", perfcountAfter.QuadPart - perfcountBefore.QuadPart);
        WriteConsoleA(consoleHandle, buffer, stringlength, NULL, NULL);
    }
#endif

    return total;
}

Why do you have the storage loop again within the exception handler? — 500 - Internal Server Error, Mar 23 '23 at 22:41
If you don't get "operation took 0" then you forgot to test the Release build. Perf tests on debug code are not useful. — Hans Passant, Mar 23 '23 at 22:55
@HansPassant I was testing the release build, and why would it say 0? — Badasahog, Mar 23 '23 at 22:56

score 2 · Answer 1 · answered Mar 23 '23 at 23:06

As always with micro-optimizations, you need to take a look at the generated code. For the "normal" loop, you get this:

$LL2@loop:
    movsx   rdx, BYTE PTR [rcx]
    lea     rcx, QWORD PTR [rcx+1]
    add     r9, rdx
    inc     r8
    cmp     r8, r10
    jb      SHORT $LL2@loop

For your SEH loop:

$LL13@loop:
    movsx   rax, BYTE PTR [rcx]
    add     rdx, rax
    mov     QWORD PTR total$1[rsp], rdx
    inc     rcx
    mov     QWORD PTR memory$[rsp], rcx
    jmp     SHORT $LL13@loop

Using a __try block has the side effect that the compiler treats every memory access inside it as something that may raise an exception. As a result, your local variables total and memory are no longer kept purely in registers: both are written back to the stack on every iteration (the two extra mov instructions above), which adds two memory stores per loop pass. This is actually somewhat sensible; if it didn't assume side effects, the compiler would just see the infinite loop and optimize everything away.
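To get a feel for that extra memory traffic without SEH, here is a rough portable analogy (not what MSVC actually emits). The function names sum_in_registers and sum_spilled are made up for this sketch. Marking the locals volatile forces the compiler to go through memory for every update. Note that volatile also forces reloads, which the SEH loop above does not do, so this overstates the effect somewhat.

```c
#include <stddef.h>

/* Baseline: total and p are ordinary locals, so an optimizing compiler
   is free to keep both in registers for the whole loop. */
unsigned long long sum_in_registers(const char *buf, size_t len)
{
    unsigned long long total = 0;
    const char *p = buf;
    const char *end = buf + len;

    while (p < end) {
        total += *p;
        p++;
    }
    return total;
}

/* Analogy only (an assumption of this sketch, not the SEH mechanism):
   volatile makes every update of total and p go through memory, similar
   in spirit to the per-iteration stack stores in the __try version. */
unsigned long long sum_spilled(const char *buf, size_t len)
{
    volatile unsigned long long total = 0;
    const char *volatile p = buf;
    const char *end = buf + len;

    while (p < end) {
        total += *p;  /* read and write total in memory */
        p++;          /* read and write p in memory */
    }
    return total;
}
```

Both functions return the same sum; timing them over a large buffer in a Release build should show the cost of touching memory on every iteration, though the exact ratio will depend on the compiler and CPU.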

score 0 · Answer 2 · answered Apr 06 '23 at 16:15

The expectation that SEH is inexpensive is incorrect.

SEH can be inexpensive only if the exception is not raised. Raising an SEH exception is always expensive.

This aligns with the normal practice of using exceptions only to handle exceptional cases, not normal control flow.

So you may use the inaccessible-page technique to make sure a buffer overrun does not happen accidentally, but you should not use it as the normal check for the end of the buffer.

As for SEH when the exception is not raised: on x86-64 it can be cheaper than the usual flow, because the exception metadata is stored separately from the code and no extra instructions are inserted. On 32-bit x86, SEH is still implemented with extra instructions, which may or may not be cheaper than the usual checks; most likely SEH is more expensive there.
