Intercalate characters from 5 strings in NASM assembly

Question

I have to code an assembly program that intercalates characters from five different strings that the user types on the keyboard, for example, if I had:

S1 : "Hello"
S2 : "Bye"
S3 : "Apple"
S4 : "Car"
S5 : "Tree"

it would result: "HBACTeyparleprelloe"

This is what I have so far. It can intercalate from strings with the same size. I don't know what to do to make it work for different sized strings, or whether there's a better way to do it. I would appreciate the help, since this is one of my first assembly programs.

segment .data
; Prompt text (Spanish): "Enter 5 strings of equal length, no longer than 20 characters:"
instruccion db 'Ingrese 5 cadenas de igual longitud no mayores a 20 caracteres:',0x0A
lonI EQU ($-instruccion)

segment .bss
contador resb 1
cad1 resb 20
cad2 resb 20
cad3 resb 20
cad4 resb 20
cad5 resb 20
cad6 resb 101

segment .text
    global _start
_start:
    mov edx,lonI
    mov ecx,instruccion
    call imprimir

    mov edx,20d
    mov ecx,cad1
    call leer

    mov ecx,cad2
    call leer

    mov ecx,cad3
    call leer

    mov ecx,cad4
    call leer

    mov ecx,cad5
    call leer

    mov edi,cad1
    mov ecx,255
    mov eax,0Ah
    repne scasb
    mov eax,255
    inc ecx
    sub eax,ecx

    mov edi,cad6
    mov ecx,eax
    mov ebx,0
    ciclo:
        mov esi,cad1
        cld
        mov edx,ecx
        mov ecx,ebx
        cmp ebx,0
        jne THEN1
        je ELSE1
        THEN1:
            lodsb
            loop THEN1
        ELSE1:
            movsb

        mov esi,cad2
        cld
        mov ecx,ebx
        cmp ebx,0
        jne THEN2
        je ELSE2
        THEN2:
            lodsb
            loop THEN2
        ELSE2:
            movsb

        mov esi,cad3
        cld
        mov ecx,ebx
        cmp ebx,0
        jne THEN3
        je ELSE3
        THEN3:
            lodsb
            loop THEN3
        ELSE3:
            movsb

        mov esi,cad4
        cld
        mov ecx,ebx
        cmp ebx,0
        jne THEN4
        je ELSE4
        THEN4:
            lodsb
            loop THEN4
        ELSE4:
            movsb

        mov esi,cad5
        cld
        mov ecx,ebx
        cmp ebx,0
        jne THEN5
        je ELSE5
        THEN5:
            lodsb
            loop THEN5
        ELSE5:
            movsb

        mov ecx,edx
        inc ebx
    loop ciclo

    mov eax,0Ah
    stosb
    mov edx,101d
    mov ecx,cad6
    call imprimir

    mov eax,1           ;system call number (sys_exit)
    int 0x80            ;call kernel

    leer:
        mov ebx,0
        mov eax,3
        int 0x80
        ret

    imprimir:
        mov ebx,1
        mov eax,4
        int 0x80
        ret

When I was playing with assembly language a while back I would write the program in C first so I had a clear picture of what I wanted to accomplish. Maybe you could do the same. Write a C program that does what you want and test it. Then you could post it with your question and it would be clear what you want your assembler program to do. — Bobby Durrett, Apr 10 '23 at 23:58
That advice is very good. Many people with an assembly project think I don't need it in C, that's a waste of time. But it is very hard to write code in assembly when you don't know what you want the program to do, so that's where pseudo code or better yet, C comes in. C is better, because you can actually run it to make sure it works. There's nothing harder than trying to fix a broken algorithm in assembly. So, doing a C version first ensures that your only assembly issues will be translation of if-then, while/for loops rather than algorithmic issues. — Erik Eidt, Apr 11 '23 at 02:31
I'd implement it with an array of pointers (or an unrolled inner loop with pointers in registers) - for each pointer, check if it points to a terminating `0` byte (`if ( *ptr == 0 )` or `movzx eax, byte [esi]` / `test eax, eax` / `jz`), and if so, skip appending from this one, otherwise advance it and copy the character. Keep iterating the outer loop until the output string stops growing. Tweaks could include shortening the array of pointers once one hits a terminating zero, e.g. taking it out of the rotation and copying the other pointers down to close the gap. — Peter Cordes, Apr 11 '23 at 03:04
If you had a power-of-2 number of input strings, you could interleave them with SIMD like `punpcklbw xmm0, xmm1` with 8 bytes of data from strings A and B (each padded with 0 bytes at the end), `punpcklbw xmm2, xmm3` from C and D, then `punpcklwd` / `punpckhwd` from those pairs. Then to filter 0s and close up the gaps, AVX-512 VBMI2 `vptestmb k1, ymm0, ymm0` / `vpcompressb ymm0{k1}, ymm0`. I guess you could mix in 3 vectors of all zeros to have a power-of-2 number of inputs; perhaps AVX512VBMI `vpermb` to space out each input with gaps of 5, and merge-masking. Or `vpermt2b` to interleave 2 — Peter Cordes, Apr 11 '23 at 03:10
What's the point of your loops like `THEN3: lodsb` / `loop THEN3` and then fall-through to `ELSE3: movsb`? I think you're just looping to get to the right position in the string, very inefficiently. You could use `add esi, ebx` instead of iterating `lodsb`, or better `lea esi, [cad3 + ebx]` / `movsb`. Or even better, `movzx eax, byte [cad3+ebx]` / `stosb` to just load the byte you want and store it. — Peter Cordes, Apr 11 '23 at 03:15
Thanks @PeterCordes. I tried what you describe to remove my THEN-ELSE loops before I posted here, but I did it wrong. Thank you for your advice. — Emilio Díaz, Apr 11 '23 at 05:20

Sep Roland · Answer 1 · 2023-04-23T20:23:14.300

it would result: "HBACTeyparleprelloe"

I sure hope this was a typo because otherwise this would become a very nasty exercise indeed! I will be assuming "HBACTeyparleprelleoe".

it can intercalate from strings with the same size

Your present code seems to do that correctly, but why is it so convoluted?
If the current index (offset in the string) is 0, you just do movsb. If the current index isn't 0, you need to skip ahead, and you do so with a (wasteful) loop of lodsb instructions. People sometimes wonder why rep lodsb is even allowed; this would be a use case of sorts. Not really, though, since the practical solution is to replace:

mov esi,cad1
cld
mov ecx,ebx
cmp ebx,0
jne THEN1
je ELSE1
THEN1:
    lodsb
    loop THEN1
ELSE1:
    movsb

entirely by:

lea     esi, [cad1 + ebx]
movsb

or alternatively by:

movzx   eax, byte [cad1 + ebx]
stosb
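
In C terms, both replacements amount to indexing the string directly instead of walking to position EBX one byte at a time. Here is a minimal sketch for the equal-length case. The function name `interleave_equal` and its parameters are illustrative assumptions, not taken from the question:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Interleave `count` strings that all hold exactly `len` characters.
 * `out` needs room for count * len + 1 bytes. Returns the output length.
 * Illustrative helper; the name and signature are assumptions. */
size_t interleave_equal(const char *const strs[], size_t count,
                        size_t len, char *out)
{
    size_t k = 0;

    for (size_t s = 0; s < count; s++)
        assert(strlen(strs[s]) == len);   /* equal-length precondition */

    for (size_t i = 0; i < len; i++)      /* i plays the role of EBX */
        for (size_t s = 0; s < count; s++)
            out[k++] = strs[s][i];        /* direct indexed load, then store */
    out[k] = '\0';
    return k;
}
```

The inner statement corresponds to `movzx eax, byte [cadN + ebx]` / `stosb`: one load and one store per character, with no skip-ahead loop.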

I don't know what to do to make it work for different sized strings

Below I will present 3 solutions, all tested.

Solution 1

Because there are 5 input strings precisely, the 32-bit x86 architecture has just the right number of registers to keep individual pointers in their own register. This approach gives the fastest code but only if the lengths of the individual strings don't differ by too much.

S:      db      43 dup 0
S1:     db      "Hello", 10
S2:     db      "Bye", 10
S3:     db      "AppleADayKeepsTheDoctorAway", 10
S4:     db      "Car", 10
S5:     db      "Tree", 10

        ...

Begin:  mov     ebx, S1                 ; Addresses of the input strings
        mov     ecx, S2
        mov     edx, S3
        mov     esi, S4
        mov     edi, S5
        mov     ebp, S                  ; Address of the output string
.a:     push    ebp                     ; (1)

        movzx   eax, byte [ebx]         ; Read a character from this string
        cmp     al, 10                  ; If this string is exhausted, then
        je      .b                      ;  no longer add to the output string
        inc     ebx                     ; Go to the next character in this string
        mov     [ebp], al               ; Add character to the output string
        inc     ebp

.b:     movzx   eax, byte [ecx]         ; Read a character from this string
        cmp     al, 10                  ; If this string is exhausted, then
        je      .c                      ;  no longer add to the output string
        inc     ecx                     ; Go to the next character in this string
        mov     [ebp], al               ; Add character to the output string
        inc     ebp

.c:     movzx   eax, byte [edx]         ; Read a character from this string
        cmp     al, 10                  ; If this string is exhausted, then
        je      .d                      ;  no longer add to the output string
        inc     edx                     ; Go to the next character in this string
        mov     [ebp], al               ; Add character to the output string
        inc     ebp

.d:     movzx   eax, byte [esi]         ; Read a character from this string
        cmp     al, 10                  ; If this string is exhausted, then
        je      .e                      ;  no longer add to the output string
        inc     esi                     ; Go to the next character in this string
        mov     [ebp], al               ; Add character to the output string
        inc     ebp

.e:     movzx   eax, byte [edi]         ; Read a character from this string
        cmp     al, 10                  ; If this string is exhausted, then
        je      .f                      ;  no longer add to the output string
        inc     edi                     ; Go to the next character in this string
        mov     [ebp], al               ; Add character to the output string
        inc     ebp

.f:     pop     eax                     ; (1)
        cmp     eax, ebp                ; Was anything added to the output string ?
        jne     .a                      ; Yes, then repeat
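
For readers who prefer to check the control flow in C first, as the comments under the question suggest, the loop above could be modelled roughly as follows. This is a sketch, not a translation verified instruction by instruction. The name `interleave5` is hypothetical, and the output gets a terminating 0 that the assembly does not write:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NSTR 5   /* Solution 1 is fixed at five inputs */

/* Each input ends with '\n' (10), as in the assembly data.
 * `out` must have room for all characters plus a terminating 0.
 * Returns the number of characters written. Hypothetical helper. */
size_t interleave5(const char *const s[NSTR], char *out)
{
    const char *p[NSTR];
    char *dst = out;
    char *before;

    assert(s != NULL && out != NULL);
    memcpy(p, s, sizeof p);                 /* private copies of the 5 pointers */

    do {
        before = dst;                       /* push ebp */
        for (int i = 0; i < NSTR; i++) {    /* unrolled as .a to .e in the assembly */
            if (*p[i] != '\n')              /* string not exhausted yet */
                *dst++ = *p[i]++;           /* copy one character and advance */
        }
    } while (dst != before);                /* pop eax / cmp eax, ebp / jne .a */

    *dst = '\0';
    return (size_t)(dst - out);
}
```

The loop ends after the first full pass that adds nothing, which is why an exhausted string is still inspected on every pass.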

Solution 2

A minor edit allows us to process any number of input strings. This approach is slower than before (only slightly so in my timings), and it requires padding the strings so that they all have the same length (like you had it in your question).

S:      db      43 dup 0
S1:     db      "Hello", 22 dup 10, 10
S2:     db      "Bye", 24 dup 10, 10
S3:     db      "AppleADayKeepsTheDoctorAway", 10
S4:     db      "Car", 24 dup 10, 10
S5:     db      "Tree", 23 dup 10, 10

        ...

Begin:  xor     ebx, ebx                ; Current offset in every string
        mov     ebp, S                  ; Address of the output string
.a:     push    ebp                     ; (1)

        movzx   eax, byte [S1 + ebx]    ; Read a character from this string
        cmp     al, 10                  ; If this string is exhausted, then
        je      .b                      ;  no longer add to the output string
        mov     [ebp], al               ; Add character to the output string
        inc     ebp

.b:     movzx   eax, byte [S2 + ebx]    ; Read a character from this string
        cmp     al, 10                  ; If this string is exhausted, then
        je      .c                      ;  no longer add to the output string
        mov     [ebp], al               ; Add character to the output string
        inc     ebp

.c:     movzx   eax, byte [S3 + ebx]    ; Read a character from this string
        cmp     al, 10                  ; If this string is exhausted, then
        je      .d                      ;  no longer add to the output string
        mov     [ebp], al               ; Add character to the output string
        inc     ebp

.d:     movzx   eax, byte [S4 + ebx]    ; Read a character from this string
        cmp     al, 10                  ; If this string is exhausted, then
        je      .e                      ;  no longer add to the output string
        mov     [ebp], al               ; Add character to the output string
        inc     ebp

.e:     movzx   eax, byte [S5 + ebx]    ; Read a character from this string
        cmp     al, 10                  ; If this string is exhausted, then
        je      .f                      ;  no longer add to the output string
        mov     [ebp], al               ; Add character to the output string
        inc     ebp

.f:     inc     ebx                     ; Go to next character in every string
        pop     eax                     ; (1)
        cmp     eax, ebp                ; Was anything added to the output string ?
        jne     .a                      ; Yes, then repeat
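
The same idea in C, as a sketch: one shared index replaces the per-string pointers, which only works because every string is padded with newlines up to the length of the longest one. The name `interleave_padded` and the `count` parameter are assumptions made for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* All `count` strings must be padded with '\n' so that they share the
 * same total length, as in the S1..S5 data above.
 * Returns the number of characters written. Hypothetical helper. */
size_t interleave_padded(const char *const s[], size_t count, char *out)
{
    size_t k = 0;        /* output position (EBP in the assembly) */
    size_t i = 0;        /* shared offset into every string (EBX) */
    size_t before;

    assert(s != NULL && out != NULL);

    do {
        before = k;
        for (size_t n = 0; n < count; n++) {
            char c = s[n][i];
            if (c != '\n')       /* padding or terminator: add nothing */
                out[k++] = c;
        }
        i++;                     /* next character in every string */
    } while (k != before);       /* stop once a whole pass added nothing */

    out[k] = '\0';
    return k;
}
```

Without the padding, `s[n][i]` would read past the end of the shorter strings, which is the cost of dropping the individual pointers.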

Solution 3

This time we create an array with pointers to the individual strings. These pointers get used in succession to retrieve a character from the associated string, and when we encounter the end-of-string marker (10), we simply remove the concerned pointer from the array. The other solutions kept dealing with an exhausted string, but here an exhausted string vanishes from the loop.
Because this method has more housekeeping to do, it runs slower on your very regular test data. However, once you feed it a more realistic data set, one with both short and long strings, it will shine. There is also no limit on the number of input strings, no padding is required, and the string buffers need not all be the same size (as they are in your program).

P:      dd      S1, S2, S3, S4, S5, 0
S:      db      43 dup 0
S1:     db      "Hello", 10
S2:     db      "Bye", 10
S3:     db      "AppleADayKeepsTheDoctorAway", 10
S4:     db      "Car", 10
S5:     db      "Tree", 10

        ...

Begin:  mov     ebp, S                  ; Address of the output string
        jmp     .e

.a:     mov     edi, ebx
.b:     mov     eax, [edi+4]            ; Move all the stringpointers that follow
        mov     [edi], eax              ;  one position down in the array
        add     edi, 4
        test    eax, eax                ; Until the zero-terminator got moved down
        jnz     .b
        jmp     .d                      ; Continue with the next stringpointer

.c:     movzx   eax, byte [esi]         ; Read a character from the current string
        cmp     al, 10                  ; If this string is exhausted, then
        je      .a                      ;  go remove its pointer from the array
        inc     esi                     ; Go to the next character in the current string
        mov     [ebx], esi              ; Update the current stringpointer
        add     ebx, 4                  ; Go to the next stringpointer
        mov     [ebp], al               ; Add character to the output string
        inc     ebp
.d:     mov     esi, [ebx]              ; Get current stringpointer
        test    esi, esi                ; Arrived at the end of the array if ESI is zero
        jnz     .c
.e:     mov     ebx, P                  ; Address of the array with stringpointers
        mov     esi, [ebx]              ; Get current stringpointer
        test    esi, esi                ; The array is empty if the 1st dword is zero
        jnz     .c
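
The pointer-removal logic is the part most easily gotten wrong, so here is a C sketch of it. It assumes, like the assembly, a NULL-terminated array of pointers that gets modified in place. The name `interleave_shrinking` is hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* `p` is a NULL-terminated array of pointers to '\n'-terminated strings.
 * The array is consumed: on return p[0] is NULL.
 * Returns the number of characters written. Hypothetical helper. */
size_t interleave_shrinking(const char *p[], char *out)
{
    char *dst = out;

    assert(p != NULL && out != NULL);

    while (p[0] != NULL) {                  /* .e: array not yet empty */
        size_t i = 0;
        while (p[i] != NULL) {              /* .d: one pass over the live pointers */
            if (*p[i] == '\n') {            /* .a/.b: exhausted, close the gap */
                size_t j = i;
                do {
                    p[j] = p[j + 1];
                } while (p[j++] != NULL);   /* until the NULL terminator moved down */
            } else {
                *dst++ = *p[i]++;           /* .c: copy and advance this pointer */
                i++;
            }
        }
    }

    *dst = '\0';
    return (size_t)(dst - out);
}
```

After a removal the index is not advanced, because the slot now holds the next string's pointer. The assembly does the same by jumping to `.d` without touching EBX.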

Solution 1   Solution 2   Solution 3   Test data
0.4 µsec     0.5 µsec     1.1 µsec     5 short strings
2.0 µsec     2.1 µsec     1.5 µsec     with 1 long string

Expected output:

HBACTeyparleprelleoeADayKeepsTheDoctorAway

[EDIT]

Building upon the many ideas kindly provided by @PeterCordes through comments, and throwing in a couple of new ideas of my own, I was able to write the following faster solutions. (I have dismissed the earlier solution 2 for the reason of the excessive padding that it requires.)

Solution 1b

Switching the roles of EBP and EDI, as Peter suggested, already sped up the code by 25%. Adding instructions that set the pointer of an exhausted string to zero, as a cheap way to skip that string from then on, improved the code by another 20%. I did give stosb a chance, but abandoned the idea because it made the code run 17% slower.

S:      db      43 dup 0
S1:     db      "Hello", 10
S2:     db      "Bye", 10
S3:     db      "AppleADayKeepsTheDoctorAway", 10
S4:     db      "Car", 10
S5:     db      "Tree", 10

        ...

        mov     ebx, S1
        mov     ecx, S2
        mov     edx, S3
        mov     esi, S4
        mov     ebp, S5
        mov     edi, S
.a:     push    edi

        test    ebx, ebx
        jz      .b
        movzx   eax, byte [ebx]
        cmp     al, 10
        je      .clr1
        inc     ebx
        mov     [edi], al
        inc     edi

.b:     test    ecx, ecx
        jz      .c
        movzx   eax, byte [ecx]
        cmp     al, 10
        je      .clr2
        inc     ecx
        mov     [edi], al
        inc     edi

.c:     test    edx, edx
        jz      .d
        movzx   eax, byte [edx]
        cmp     al, 10
        je      .clr3
        inc     edx
        mov     [edi], al
        inc     edi

.d:     test    esi, esi
        jz      .e
        movzx   eax, byte [esi]
        cmp     al, 10
        je      .clr4
        inc     esi
        mov     [edi], al
        inc     edi

.e:     test    ebp, ebp
        jz      .f
        movzx   eax, byte [ebp]
        cmp     al, 10
        je      .clr5
        inc     ebp
        mov     [edi], al
        inc     edi

.f:     pop     eax
        cmp     eax, edi
        jne     .a
        ...

.clr1:  xor     ebx, ebx
        jmp     .b
.clr2:  xor     ecx, ecx
        jmp     .c
.clr3:  xor     edx, edx
        jmp     .d
.clr4:  xor     esi, esi
        jmp     .e
.clr5:  xor     ebp, ebp
        jmp     .f
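
In C, the extra null tests could look like the sketch below. It assumes the same five newline-terminated inputs. The name `interleave5_nulling` is made up for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NSTR 5

/* Like Solution 1, but an exhausted string has its pointer set to NULL
 * so that later passes skip it without reading memory.
 * Returns the number of characters written. Hypothetical helper. */
size_t interleave5_nulling(const char *const s[NSTR], char *out)
{
    const char *p[NSTR];
    char *dst = out;
    char *before;

    assert(s != NULL && out != NULL);
    memcpy(p, s, sizeof p);

    do {
        before = dst;                    /* push edi */
        for (int i = 0; i < NSTR; i++) {
            if (p[i] == NULL)            /* test reg, reg / jz: no load needed */
                continue;
            if (*p[i] == '\n') {         /* .clrN: retire this string for good */
                p[i] = NULL;
                continue;
            }
            *dst++ = *p[i]++;
        }
    } while (dst != before);             /* pop eax / cmp eax, edi / jne .a */

    *dst = '\0';
    return (size_t)(dst - out);
}
```

The register test replaces a load plus compare for every exhausted string, which matters most once only one long string remains.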

Solution 3b

The key improvements are:

- having the top of the inner loop (.c) 16-byte-aligned
- maintaining a count of pointers instead of zero-terminating the array
- early-exiting so the remainder of the last-remaining string can get copied verbatim

Using stosb didn't hurt the execution time here (the only gain is code size), so I kept it this time.

P:      dd      S1, S2, S3, S4, S5
S:      db      43 dup 0
S1:     db      "Hello", 10
S2:     db      "Bye", 10
S3:     db      "AppleADayKeepsTheDoctorAway", 10
S4:     db      "Car", 10
S5:     db      "Tree", 10

        ...

        mov     ebx, P          ; Address of the pointers array
        mov     esi, [ebx]
        mov     edi, S          ; Address of the destination string
        mov     ebp, 5          ; Number of remaining pointers
        mov     edx, ebp        ; Inner loop counter
        jmp     .c

        db      (16-($+21) and 15) dup 0 ; 16-byte aligning `.c`

.a:     dec     ebp
        dec     edx
        jz      .d              ; Nothing to copy (is last pointer)
        mov     esi, ebx
        mov     ecx, edx
.b:     mov     eax, [esi+4]
        mov     [esi], eax
        add     esi, 4
        dec     ecx
        jnz     .b
        mov     esi, [ebx]

.c:     movzx   eax, byte [esi]
        cmp     al, 10
        je      .a
        inc     esi
        mov     [ebx], esi
        add     ebx, 4
        stosb
        mov     esi, [ebx]
        dec     edx
        jnz     .c

.d:     mov     ebx, P
        mov     esi, [ebx]
        mov     edx, ebp
        cmp     ebp, 1
        ja      .c              ; Continue while at least 2 strings remain
        jb      .f              ; Done if none remains

        movzx   eax, byte [esi] ; Copy remainder of last-remaining string quickly
        cmp     al, 10
        je      .f
.e:     inc     esi
        stosb
        movzx   eax, byte [esi]
        cmp     al, 10
        jne     .e
.f:     ...
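
A C sketch of the counted variant with the early exit follows. The alignment trick has no C equivalent and is left out. The name `interleave_counted` and the use of `memmove` for closing the gap are choices made for this illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* `p` holds `count` pointers to '\n'-terminated strings and is modified
 * in place. Returns the number of characters written. Hypothetical helper. */
size_t interleave_counted(const char *p[], size_t count, char *out)
{
    char *dst = out;

    assert(p != NULL && out != NULL);

    while (count > 1) {                     /* .d: at least 2 strings remain */
        size_t i = 0;
        while (i < count) {                 /* one pass over the live pointers */
            if (*p[i] == '\n') {            /* .a/.b: drop pointer, close the gap */
                count--;
                memmove(&p[i], &p[i + 1], (count - i) * sizeof p[0]);
            } else {
                *dst++ = *p[i]++;           /* .c: copy and advance */
                i++;
            }
        }
    }

    if (count == 1) {                       /* .e: copy the last string's tail */
        const char *src = p[0];
        while (*src != '\n')
            *dst++ = *src++;
    }

    *dst = '\0';
    return (size_t)(dst - out);
}
```

Tracking `count` removes the load that was needed to detect the end of the array. The tail loop avoids the pointer store and reload on every character once a single string is left.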

Solution 1b          Solution 3b          Comment
0.3875 µsec (0.4)    0.6681 µsec (1.1)    5 short strings
1.1866 µsec (2.0)    0.9033 µsec (1.5)    with 1 long string

Values in parentheses are the earlier timings of Solutions 1 and 3.

Solution 1 can deal with at most 5 strings.
Solution 3 can deal with any number of strings.

Since you enjoy code-size optimizations, EBP is the worst choice for which register to keep the output pointer in. `[ebp]` takes an extra byte in the encoding (as `[ebp + disp8=0]`), and you have to use `mov [ebp], al` 5 times, vs. each of the other pointers once as a load. Unfortunately we can't use EBP as the temporary to load into / store from, since BPL is only usable in 64-bit mode, and it would take a REX prefix anyway defeating the size saving (also the `cmp al, 10` saves size). I'd probably use EDI for the output pointer, whether or not I'm using `stosb` for size optimization. — Peter Cordes, Apr 20 '23 at 02:06
Instead of requiring newline terminators, I might do `cmp al, 13` / `jb` to detect any CR / LF or NUL terminator. But that also ends on TAB (ASCII 9), so isn't really viable. The push/pop in the outer loop of strategy 1 is probably fine, but I'd probably do it the way a compiler would: `mov [esp], ebp` / loop body / `cmp ebp, [esp]` / `jne loop`. Slightly larger code size, but saves a uop inside the loop assuming the cmp and micro+macro fuse. Of course you have to reserve stack space ahead of the loop, but you should also push / pop the call-preserved regs you're using. — Peter Cordes, Apr 20 '23 at 02:13
What CPU did you time on? IIRC you have a Core Duo or something? So no uop cache. On a modern Intel, version 2 would probably be about the same speed as version 1. — Peter Cordes, Apr 20 '23 at 02:15
And BTW, thanks for writing up the ideas I posted in comments. Interesting to see what it looks like in code, especially the array of pointers and closing the gap. We could remove some store/reload latency there by having the gap-close code (`.a`) update ESI, like `.a: mov esi, [ebx+4]`, and putting `.d:` at the `test esi,esi`, not the `mov esi, [ebx]`. Or instead of jumping to that `test`, we could do our own `test esi,esi` / `jz .e` after the copy loop, otherwise falling through into `.c:`. — Peter Cordes, Apr 20 '23 at 02:37
Also, for a fixed length of 5 strings, we could do a fixed 16-byte copy (`movups xmm0, [edi+4]` / `movups [edi], xmm0` or maybe 2x `movq mm0` copies on old Intel CPUs; `movups` is more expensive pre Nehalem, and 128-bit ops aren't single uop anyway before Core 2, with `movups` being bad on Pentium M, worse vs 2x movq). That pulls in up to 12 bytes of data past the terminating null pointer. (Maybe pad the pointer array to avoid a store-forwarding stall since the data afterward is S: which will be recently written, or put the output somewhere else so the garbage we pull in isn't recent stores.) — Peter Cordes, Apr 20 '23 at 02:44
I wonder if tracking the array length (or end-pointer) in a register would be more efficient than a terminating null? Could reduce latency to detect branch mispredicts by about 4 cycles (L1d load-use latency). — Peter Cordes, Apr 20 '23 at 02:44
I tested your version 3 on Skylake at 3.9GHz. In a 10000000 iteration repeat loop to hide startup overhead (like page faults, caches, TLBs, and CPU frequency stuff) when timing a static executable with `perf stat`, with the data shown in the code block, it runs 1 iteration per 0.062 µsec, or 239.3 core clock cycles, at 2.8 IPC. Or aligning the code in a way that avoids Skylake JCC erratum penalties, 0.042 µsec or 164 clocks at 4.11 IPC. (I used a 32-byte AVX copy to cheaply reset the pointer array in the outer loop: https://godbolt.org/z/433fh17WY includes perf stat output in a comment). — Peter Cordes, Apr 20 '23 at 05:06
Of course, that many iterations on the same data also trains the branch predictors, pretty much fully learning the pattern, with `44,242` branch misses total out of `1,400,000,192` total branches (mispredict rate of 3.1e-5). So IT-TAGE makes a repeat loop totally unrealistic. IDK if you did any warm-up at all, or if just branch mispredicts and front-end bottlenecks on an older P6 were enough to make it 40 to 50x slower. (Out-of-order exec across repeat-loop iterations must help, too. Esp. once we're down to one pointer, with its store/reload latency bottleneck on pointer updates.) — Peter Cordes, Apr 20 '23 at 05:06
I tried using a 16-byte copy instead of the move-down loop. It's actually slower, like 45.3 ns instead of 40.7 ns (best case when duplicating the `.d:` branch to the end of the copy loop), with about 31M counts for `ld_blocks.store_forward` for 10M outer-loop iterations. So just over 3 store-forwarding stalls per outer iteration on average. Probably esp. bad that they come soon after each other with similar-length strings all ending. A hard-coded 4 dword copies (with 4 loads/4 stores not looping) isn't good either, back up to 43.1 ns. Or 42.3 ns with vmovd + vpinsrd / movlps 8 byte store. — Peter Cordes, Apr 20 '23 at 11:38
https://godbolt.org/z/Gd45obKhj has perf results, and the best version of both the copy-loop and unconditional-copy versions. — Peter Cordes, Apr 20 '23 at 12:09
@PeterCordes Thanks for your improvement tips, especially the one about not using EBP. I updated my answer with a couple of faster versions of the first and third solutions. I tested the code on an Intel® Pentium® dual-core processor T2080 (1.73 GHz, 533 MHz FSB, 1 MB L2 cache). The previous time I only used the sandbox of my assembler and got coarse timings that were good enough to compare solutions. This time I got adequate timings by using a repeating loop (x1000) and averaging (stable) repetitions of this loop. — Sep Roland, Apr 23 '23 at 13:01
Cool, thanks for the update. I was curious how much it would help to make denser code on an old P6-family CPU without a uop cache; surprised it was as much as 25%. It's not linear with code-size, so maybe this just hit a sweet spot for decode groups where it previously didn't. BTW, "paragraph-aligned" is an odd way to describe things when you're not using real mode segmentation. It's 16-byte aligned for microarchitectural reasons (code fetch after a jump, especially on CPUs without a uop cache); the fact that 16 bytes also happened to be the size of a real-mode paragraph is a coincidence. — Peter Cordes, Apr 23 '23 at 13:14
Multi-uop instructions like `stosb` can only decode in the first slot, so can really be a problem on P6-family CPUs unless they're spaced just right. (e.g. a 3-1-1 decode pattern or something.) Your extra NULL-pointer tests in 1b to skip even loading is definitely optimizing for the one-string-left case. I'm surprised it helps much; Pentium-M can macro-fuse cmp/jcc, so load + cmp+jcc is just 2 uops. Loads and branches can each only run at 1/clock (with perfect prediction). Perhaps having separate branches predicts better somehow? Or maybe other front-end effects like decode. — Peter Cordes, Apr 23 '23 at 13:21
The 1-string-left tail can be done 8 or 16 bytes at a time. Your CPU has faster MMX than SSE2 especially for unaligned data, and the strings you're testing with aren't very long (so not many multiples of 16, with a lot of bytes left over). It's strcpy but with a `10` terminator instead of `0`, and you don't have to avoid reading past the end of your input strings. `movq` / `pcmpeqb` / `pmovmskb eax, mm0` / test/jnz. If you leave space in your buffer, you can even let it copy the whole final vector and just terminate the string or return the correct length based on `bsf` on the `pmovmskb` — Peter Cordes, Apr 23 '23 at 13:27
I'd forgotten Pentium-M / Core Duo has "a loop buffer of 4x16 bytes storing predecoded instructions", I was thinking that wasn't until Core 2. But I just noticed that in Agner Fog's Pentium-M chapter (https://agner.org/optimize/ microarch PDF). I forget if taken branches within each of those four 16-byte chunks of machine code still work. — Peter Cordes, Apr 23 '23 at 13:33
Also, I don't suppose you were able to profile your code for branch-mispredict rate? e.g. Linux `perf` or maybe `oprofile` for a CPU that old, or Intel VTune. It'll be non-zero without the IT-TAGE predictors in Haswell and later. I guess we can work out the IPC based on time and frequency, and it's significantly less than 1. (I had to make the strings significantly longer, like 16 to 40 each to start to get some misses on SKL, and even then I was mostly just getting 1 miss per repeat-loop iteration IIRC.) — Peter Cordes, Apr 23 '23 at 13:36
Whether or not your NULL pointer checks improve branch prediction rate, they might at least reduce branch *latency* (time to detect a mispredict and recover). After some previous branch mispredicted (so out-of-order exec didn't already have the result ready), a load + compare has to wait about 3 cycles IIRC from executing the load until the compare+branch can check the result. (And the load can't dispatch + execute until the front-end issues it after some earlier branch mispredict.) But a register compare has no extra latency. — Peter Cordes, Apr 24 '23 at 07:38