I have a loop I use to add numbers with carry.
I'm wondering whether having .done: aligned would give me anything? After all, it will branch there only once per call to the function. I know that a C compiler is likely to align all branches affected by a loop. But I'm thinking that it should not cause any penalty (especially since we have rather large instruction caches nowadays).
// corresponding C function declaration
// int add(uint64_t * a, uint64_t const * b, uint64_t const * c, uint64_t size);
//
// Compile with: gcc -c add.s -o add.o
//
// WARNING: at this point I have not sorted out the input registers or the
// registers to save, so do not call this exact code from your C program.
.text
.p2align 4,,15
.globl add
.type add, @function
add:
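test %rcx, %rcx               # size == 0? (test also clears CF)
je .done
clc                           # start without a carry (the xor below clears CF as well)
xor %rbp, %rbp                # index = 0
.p2align 4,,10
.p2align 3
.loop:
mov (%rax, %rbp, 8), %rdx     # load one 64-bit word
adc (%rbx, %rbp, 8), %rdx     # add the other word plus the carry
mov %rdx, (%rdi, %rbp, 8)     # store the result
inc %rbp                      # inc and dec leave CF untouched
dec %rcx
jrcxz .done                   # leave when the count reaches zero, without reading flags
jmp .loop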
// -- is alignment here necessary? --
.done:
setc %al
movzx %al, %rax
ret
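For comparison, here is a minimal C sketch of what the loop above appears to compute. It assumes that a is the destination, that b and c are the two inputs, and that the return value is the final carry. The name add_ref is hypothetical and only serves as a reference to check the assembly against.

```c
#include <stdint.h>

/* Hypothetical reference version: a[i] = b[i] + c[i] + carry.
   Assumes a is the destination and the result is the carry out. */
int add_ref(uint64_t *a, uint64_t const *b, uint64_t const *c, uint64_t size)
{
    unsigned carry = 0;                      /* same role as the cleared CF */
    for (uint64_t i = 0; i < size; ++i) {
        uint64_t partial = b[i] + c[i];      /* may wrap around */
        unsigned c1 = partial < b[i];        /* carry out of b + c */
        uint64_t sum = partial + carry;      /* add the incoming carry */
        unsigned c2 = sum < partial;         /* carry out of that step */
        a[i] = sum;
        carry = c1 | c2;                     /* both cannot be set at once */
    }
    return (int)carry;                       /* 0 when size == 0 */
}
```

The two comparisons stand in for what adc does in a single instruction, which is the main reason to write this part in assembly at all.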
Is there clear documentation about this specific case by Intel or AMD?
I actually decided to simplify by removing the loop: I only have 3 sizes (128, 256, and 512 bits), so it's easy enough to write the additions out unrolled. I only need an add, so I don't really want to pull in GMP for this.
Here is the final code, which should work in your C program. This one is for 512 bits specifically. For the 256-bit version use just three of the add_with_carry lines, and for the 128-bit version just one.
// corresponding C function declaration
// void add512(uint64_t * dst, uint64_t const * src);
//
.macro add_with_carry offset
mov \offset(%rsi), %rax
adc %rax, \offset(%rdi)
.endm
.text
.p2align 4,,15
.globl add512
.type add512, @function
add512:
mov (%rsi), %rax
add %rax, (%rdi)
add_with_carry 8
add_with_carry 16
add_with_carry 24
add_with_carry 32
add_with_carry 40
add_with_carry 48
add_with_carry 56
ret
Note that I do not need the clc since I use add the first time (the incoming carry is ignored). I also made it add to the destination (i.e. dest[n] += src[n] in C) because I'm not likely to need a copy in my code.
The offsets let me avoid incrementing the pointers, and each one costs only one extra byte in the instruction encoding (an 8-bit displacement).
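To check the unrolled routine, a plain C version of the same in-place addition can be handy. The sketch below assumes the 64-bit words are stored least significant first, as the offsets in add512 imply. The names add_in_place_ref and add512_ref are hypothetical, and the final carry is dropped to match the void return type.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: dst += src over `limbs` 64-bit words,
   least significant word first. */
static void add_in_place_ref(uint64_t *dst, uint64_t const *src, size_t limbs)
{
    unsigned carry = 0;                      /* first word has no incoming carry */
    for (size_t i = 0; i < limbs; ++i) {
        uint64_t partial = dst[i] + src[i];  /* may wrap around */
        unsigned c1 = partial < src[i];      /* carry out of dst + src */
        uint64_t sum = partial + carry;
        unsigned c2 = sum < partial;         /* carry out of adding the carry */
        dst[i] = sum;
        carry = c1 | c2;
    }
    /* the carry out of the top word is discarded */
}

/* 512 bits = 8 words; use 4 for 256 bits and 2 for 128 bits. */
void add512_ref(uint64_t *dst, uint64_t const *src)
{
    add_in_place_ref(dst, src, 8);
}
```

Running the assembly and this reference on the same inputs, including values that carry across several words, should give identical results.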