the book makes it very clear that you have six registers available to put six variables into
You're reading a book about 32-bit x86. (And the book assumes EBP will be used as a frame pointer, leaving only 6 of the 8 integer regs as truly general purpose)
You're compiling for x86-64 with optimization enabled, which includes -fomit-frame-pointer, so you actually have 15 general-purpose integer registers.
What I need to know is why the program [function] intlen() puts all of its values onto the stack
That's not quite what's going on. x stays in RDI instead of being spilled to the stack on function entry, like you'd get if you disabled optimization (gcc -O0). Compile without optimization to see a big difference.
The compiler is keeping vars in regs as much as possible, but v and buf have to exist in memory because you pass pointers to them to a non-inline function.
You seem to have disabled inlining of iptoa somehow. Maybe you compiled with only -O1, because you don't have __attribute__((noinline)) on your definition of iptoa. If you enabled full optimization (-O3), you'd see that v is optimized away and you just get a movq %rdi, %rdx to pass x as the 3rd arg to sprintf.
Passing &v to a non-inline iptoa means the memory for v has to be "in sync", because iptoa is allowed to read that memory via the pointer you passed it. See also "escape analysis": if a pointer to a variable "escapes" the function, the compiler can't optimize it away or do too many weird things with it.
IDK why you're passing an integer by reference; you've written code that forces the compiler to use memory for most of its variables. (If it can't inline.)
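To make the "escape" point concrete, here is a minimal sketch. The function names are hypothetical (they are not the iptoa / intlen from the question), and the NOINLINE macro assumes a GNU-compatible compiler (gcc or clang); elsewhere it expands to nothing. Comparing the optimized asm of the two callers should show a store to the stack only in the first one.

```c
#if defined(__GNUC__)
#define NOINLINE __attribute__((noinline))   /* GNU C extension (gcc/clang) */
#else
#define NOINLINE                             /* assumption: no equivalent used here */
#endif

/* Hypothetical callee that takes its integer by reference and is not inlined. */
NOINLINE long read_via_pointer(const long *p) {
    return *p + 1;
}

/* Hypothetical callee that takes its integer by value. */
static inline long read_by_value(long v) {
    return v + 1;
}

/* &v escapes to a real call, so v must exist in memory before the call. */
long caller_escapes(long x) {
    long v = x;
    return read_via_pointer(&v);
}

/* Nothing takes v's address, so v can stay in a register (or vanish entirely). */
long caller_stays_in_reg(long x) {
    long v = x;
    return read_by_value(v);
}
```

Both callers return the same value; only the code the compiler is allowed to generate differs.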
BTW, you know your function is very inefficient, right? You don't need to calculate every decimal digit with sprintf, just find the first power of 10 that's greater-than the number.
int intlen_fast(long x) {
    unsigned long absx = x;
    unsigned len = 1;       // even 0..9 takes 1 decimal digit
    if (x < 0) {
        absx = 0UL - absx;  // negate as unsigned: correctly handles the most-negative 2's complement integer
                            // (plain -x on the signed value would be signed overflow for LONG_MIN)
        len = 2;            // one digit plus the minus sign
    }
    // don't need to check for overflow of pow10 with 64-bit integers
    // but in general we do to get the right count. (TODO)
    for (unsigned long pow10 = 10; pow10 <= absx; pow10 *= 10) {
        len++;
    }
    return len;
}
Doing pow10 *= 10; is significantly more efficient than x /= 10, even with optimized division by a compile-time constant.
For 64-bit unsigned long, this has the very nice property that abs(LLONG_MIN) = 9223372036854775808ULL, and the next highest power of 10 doesn't overflow unsigned long long. (ULLONG_MAX = 18446744073709551615ULL)
If that wasn't the case (like for 32-bit unsigned long in other ABIs), you'd need to check for the special case of absx >= 1000000000 to correctly handle input magnitudes in the range 1000000000 to 2147483648, because the next power of 10 doesn't fit in 32 bits: 2^32-1 = 4294967295. (Fortunately we don't get an infinite loop, just up to 2 extra iterations until pow10 = 0xd4a51000, which is unsigned above the magnitude of any signed 32-bit integer. But it's still the wrong answer!)
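A sketch of that 32-bit case, using fixed-width types so the wraparound shows up regardless of how wide long is on your platform. Both function names are hypothetical; the second one just adds the absx >= 1000000000 check described above.

```c
#include <stdint.h>

/* The plain loop with 32-bit types: pow10 wraps around after 10^9. */
int intlen32_naive(int32_t x) {
    uint32_t absx = (uint32_t)x;
    int len = 1;
    if (x < 0) {
        absx = 0u - absx;   /* unsigned negation, fine for INT32_MIN */
        len = 2;
    }
    for (uint32_t pow10 = 10; pow10 <= absx; pow10 *= 10)
        len++;
    return len;
}

/* Same loop, but magnitudes with 10 digits are handled before it runs,
   so pow10 never needs to exceed 10^9. */
int intlen32_checked(int32_t x) {
    uint32_t absx = (uint32_t)x;
    int len = 1;
    if (x < 0) {
        absx = 0u - absx;
        len = 2;
    }
    if (absx >= 1000000000u)
        return len + 9;     /* 10 digits, plus the sign if any */
    for (uint32_t pow10 = 10; pow10 <= absx; pow10 *= 10)
        len++;
    return len;
}
```

With these definitions the naive version over-counts INT32_MAX by two, while 1000000000 itself happens to come out right because the first wrapped value of pow10 (1410065408) is still larger than it.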
In general, C++'s std::numeric_limits<long>::digits10 vs. std::numeric_limits<unsigned long>::digits10 might be useful for detecting at compile time whether we need an extra check. Or actually not, because it rounds down the binary bit-width times std::log10(2).
Maybe a compile-time check based on the power of 10 that LONG_MAX rounds down to (see "How to round down to the nearest power of 10?") being less than that of ULONG_MAX, if your compiler can do constant-propagation through floor(log10(ULONG_MAX)).
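One way to sketch that check without floating point is an integer floor(log10) helper. This swaps in an integer loop for the floor(log10(...)) call mentioned above, and both function names are hypothetical. Whether a given compiler folds it to a constant is an assumption you'd want to confirm in the asm.

```c
#include <stdint.h>
#include <limits.h>

/* Integer floor(log10(v)) for v >= 1 (returns 0 for v == 0 as well). */
static inline unsigned ilog10_floor(uintmax_t v) {
    unsigned n = 0;
    while (v >= 10) {
        v /= 10;
        n++;
    }
    return n;
}

/* Nonzero if the first power of 10 above the largest signed magnitude
   still fits in the unsigned type, i.e. the plain pow10 loop can't wrap. */
static inline int pow10_loop_is_safe(uintmax_t signed_max, uintmax_t unsigned_max) {
    return ilog10_floor(signed_max) < ilog10_floor(unsigned_max);
}

/* Intended use for the type in this answer:
       if (!pow10_loop_is_safe(LONG_MAX, ULONG_MAX)) { ...extra check needed... }
*/
```

For 64-bit types the two logs are 18 and 19, so the loop is safe; for 32-bit types both are 9, so the extra check is needed.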
If you didn't want to worry about the details of pow10 maybe overflowing, it would still be much faster than calling sprintf to just do repeated division by 10 to count digits.
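A minimal sketch of the repeated-division version (the name intlen_div is hypothetical). It relies on C99 and later defining signed division as truncating toward zero, so negative inputs need no negation at all, only the extra count for the minus sign.

```c
/* Count characters needed to print x in decimal, including a leading '-'. */
int intlen_div(long x) {
    int len = (x < 0) ? 2 : 1;   /* first digit, plus the sign if negative */
    while ((x /= 10) != 0)       /* truncating division: -9 / 10 == 0 */
        len++;
    return len;
}
```

This does one division per digit, so it is slower than the multiply loop, but there is no pow10 that could wrap.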
Or maybe do one division by 10, and then loop pow10 upward. That would be safe from overflow / wraparound, and simple. (But you still have to handle negative input specially).
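A sketch of that one-division idea, written with 32-bit fixed-width types because that is a width where the plain loop would wrap. The name is hypothetical. Comparing pow10 against absx / 10 means pow10 never has to go above absx itself.

```c
#include <stdint.h>

int intlen32_onediv(int32_t x) {
    uint32_t absx = (uint32_t)x;
    int len = 1;
    if (x < 0) {
        absx = 0u - absx;        /* unsigned negation, fine for INT32_MIN */
        len = 2;                 /* one digit plus the minus sign */
    }
    uint32_t limit = absx / 10;  /* the single division */
    /* Each power of 10 that fits under absx/10 means one more digit.
       The loop stops at a pow10 <= 10 * limit <= absx, so no wraparound. */
    for (uint32_t pow10 = 1; pow10 <= limit; pow10 *= 10)
        len++;
    return len;
}
```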
But anyway, the optimized version from gcc8.3 -O3 does keep all its variables in registers, of course (Godbolt compiler explorer). -fstack-protector-strong has no effect on this function because it doesn't have any arrays.
# gcc8.3 -O3 -fverbose-asm -fstack-protector-strong
intlen_fast(long):
        testq   %rdi, %rdi              # x
        js      .L14                    #,
        movl    $1, %eax                #, <retval>
        movl    $1, %edx                #, len
.L15:
        cmpq    $9, %rdi                #, absx
        jbe     .L13                    #,
        movl    $10, %eax               #, pow10
.L17:
        leaq    (%rax,%rax,4), %rax     #, tmp95      # pow10 * 5
        addl    $1, %edx                #, len
        addq    %rax, %rax              # pow10       # pow10 *= 10
        cmpq    %rax, %rdi              # pow10, absx
        jnb     .L17                    #,
        movl    %edx, %eax              # len, <retval>
.L13:
        ret
.L14:
        negq    %rdi                    # absx
        movl    $2, %eax                #, <retval>
        movl    $2, %edx                #, len
        jmp     .L15                    #
(It looks like a missed optimization that gcc sets both EAX and EDX. It should just use RDX inside the loop for pow10 and keep len in EAX.)
See the Godbolt link for some test callers that show it works for corner cases like -9, 99, 100, and 101 without off-by-one errors. And for large inputs.
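If you'd rather check locally, something like the following sketch cross-checks against snprintf. It is not the test code from the Godbolt link. count_mismatches is a hypothetical helper, intlen_fast is repeated here (with unsigned negation) only so the block is self-contained, and the large cases are included only when long is wider than 32 bits.

```c
#include <stdio.h>
#include <limits.h>

int intlen_fast(long x) {
    unsigned long absx = x;
    unsigned len = 1;
    if (x < 0) {
        absx = 0UL - absx;   /* unsigned negation, fine for LONG_MIN */
        len = 2;
    }
    for (unsigned long pow10 = 10; pow10 <= absx; pow10 *= 10)
        len++;
    return len;
}

/* Returns how many test inputs disagree with the length snprintf reports. */
int count_mismatches(void) {
    static const long cases[] = {
        0, 9, 10, -9, -10, 99, 100, 101, -99, -100,
#if ULONG_MAX > 0xFFFFFFFFUL
        LONG_MAX, LONG_MIN,  /* only safe for the plain loop with 64-bit long */
#endif
    };
    int bad = 0;
    for (size_t i = 0; i < sizeof cases / sizeof cases[0]; i++) {
        /* snprintf with a zero-size buffer returns the length it would write */
        int expected = snprintf(NULL, 0, "%ld", cases[i]);
        if (intlen_fast(cases[i]) != expected)
            bad++;
    }
    return bad;
}
```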