Why does this program dispense the following canary values and assembly code?

Question

In the following program from this book, when I read the assembly that corresponds to intlen(), I see that it sets up a protected canary value, as well as several other values that are ALL placed onto the stack.

My problem with this is that the book makes it very clear that you have six registers available to put six variables into, and once you put these variables into the registers and once you go past 6 registers, THEN everything goes onto the stack.

What I need to know is why the program intlen() puts all of its values onto the stack and understand why the canary value is placed where it is.

I've already tried google searching the answer as well as counting the variables and arguments in previous programs, because 'calling' is still a thing, right? Thing is, these variables in the previous programs only go up to a count of four.

Edit: I also would like to know how much len allocates on the stack pointer when protected by a Canary value. Here is how I think len works. the argument *s is worth 8 bits, the stack protector is another 8 bits since we are on a 64 bit system, and the stack frame on return is 8 bits, so it requires a total of 24 bits, right?

/* C Code */ 
int len(char *s){
  return strlen(s);
}

void iptoa(char *s, long *p){
  long val = *p; 
  sprintf(s, "%ld", val); 
}

int intlen(long x){
  long v; 
  char buf[12]; 
  v = x; 
  iptoa(buf, &v); 
  return len(buf); 
}

=====assembly counterpart=======

without stack protector

intlen:
  subq  $40, %rsp
  movq  %rdi, 24(%rsp)
  leaq  24(%rsp), %rsi
  movq  %rsp, %rdi
  call  iptoa

With protector

intlen:
  subq  $56, %rsp
  movq  %fs:40, %rax      # < Canary value
  movq  %rax, 40(%rsp)    # < Where the canary goes (Why does this go here?)
  xorl  %eax, %eax
  movq  %rdi, 8(%rsp)
  leaq  8(%rsp), %rsi
  leaq  16(%rsp), %rdi
  call  iptoa

I expect most of the variables to be in registers, but everything is put onto the stack pointer as you can see, and I don't really understand why yet. Thank you for your time.

The canary is supposed to go onto the stack; that's the whole point of a stack canary. It is there as a simple protection against stack smashing attacks, where the return address is overwritten. — hingev, May 01 '19 at 00:12
So, if I'm reading this correctly, it goes right above the return address? That makes sense, so what size is the canary ? — , May 01 '19 at 00:23
Yes, so that in case of an overflow it would get overwritten. Before returning from the function, the value on the stack is compared with the original, and if they are not the same, `__stack_chk_fail` is called so it can abort and quit. — hingev, May 01 '19 at 00:28
Okay, so is that why it takes up 8 bits, because it takes a pointer on the frame all its own to protect the return value? — , May 01 '19 at 00:30
It takes 32 bits for x86 and 64 bits for x86-64. So 4 bytes or 8 bytes depending on the architecture. — hingev, May 01 '19 at 00:32
Okay, I'm working with the x86-64 architecture so, 8 bytes. :) — , May 01 '19 at 00:41
@Matthew_J_Barnes Inconsistency in your question: It's 8 *bytes*, not 8 *bits*. Do *not* confuse the two -- it can be very dangerous. — S.S. Anne, May 01 '19 at 01:41
You should read a basic tutorial on assembler or something before trying to decipher anything. Anyone who has read a basic tutorial on 64-bit x86 asm would know instantly that `%rax` is 64 bits, and the same information is carried by the `q` suffix of `mov` as well. — Antti Haapala -- Слава Україні, May 01 '19 at 05:13
@Antti Haapala. . .Yes, I know %rax is 64 bits. . . I read the chapter. . . I'm well aware that movq is a 64 bit carrier. . . . Just as movl is 32 bits and movw is 16 and movb is 8. Why is that even being questioned? — , May 01 '19 at 15:22

score 0 · Accepted Answer · answered May 01 '19 at 00:17

A stack canary is a method of protection against stack smashing attacks, which tend to be common if a buffer overflow is left unchecked. That's why gcc, as configured by default on many distributions, will insert canary checks if a function has an internal buffer that's allocated on the stack.

This can be turned off, using -fno-stack-protector.

Also, the minimum buffer size that triggers gcc to add the canary is set by --param ssp-buffer-size.
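To see the effect of these flags yourself, here is a minimal sketch based on the functions in the question. The file name `canary_demo.c` is hypothetical, and I swapped `sprintf` for `snprintf` so the 12-byte buffer can't be overrun; the exact assembly you get will vary with compiler version and flags.

```c
/* canary_demo.c (hypothetical file name).
 * Compare the generated assembly with and without the protector:
 *   gcc -O1 -S -fstack-protector    canary_demo.c -o with.s
 *   gcc -O1 -S -fno-stack-protector canary_demo.c -o without.s
 * With the protector on, x86-64 Linux output typically loads the canary
 * from %fs:40 and calls __stack_chk_fail on the failure path.
 */
#include <stdio.h>
#include <string.h>

/* Like the book's iptoa, but with an explicit size (an assumption of this sketch). */
void iptoa(char *s, size_t n, const long *p) {
    snprintf(s, n, "%ld", *p);
}

int intlen(long x) {
    long v = x;
    char buf[12];               /* local array: this is what makes gcc add a canary */
    iptoa(buf, sizeof buf, &v); /* output is truncated if it needs more than 11 chars */
    return (int)strlen(buf);
}
```

Only values that fit in 11 characters give the right length here, which is a limitation of the original 12-byte buffer, not of the canary.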

Find out more here

As for why local variables are stored on the stack: well, where else would you store them? You can request that a variable be kept in a register with the register keyword, but it's not a guarantee. The number of registers is limited, far fewer than what the stack can hold. Storing variables in registers is only justified for speed optimizations.

*Storing them in registers is only justified for speed optimizations.* But the OP *did* compile with optimization enabled. Variables *are* stored in registers or optimized away, except of course for the array, and the var that has to be in memory when its address is passed to another function. — Peter Cordes, May 01 '19 at 03:32
I did not compile at all, this was literally an example IN the book Operating Systems a programmers perspective. — , May 01 '19 at 15:19

score 0 · Answer 2 · answered May 01 '19 at 05:10

> the book makes it very clear that you have six registers available to put six variables into

You're reading a book about 32-bit x86. (And the book assumes EBP will be used as a frame pointer, leaving only 6 of the 8 integer regs as truly general purpose)

You're compiling for x86-64 with optimization enabled, which includes -fomit-frame-pointer, so you actually have 15 general-purpose integer registers.

> What I need to know is why the program [function] intlen() puts all of its values onto the stack

That's not quite what's going on. x stays in RDI instead of being spilled to the stack on function entry, like you'd get if you disabled optimization (gcc -O0). Compile without optimization to see a big difference.

The compiler is keeping vars in regs as much as possible, but v and buf have to exist in memory because you pass pointers to them to a non-inline function.

You seem to have disabled inlining of iptoa somehow. Maybe you compiled with only -O1, because you don't have __attribute__((noinline)) on your definition of iptoa. If you enabled full optimization (-O3), you'd see that v is optimized away and you just get a movq %rdi, %rdx to pass x as the 3rd arg to sprintf.

Passing &v to a non-inline iptoa means the memory for v has to be "in sync", because iptoa is allowed to read that memory via the pointer you passed it. See also "escape analysis" - if a pointer to a variable "escapes" the function, the compiler can't optimize it away or do too many weird things with it.
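As a rough illustration of the difference, here is a sketch with two hypothetical helpers, `digits_by_ref` and `digits_by_value` (not from the book). `__attribute__((noinline))` is a GCC/Clang extension, used only so the calls stay visible in the asm. Compiling with `gcc -O2 -S` should show `intlen_ref` storing `v` to the stack and passing its address, while `intlen_val` just leaves `x` in RDI.

```c
#include <stdio.h>

/* Takes a pointer: the caller's variable must exist in memory. */
__attribute__((noinline)) static int digits_by_ref(const long *p) {
    char buf[24];   /* enough for a 64-bit long plus sign and terminator */
    return snprintf(buf, sizeof buf, "%ld", *p);
}

/* Takes the value: the caller can pass it in a register. */
__attribute__((noinline)) static int digits_by_value(long val) {
    char buf[24];
    return snprintf(buf, sizeof buf, "%ld", val);
}

int intlen_ref(long x) {
    long v = x;                 /* address escapes, so v needs a stack slot */
    return digits_by_ref(&v);
}

int intlen_val(long x) {
    return digits_by_value(x);  /* no address taken, nothing forced to memory */
}
```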

IDK why you're passing an integer by reference; you've written code that forces the compiler to use memory for most of its variables. (If it can't inline.)

BTW, you know your function is very inefficient, right? You don't need to calculate every decimal digit with sprintf, just find the first power of 10 that's greater-than the number.

int intlen_fast(long x) {
    unsigned long absx = x;
    unsigned len = 1;      // even 0..9 takes 1 decimal digit
    if (x<0) {
        absx = -absx;      // unsigned abs correctly handles the most-negative 2's complement integer
        len = 2;           // the minus sign
    }

    // don't need to check for overflow of pow10 with 64-bit integers
    // but in general we do to get the right count. (TODO)
    for (unsigned long pow10 = 10; pow10 <= absx ; pow10*=10) {
        len++;
    }
    return len;
}

Doing pow10 *= 10; is significantly more efficient than x /= 10, even with optimized division by a compile-time constant.

For 64-bit unsigned long, this has the very nice property that abs(LLONG_MIN) = 9223372036854775808ULL, and the next highest power of 10 doesn't overflow unsigned long long. (ULLONG_MAX = 18446744073709551615ULL)

If that wasn't the case (like for 32-bit unsigned long in other ABIs), you'd need to check for the special case of absx >= 1000000000 to correctly handle input magnitudes in the range 1000000000 to 2147483648, because 10^10 doesn't fit: 2^32-1 = 4294967295. (Fortunately we don't get an infinite loop, just 2 extra iterations until pow10 wraps to 0xd4a51000, which as an unsigned value is above the magnitude of any signed 32-bit integer. But it's still the wrong answer!)
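The 32-bit wraparound can be reproduced on any platform with fixed-width types. This is only a sketch: `intlen_wrap32` is a hypothetical name, and it is deliberately the unchecked loop, so it returns wrong counts for large magnitudes.

```c
#include <stdint.h>

/* Same loop as intlen_fast, but forced to 32-bit arithmetic and
 * intentionally WITHOUT the absx >= 1000000000 special case. */
int intlen_wrap32(int32_t x) {
    uint32_t absx = (uint32_t)x;
    int len = 1;
    if (x < 0) {
        absx = (uint32_t)(0u - absx);   /* unsigned negation, well-defined */
        len = 2;
    }
    /* pow10 wraps modulo 2^32 after 1000000000:
     * 1000000000 -> 1410065408 -> 1215752192 -> 3567587328 (0xd4a51000) */
    for (uint32_t pow10 = 10; pow10 <= absx; pow10 = (uint32_t)(pow10 * 10u)) {
        len++;
    }
    return len;
}
```

For example, 2000000000 has 10 digits, but the loop runs two extra times on the wrapped values and the function returns 12.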

In general, comparing std::numeric_limits<long>::digits10 with std::numeric_limits<unsigned long>::digits10 in C++ might be useful for detecting at compile-time whether we need an extra check. Or actually not, because digits10 is the binary bit-width times std::log10(2), rounded down.

Maybe a compile-time check (see "How to round down to the nearest power of 10?") on whether LONG_MAX rounded down to a power of 10 is less than ULONG_MAX rounded down to a power of 10, if your compiler can do constant-propagation through floor(log10(ULONG_MAX)).

If you didn't want to worry about the details of pow10 maybe overflowing, it would still be much faster than calling sprintf to just do repeated division by 10 to count digits.

Or maybe do one division by 10, and then loop pow10 upward. That would be safe from overflow / wraparound, and simple. (But you still have to handle negative input specially).
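A minimal sketch of that last idea, assuming nothing about the width of long (`intlen_div_first` is a name I made up for illustration): divide once, then count powers of 10 that fit under the quotient. Since pow10 never exceeds absx / 10 inside the loop, pow10 * 10 never exceeds absx and can't wrap.

```c
/* Count characters needed to print x in decimal, including a minus sign. */
int intlen_div_first(long x) {
    unsigned long absx = (unsigned long)x;
    int len = 1;                       /* 0..9 take one digit */
    if (x < 0) {
        absx = -absx;                  /* unsigned negation, safe for LONG_MIN */
        len = 2;                       /* the minus sign */
    }
    unsigned long limit = absx / 10;   /* the single division */
    /* Each power of 10 that is <= absx/10 means one more digit. */
    for (unsigned long pow10 = 1; pow10 <= limit; pow10 *= 10) {
        len++;
    }
    return len;
}
```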

But anyway, the optimized version from gcc8.3 -O3 does keep all its variables in registers, of course (Godbolt compiler explorer). -fstack-protector-strong has no effect on this function because it doesn't have any arrays.

# gcc8.3 -O3 -fverbose-asm -fstack-protector-strong
intlen_fast(long):
        testq   %rdi, %rdi    # x
        js      .L14        #,
        movl    $1, %eax        #, <retval>
        movl    $1, %edx        #, len
.L15:
        cmpq    $9, %rdi        #, absx
        jbe     .L13      #,
        movl    $10, %eax       #, pow10
.L17:
        leaq    (%rax,%rax,4), %rax     #, tmp95    # pow10 * 5
        addl    $1, %edx        #, len
        addq    %rax, %rax      # pow10             # pow10 *= 10
        cmpq    %rax, %rdi      # pow10, absx
        jnb     .L17      #,
        movl    %edx, %eax      # len, <retval>
.L13:
        ret     
.L14:
        negq    %rdi    # absx
        movl    $2, %eax        #, <retval>
        movl    $2, %edx        #, len
        jmp     .L15      #

(It looks like a missed optimization that gcc sets both EAX and EDX. It should just use RDX inside the loop for pow10 and use len in EAX.)

See the Godbolt link for some test callers that show it works for corner cases like -9, 99, 100, and 101 without off-by-one errors. And for large inputs.
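If you'd rather check locally, something like the following works as a test caller. It is my own sketch, not the code behind the link: `intlen_matches_snprintf` is a hypothetical helper, `intlen_fast` is repeated so the block compiles on its own, and it assumes a 64-bit long as in the x86-64 System V ABI.

```c
#include <stdio.h>

/* Repeated from above; assumes 64-bit long so pow10 cannot wrap. */
int intlen_fast(long x) {
    unsigned long absx = (unsigned long)x;
    unsigned len = 1;
    if (x < 0) {
        absx = -absx;   /* unsigned negation, safe for the most-negative value */
        len = 2;
    }
    for (unsigned long pow10 = 10; pow10 <= absx; pow10 *= 10) {
        len++;
    }
    return (int)len;
}

/* Returns 1 if intlen_fast agrees with the length snprintf would produce.
 * snprintf(NULL, 0, ...) returns the character count without writing anything. */
int intlen_matches_snprintf(long x) {
    return intlen_fast(x) == snprintf(NULL, 0, "%ld", x);
}
```

Calling it on -9, 99, 100, 101, LONG_MAX and LONG_MIN covers the off-by-one and extreme cases.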

Okay, Peter. . . . This is not a book on 32-bit x86, it's a book on x86-64. Secondly, I did not compile anything. Thirdly, I did not even write this program. . . This is a problem written in a book, and a homework for an entry level class on Computer Architecture is based on said problem. Half of the things you mentioned have never been brought up, suggested or used. . . . While your breadth of knowledge is VERY impressive, I'm wondering what its relevance is to my question. — , May 01 '19 at 15:25
@Matthew_J_Barnes: I thought I made the first section of the answer pretty clear: the compiler is putting things in memory only when it has to because the code takes the address and passes it to another function. And if your book is talking about x86-64, then maybe they mean 6 call-clobbered registers? But that's not correct for the x86-64 System V calling convention they're using. There are 6 arg-passing registers plus RAX, R10, and R11 which are also call-clobbered and thus can be freely used without save/restore. — Peter Cordes, May 01 '19 at 17:15
Jesus Christ you're brilliant. . . . . . So, what you're saying is. . . Because I have a variable that gets passed to another function, there is no need to store it in registers but rather the stack pointer, right? Also, can you recommend some text or learning tools to me? I would very much like to work at Silicon Valley before I'm 30. — , May 01 '19 at 19:11
@Matthew_J_Barnes: The compiler always *wants* to keep everything in registers, or optimize away completely. So instead of saying "no need", I'd instead say "no *opportunity*", because you don't do anything with the variable other than pass it *by reference* to another function. If you passed it by value, the value would just get copied to a register. — Peter Cordes, May 01 '19 at 19:19
@Matthew_J_Barnes: If you want to learn more about asm, https://stackoverflow.com/tags/x86/info has some good resources. Looking at compiler output for simple functions like this is a good way to get started, though. [How to remove "noise" from GCC/clang assembly output?](//stackoverflow.com/q/38552116) and especially Matt Godbolt's CppCon2017 talk [“What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid”](https://youtu.be/bSkpMdDe4g4) is great. — Peter Cordes, May 01 '19 at 19:21
