
Given the code snippet:

#include <stdio.h>

int main(void)
{
    printf("Val: %d", 5);
    return 0;
}

is there any guarantee that the compiler would store "Val: %d" and '5' contiguously? For example:

+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| ... |  %d | ' ' | ':' | 'l' | 'a' | 'V' | '5' | ... |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+
      ^                                   ^     ^
      |           Format String           | int |

Exactly how are these parameters allocated in memory?

Furthermore, does the printf function access the int relative to the format string or by absolute value? So for example, in the data

+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| ... |  %d | ' ' | ':' | 'l' | 'a' | 'V' | '5' | ... |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+
      ^                                   ^     ^
      |           Format String           | int |

when the function encounters %d, would there already be a stored memory address for the first parameter of the function which would be referenced, or would the value be calculated relative to the first element of the format string?

Sorry if I'm being confusing. My primary goal is to understand format string exploits, where the user is allowed to supply the format string, as described in this document:

http://www.cis.syr.edu/~wedu/Teaching/cis643/LectureNotes_New/Format_String.pdf

My concerns arise from the attack described on pages 3 and 4. I figured that the %x's are there to skip the 16 bits that the string takes up, which would indicate that the function allocates contiguously and references relatively. But other sources indicate that there is no guarantee that the compiler allocates contiguously, and I was concerned that the paper was a simplification.

BaseZen
Mikey G
  • Well, for one thing, the format string isn't stored on the stack. – Ross Ridge Aug 02 '15 at 02:31
  • Thanks. Fixed the question. – Mikey G Aug 02 '15 at 02:33
  • Those lecture notes are terrible. All the world is not an i386. As far as C is concerned, there might not even be a stack. – user464502 Aug 02 '15 at 03:24
  • I'm reading these notes now http://crypto.stanford.edu/cs155/papers/formatstring-1.2.pdf which are almost definitely better. – Mikey G Aug 02 '15 at 03:27
  • It looks like the lecture notes referenced are taken almost directly from the 2001 paper. Those are still assuming 386 architecture. There is probably an analogous attack on AMD64, but the paper doesn't address that. – user464502 Aug 02 '15 at 03:40
  • I'm having a terrible time with this statement on page 11: "Our format string is usually located on the stack itself, so we already have near to full control over the space, where the format string lies." ???? When would that *ever* happen? The paper seems to hinge on this. But the x86_64 assembly is obvious: `.section .rodata: .LC0: .string "%08X %08X etc.\n" ` and then in the instruction stream: `movl $.LC0, %edi; movl $0, %eax; call printf` – BaseZen Aug 02 '15 at 05:09
  • `printf` does not allocate any memory. The issue of where `"Val: %d"` and `5` are stored has nothing to do with printf. – M.M Aug 02 '15 at 05:29
  • Note in addition to answers: the C specification goes to **great** lengths to **avoid** specifying **anything** about the layout of arguments in a varargs function call. When you read it, it's almost pathological how badly they wanted to avoid letting you make such an assumption. – Cort Ammon Aug 02 '15 at 07:03
  • This is all implementation-defined. There are no guarantees. Note that modern calling conventions don't pass parameters on the stack any more. (And don't even get started on architectures like 68K which have dedicated address registers and integer registers.) – Raymond Chen Aug 02 '15 at 15:55

2 Answers


is there any guarantee that the compiler would store "Val: %d" and '5' contiguously

It's virtually guaranteed they won't be. The 5 is small enough to be embedded right in the instruction stream as an immediate operand rather than loaded through a memory address (something like movl $5, %eax, possibly followed by a push onto the stack), whereas the string literal will be laid out in the read-only data area of the executable image and referenced via a pointer. That is the compile-time layout of the executable image.

Unless you mean the runtime layout of the stack, in which case yes: with a stack-based calling convention, the word-sized pointer to that string and the word-sized constant 5 will be next to each other. But the order is probably the reverse of what you expect; study 'C function calling convention'.

[Later edit: Running some code samples with -S (output assembly) now; I'm reminded that with light register usage in the caller (i.e. CPU registers can be overwritten without harm), and few arguments to the called function, the arguments can be passed entirely via registers to save instructions and memory. So the layout of the stack is actually tricky to predict, even if the attacker had access to the source code. Especially with gcc -O2, which collapsed my main -> my_function -> printf function sequence into main -> printf]

Most exploits employ stack overruns, since malicious code runs into a brick wall when it tries to modify memory in the aforementioned read-only data area: the OS aborts the process.

The behavior of printf is peculiar in that the format string is like a miniature computer program that tells printf to fetch another argument (from the stack or from a register, depending on the calling convention) for every '%' format specifier it finds. If those arguments were never in fact passed, or were of different sizes, printf will blindly read registers and portions of the stack it shouldn't, and may reveal data further up the stack (in the frames of the calling functions) where private data may lie. If the first argument to printf is a constant, a compiler can at least warn you when subsequent arguments don't match the '%' specifiers, but when it's a variable, all bets are off.

printf is awful from a security perspective and is computationally intensive, but very powerful and expressive. Welcome to C. :-)

2nd later edit: Now, to your first question in the comments... as you can see, your terminology and perhaps your thoughts were a bit garbled. Study the following to get a sense of what's going on. Don't worry about pointers to strings yet. This was compiled with gcc 4.8.2 on 64-bit Linux 3.13 with no flags. Note how the excess format specifiers make printf read argument slots that were never filled, revealing arguments that were passed in a previous function call.

/* Do not compile this at home. */
#include <stdio.h>

/* Deliberately broken: eight conversions, zero arguments (undefined behavior). */
void second(void) {
  printf("%08X %08X %08X %08X %08X %08X %08X %08X\n");
}

/* Receives eight recognizable values and ignores them. */
void first(int a, int b, int c, int d, int e, int f, int g, int h) {
  second();
}

int main(int argc, char **argv) {
  first(0xDEEDC0DE, 0x1EADBEEF, 0x11BEDEAD, 0xCAFAF000, 0xDAFEBABE, 0xAACEBACE, 0xE1ED1EAA, 0x10F00FAA);
  return 0;
}

Two back-to-back runs, stdout output:

1EADBEEF 11BEDEAD CAFAF000 DAFEBABE AACEBACE 75F83520 00400568 88B151C8

1EADBEEF 11BEDEAD CAFAF000 DAFEBABE AACEBACE 8B4CBDC0 00400568 7BB841C8

Pete Becker
BaseZen
  • So let's say there was a printf statement that allowed the user to supply the formatting string with no other arguments. How could we get a %s format character to read the value at any given memory address given that the format string is a contiguous pointer? If this string is only contiguously represented as a pointer then we would be only be able to read the format string itself. Once we compensated for any buffers and variables in the format string would there be any way to get the %s to read a particular series of bits and interpret it as a pointer so it would read those? – Mikey G Aug 02 '15 at 03:00
  • What do you think you mean by "contiguously represented as a pointer"? – user464502 Aug 02 '15 at 03:20
  • I mean a pointer that is directly next to the other parameters. – Mikey G Aug 02 '15 at 03:22
  • I'm still not sure what that means. But a pointer value might not even be in memory, it could just be in a register. I don't think it's meaningful to think of registers as being "next to" each other on most architectures. – user464502 Aug 02 '15 at 03:27
  • Agreed, follow-up question in comments is very hard to interpret. Edited answer to demonstrate stack layout and exploit concepts, without getting into the unrealistically powerful and precise exploit OP seems to be referring to. – BaseZen Aug 02 '15 at 04:38
  • Those arguments aren't all passed on the stack on linux AMD64. – user464502 Aug 02 '15 at 04:52
  • Also agreed, see my 3rd paragraph – BaseZen Aug 02 '15 at 05:19
  • In response to 2nd later edit, so the 8th set of characters in each print would be the pointer to the string in second()? – Mikey G Aug 02 '15 at 06:09
  • No, that doesn't seem to show up on the stack at all -- it's passed via a register. Adding this line: `printf("String pointer: %p\n", (char *)"%08X %08X %08X %08X %08X %08X %08X %08X");` anywhere in the code strongly implies that the *7th* word is *some* kind of memory address, but not the string pointer. – BaseZen Aug 02 '15 at 06:18
  • Could it be the location of the instruction pointer in the calling function? I read that was stored after the arguments on the stack earlier when reading about function calling. – Mikey G Aug 02 '15 at 06:27
  • Here is a great link, with illustrations and examples: [The details of C Function Stack](http://www.tenouk.com/Bufferoverflowc/Bufferoverflow2a.html) – paulsm4 Aug 02 '15 at 07:09

Interesting question. Here is the assembly output for one test program built two ways: with 32-bit MSVC and with 64-bit GCC.

Test program:

/*
 * Sample output:
 * A
 * B: 49, 2, 5.000000
 */
#include <stdio.h>

int main(int argc, char *argv[]) {
  printf ("A\n");
  printf ("B: %d, %c, %f\n", 0x31, 0x32, 5.0);
  return 0;
}

MSVC/32-bit assembly (cl /Fa):

_DATA   SEGMENT
$SG2938 DB  'A', 0aH, 00H
    ORG $+1
$SG2939 DB  'B: %d, %c, %f', 0aH, 00H
...
CONST   SEGMENT
__real@4014000000000000 DQ 04014000000000000r   ; 5
...
    push    OFFSET $SG2938
    call    _printf
...
    movsd   xmm0, QWORD PTR __real@4014000000000000
    movsd   QWORD PTR [esp], xmm0
    push    50                  ; 00000032H
    push    49                  ; 00000031H
    push    OFFSET $SG2939
    call    _printf

GCC/64-bit assembly (gcc -S):

.LC0:
        .string "A"
.LC1:
        .string "B: %d, %c, %f\n"
...
        movl    %edi, -4(%rbp)
        movq    %rsi, -16(%rbp)
        movl    $.LC0, %edi
        call    puts                // You'll notice that GCC substitutes "puts()" for "printf()" here
...
        // Also notice the absence of "push": we're passing arguments in registers, instead of on the stack
        movl    $.LC1, %eax
        movsd   .LC2(%rip), %xmm0   // 5.0
        movl    $50, %edx
        movl    $49, %esi
        movq    %rax, %rdi          // pointer to the format string
        movl    $1, %eax            // %al = number of vector registers used by this variadic call
        call    printf
paulsm4