I have a simple program:

#include <stdio.h>

int main() {
    char *c = "message";
    char *z = "message";

    if ( c == z )
        printf("Equal!\n");
    else
        printf("Not equal!\n");
    return 0;
}

I wanted to know why this prints Equal!, even when compiled with optimisations turned off (-O0). This would indicate that both c and z point to the same area of memory, and thus the first mutation of z (for example, changing z[0] to 'a') will be expensive (requiring a copy-on-write).

My understanding of what's happening is that I'm not declaring an array of type char, but rather am creating a pointer to the first character of a string literal. Thus, c and z are both stored in the data segment, not on the stack (and because they're both pointing to the same string literal, c == z is true).

This is different to writing:

char c[] = "message";
char z[] = "message";

if ( c == z ) printf("Equal\n");
else printf("Not equal!\n");

which prints Not equal!, because c and z are both stored in mutable sections of memory (i.e., the stack), and are stored separately so that a mutation of one doesn't affect the other.

My question is, is the behaviour I'm seeing (c == z as true) defined behaviour? It seems surprising that the char *c is stored in the data-segment, despite not being declared as const.

Is the behaviour when I try to mutate char *z defined? Why, if char *c = "message" is put in the data segment and is thus read-only, do I get a bus error rather than a compiler error? For example, if I do this:

char *c = "message";
c[0] = 'a';

I get:

zsh: bus error  ./a.out

although it compiles happily.

Any further clarification of what's happening here and why would be appreciated.

simont
  • You **can't** mutate these; that's undefined behaviour. – Oliver Charlesworth May 31 '13 at 01:42
  • possible duplicate of [Difference between char\* and const char\*?](http://stackoverflow.com/questions/9834067/difference-between-char-and-const-char) – Oliver Charlesworth May 31 '13 at 01:43
  • @OliCharlesworth The accepted answer for that question states that: "`char *name` You can change the char to which name points, and also the char at which it points", ie that `name` is not read-only. You're saying that trying to write to `name` would be undefined, which isn't consistent (ie, I'm missing something). What's the difference? – simont May 31 '13 at 01:46
  • In the first case it's the literal "message" that's stored with the program -- maybe in write-protected storage, maybe not. The pointers c and z are in automatic storage. You may change the values of the POINTERS c and z if you wish, but you MUST NOT attempt to change the storage they point to (when they are pointing to the string literal). – Hot Licks May 31 '13 at 01:46
  • what really boggles the mind is in Objective-C: `@"string" == [NSString stringWithString:@"string"]` but it is fundamentally the same answer – Grady Player May 31 '13 at 01:54
  • The compiler is free to do whatever it wants with the immutable strings. It might make the pointers point to the same address, or it might not. If you have a string "sage", it might be at a new address, or it might just point into the middle of "message". That's entirely up to the compiler, and if your code relies on it being done in any particular way, that's bad code. – Lee Daniel Crocker May 31 '13 at 02:03
  • @simont - It's a quirk of C compilers that string literals are `char *` and not `const char *`, but still can't be written to. It's a special case that's explicitly documented in the standard(s). I believe it's there for backwards compatibility. – detly May 31 '13 at 02:29

4 Answers

"the first mutation of z (for example, changing z[0] to 'a') will be expensive (requiring a copy-on-write)."

Not "expensive"; try "undefined". String literals are effectively constants: attempting to modify one is undefined behaviour.
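If a modifiable string is what's actually wanted, one option is to copy the literal into storage you own and change the copy. The following is only a sketch of that idea; the helper name `modifiable_copy` is made up for illustration and is not a standard function.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a heap-allocated copy of s that the
 * caller may modify and must release with free(). Returns NULL if
 * the allocation fails. */
char *modifiable_copy(const char *s)
{
    assert(s != NULL);

    size_t n = strlen(s) + 1;   /* +1 for the terminating '\0' */
    char *copy = malloc(n);
    if (copy != NULL)
        memcpy(copy, s, n);     /* copies the characters and the '\0' */
    return copy;
}
```

After `char *z = modifiable_copy("message");` (and a NULL check), `z[0] = 'a';` is well-defined because it writes to the copy, not to the literal. Declaring pointers to literals as `const char *` also lets the compiler reject an accidental write at compile time instead of leaving it to fail at run time.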

Elazar

The C 2011 Standard, section 6.4.5 (String literals), paragraph 7:

It is unspecified whether these arrays [string literals] are distinct provided their elements have the appropriate values. If the program attempts to modify such an array, the behavior is undefined.

This means that if two string literals have the same value, the compiler may place them at the same location in memory or at different locations. The choice is the compiler's, so code must not rely on either outcome.
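In practice this means `c == z` only answers "do these point at the same object?", which is unspecified for identical literals. To ask "do these hold the same text?", compare the contents instead. A minimal sketch, where the helper name `same_text` is an invented name for this illustration:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: returns 1 when both strings contain the same
 * characters, regardless of whether they share an address. */
int same_text(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b) == 0;
}
```

With `const char *c = "message";` and `const char *z = "message";`, `same_text(c, z)` is 1 whether or not the compiler merged the two literals, while `c == z` may be true or false depending on the compiler.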

Bill Lynch

One of the steps a C compiler typically performs is to collect all the string constants in the code. It will usually store only one copy of any immutable string, even if the string appears in the code twice (the standard permits this but does not require it). So in your example, you have "message" twice -- the compiler stores m e s s a g e \0 once in the file (in the read-only data section) and initializes both of those pointers to point at that copy.

Try making the strings different, and the pointers should now be different too.

Also, try this: print your executable (if it's called a.out, run cat -v a.out). You will see the strings "message", "Equal!" and "Not equal!" sitting in the executable.

Update: (deleted because it was wrong.) Here's the code generated for the array version (char c[] and char z[]):

8048449:       c7 44 24 1c 6d 65 73 73   movl   $0x7373656d,0x1c(%esp)
8048451:       c7 44 24 20 61 67 65 00   movl   $0x656761,0x20(%esp)
8048459:       c7 44 24 24 6d 65 73 73   movl   $0x7373656d,0x24(%esp)
8048461:       c7 44 24 28 61 67 65 00   movl   $0x656761,0x28(%esp)
8048469:       b8 70 85 04 08            mov    $0x8048570,%eax

It's creating this hex string twice (my machine is little-endian):

6d 65 73 73 61 67 65 00
m  e  s  s  a  g  e  \0

so you're right -- with the array declarations, it is putting the string in mutable memory (on the stack).
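For the array declarations (`char c[] = "message";`), each array gets its own copy of the literal's bytes, so writing to one is well-defined and leaves the other alone. A small sketch of that; the function name `arrays_are_independent` is made up for this illustration:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical demo: returns 1 if modifying one array left the
 * other array unchanged. */
int arrays_are_independent(void)
{
    char c[] = "message";   /* array: the literal's bytes are copied into c */
    char z[] = "message";   /* a second, separate copy */

    assert(sizeof c == 8);  /* 7 characters plus the terminating '\0' */

    c[0] = 'M';             /* fine: c is an ordinary modifiable array */

    return strcmp(c, "Message") == 0 && strcmp(z, "message") == 0;
}
```

The same assignment through `char *c = "message";` would write to the literal itself, which is the undefined behaviour behind the bus error in the question.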

Robert Martin
  • Your update is wrong. It is perfectly okay to modify the arrays `c[]` and `z[]`. They are no longer connected to the string literals that were used to initialize them. In the case of arrays, the string literal contents are copied. The OP's statement that you quoted is correct. – Benjamin Lindley May 31 '13 at 02:14
  • A compiler is not required to make two copies of a string literal be distinct, but it can. – Bill Lynch May 31 '13 at 04:36

The literal "message" is a constant, typically stored in read-only memory, and the compiler usually retains only one copy. I don't know about the latest standards, but this used to vary from compiler to compiler.

Lawrence Dol