
According to the c-faq, the compiler does different things for a[i] depending on whether a is an array or a pointer. Here's an example from the c-faq:

char a[] = "hello";
char *p = "world";

Given the declarations above, when the compiler sees the expression a[3], it emits code to start at the location ``a'', move three past it, and fetch the character there. When it sees the expression p[3], it emits code to start at the location ``p'', fetch the pointer value there, add three to the pointer, and finally fetch the character pointed to.

But I was also told that when dealing with a[i], the compiler tends to convert a (which is an array) to a pointer-to-array. So I want to look at the assembly code to find out which is right.

EDIT:

Here's the source of this statement: the c-faq (http://c-faq.com/aryptr/aryptrequiv.html). Note this sentence:

an expression of the form a[i] causes the array to decay into a pointer, following the rule above, and then to be subscripted just as would be a pointer variable in the expression p[i] (although the eventual memory accesses will be different, "

I'm pretty confused by this: since a has decayed to a pointer, why does he say that the "memory accesses will be different"?

Here's my code:

// array.cpp
#include <cstdio>
using namespace std;

int main()
{
    char a[6] = "hello";
    char *p = "world";
    printf("%c\n", a[3]);
    printf("%c\n", p[3]);
}

And here's part of the assembly code I got using g++ -S array.cpp

    .file   "array.cpp" 
    .section    .rodata
.LC0:
    .string "world"
.LC1:
    .string "%c\n"
    .text
.globl main
    .type   main, @function
main:
.LFB2:
    leal    4(%esp), %ecx
.LCFI0:
    andl    $-16, %esp
    pushl   -4(%ecx)
.LCFI1:
    pushl   %ebp
.LCFI2:
    movl    %esp, %ebp
.LCFI3:
    pushl   %ecx
.LCFI4:
    subl    $36, %esp
.LCFI5:
    movl    $1819043176, -14(%ebp)
    movw    $111, -10(%ebp)
    movl    $.LC0, -8(%ebp)
    movzbl  -11(%ebp), %eax
    movsbl  %al,%eax
    movl    %eax, 4(%esp)
    movl    $.LC1, (%esp)
    call    printf
    movl    -8(%ebp), %eax
    addl    $3, %eax
    movzbl  (%eax), %eax
    movsbl  %al,%eax
    movl    %eax, 4(%esp)
    movl    $.LC1, (%esp)
    call    printf
    movl    $0, %eax
    addl    $36, %esp
    popl    %ecx
    popl    %ebp
    leal    -4(%ecx), %esp
    ret 

I cannot figure out the mechanism of a[3] and p[3] from the code above. For example:

  • Where was "hello" initialized?
  • What does $1819043176 mean? Maybe it's the memory address of "hello" (the address of a)?
  • I'm sure that "-11(%ebp)" means a[3], but why?
  • In "movl -8(%ebp), %eax", the content of pointer p is stored in EAX, right? So $.LC0 means the content of pointer p?
  • What does "movsbl %al,%eax" mean?
  • And note these 3 lines of code:
    movl $1819043176, -14(%ebp)
    movw $111, -10(%ebp)
    movl $.LC0, -8(%ebp)

    The last one uses "movl", so why didn't it overwrite the content of -10(%ebp)? (I know the answer now :) The addresses increase upward, and "movl $.LC0, -8(%ebp)" only overwrites the bytes at {-8, -7, -6, -5}(%ebp).)

I'm sorry, but I'm totally confused about the mechanism, as well as the assembly code...

Thank you very much for your help.

ibread
  • 1
    I think your statement "the compiler tends to convert a (which is an array) to a pointer-to-array" is not correct. Please tell me who said this to you? – Prasoon Saurav Jan 15 '10 at 16:28
  • 9
    +1, for trying it out yourself before asking. – MAK Jan 15 '10 at 16:33
  • 1
    Not pointer-to-array, pointer-to-char. – bmargulies Jan 15 '10 at 16:41
  • 2
    +1 for checking out the ASM. *You have started well, grasshopper...* – Paul Nathan Jan 15 '10 at 16:46
  • @Prasoon Saurrav I found the source of my statement and found there're minor differences between mine and his. It's here: http://c-faq.com/aryptr/aryptrequiv.html And note this sentence: " an expression of the form a[i] causes the array to decay into a pointer, following the rule above, and then to be subscripted just as would be a pointer variable in the expression p[i] (although the eventual memory accesses will be different, " I'm pretty confused of this: since a has decayed to pointer, then why does he mean about "memory accesses will be different?" – ibread Jan 15 '10 at 17:25
  • "although the eventual memory accesses will be different" probably just means that the assembler code for e.g. a[3]=0 will look quite different from p[3]=0. The faq author may have been in mind of the older 'B' language, in which they were identical - when you declared a[10] in B, the compiler would allocate space for 10 ints somewhere, and create a variable called 'a' initialized to point at them. So in fact it was the same thing as a pointer even under the hood, unlike in C. (in fact, I seem to recall in B you could assign to 'a' and change the pointer, if you really wanted to). – greggo Dec 12 '13 at 16:39

4 Answers

5

a is a pointer to an array of chars. p is a pointer to a char which, in this case, happens to point at a string literal.

movl    $1819043176, -14(%ebp)
movw    $111, -10(%ebp)

Initializes the local "hello" on the stack (that's why it is referenced through ebp). Since "hello" plus its terminating '\0' takes more than 4 bytes, it takes two instructions: a 4-byte movl and a 2-byte movw.

movzbl  -11(%ebp), %eax
movsbl  %al,%eax

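References a[3]: a starts at -14(%ebp), so a[3] is the byte at -11(%ebp). movzbl loads that byte into eax, and movsbl then sign-extends it to a 32-bit int, because a char argument is promoted to int when it is passed to printf (my x86-fu is a bit rusty).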

movl -8(%ebp), %eax does indeed reference the p pointer.

.LC0 references a "relative memory" location: the string "world" gets a fixed memory location once the program is loaded in memory, and $.LC0 is that address.

movsbl %al,%eax means "move with sign extension, byte to long": it copies the byte in al into eax and fills the upper 24 bits with copies of that byte's sign bit. al is the lowest byte of the register eax.

jldupont
  • So you mean a is also a pointer? But I was told that the type of a is array. – ibread Jan 15 '10 at 17:28
  • @ibread: when it gets down to the assembly level, there is no concept of array really, just accessible memory through pointers etc. – jldupont Jan 15 '10 at 17:33
  • 1
    ... and why the "drive-by down-vote without comment", please?? – jldupont Jan 15 '10 at 18:00
  • 1
    @jldupont but I think maybe it's better to call "a" as the name of block of memory which contains "hello"? after all, there's no dereference to "a" in assembly level. – ibread Jan 15 '10 at 18:15
  • I'm confused.... what does "drive-by down-vote without comment" mean? I did not do anything but up-vote your answer... – ibread Jan 15 '10 at 18:16
  • @jldupont I think I'm clearer after reading your explanation. And I've added another 2 questions in the original post, would you plz show me the answer plz? thank you in advance~ – ibread Jan 15 '10 at 18:21
  • 1
    @ibread: I don't think the "drive-by down-vote" comment was directed at you. It was directed at whoever downvoted this answer. – Fred Larson Jan 15 '10 at 18:22
  • @ibread: somebody down-voted my contribution **without** providing an explanation as to why. This is unfortunately quite common on SO... and childish. We are here as a community, trying to better ourselves. If folks do not explain **why** my contribution is faulty, then we/I cannot learn. – jldupont Jan 15 '10 at 18:24
  • @ibread: you cannot keep adding sub-questions to your post. This isn't how SO works. Please post another question to let others have the chance to contribute further. Also, it is good practice to accept an answer. Cheers. – jldupont Jan 15 '10 at 18:25
  • @jldupont Thank you very much for your remind. But... my further question is related to this one, and ... should I copy all the content in this post to another one? And, I'm still confused about the functionality of " movsbl %al,%eax". Is it necessary since p[3] has already been retrieved via "movzbl (%eax), %eax"? – ibread Jan 15 '10 at 18:34
  • @ibread: I'll make an exception ;-) (just kidding of course). "movsbl %al,%eax" transfers a single 8bit byte to the "eax" register and zeroes-out the rest of the register (if memory serves me right)... in other words, a char. It facilitates working on this particular "char" from an assembly/machine code point of view. Was that the question? – jldupont Jan 15 '10 at 18:38
  • "`a` is a pointer to an array of chars." NO. `a` is NOT a pointer to an array of chars. A pointer to an array of chars would be like `int (*a)[5]`. `a` *is* an array of chars. – newacct Oct 28 '13 at 23:18
4

Getting on the language side of this, since the assembler side has already been handled:

Note this sentence: " an expression of the form a[i] causes the array to decay into a pointer, following the rule above, and then to be subscripted just as would be a pointer variable in the expression p[i] (although the eventual memory accesses will be different, " I'm pretty confused of this: since a has decayed to pointer, then why does he mean about "memory accesses will be different?

This is because, after the decay, the access is the same for the array (now a pointer value) and for the pointer. The difference is how that pointer value is obtained in the first place. Let's look at an example:

char c[1];

char cc;
char *pc = &cc;

Now, you have an array. This array does not take any storage other than one char! There is no pointer stored for it. And you have a pointer that points to a char. The pointer takes the size of one address, and you have one char that the pointer points to. Now let's look at what happens in the array case to get the pointer value:

c[0] = 'A';
// #1: equivalent: *(c + 0) = 'A';
// #2: => 'c' appears not in address-of or sizeof 
// #3: => get address of "c": This is the pointer value P1

The pointer case is different:

pc[0] = 'A';
// #1: equivalent: *(pc + 0) = 'A';
// #2: => pointer value is stored in 'pc'
// #3: => thus: read address stored in 'pc': This is the pointer value P1

As you see, in the array case we don't need to read from memory to get the pointer value to which the index (in this case a boring 0) is added, because the address of the array already is the pointer value needed. In the pointer case, the pointer value we need is stored in the pointer: we need one read from memory to get that address.

After this, the path is equal for both:

// #4: add "0 * sizeof(char)" to P1. This is the address P2
// #5: store 'A' to address P2

Here is the assembler code generated for the array and the pointer case:

        add     $2, $0, 65  ; write 65 into r2
        stb     $2, $0, c   ; store r2 into address of c
# pointer case follows
        ldw     $3, $0, pc  ; load value stored in pc into r3
        add     $2, $0, 65  ; write 65 into r2
        stb     $2, $3, 0   ; store r2 into address loaded to r3

We can just store 65 (ASCII for 'A') at the address of c (which will be known already at compile or link time when it is global). For the pointer case, we will first have to load the address stored by it into register 3, and then write the 65 to that address.

Johannes Schaub - litb
2

While it is true that arrays are not pointers, they behave very similarly. In both cases the compiler internally stores an address to a typed element, and in both cases there can be one, or more than one element.

In both arrays and pointers, when dereferenced by the [] operator, the compiler evaluates the address of the element you are indexing to by multiplying the index by the size of the data type and adding it to the address of the pointer or array.

The fundamental difference between pointer and arrays is that an array is essentially a reference. Where it is legal to initialize a pointer to null, or change the value that a pointer stores, arrays cannot be null, and they cannot be set to other arrays; they are in essence constant pointers that cannot be set to null.

Additionally it is possible for arrays to be allocated on the stack, and that is not possible for pointers (although pointers can be set to addresses on the stack, but that can get ugly).

Beanz
  • 1
    "I wanted the structure not merely to characterize an abstract object but also to describe a collection of bits that might be read from a directory. Where could the compiler hide the pointer to name that the semantics demanded? Even if structures were thought of more abstractly, and the space for pointers could be hidden somehow, how could I handle the technical problem of properly initializing these pointers when allocating a complicated object, perhaps one that specified structures containing arrays containing structures to arbitrary depth? ... – Johannes Schaub - litb Jan 15 '10 at 16:50
  • 1
    ... The solution constituted the crucial jump in the evolutionary chain between typeless BCPL and typed C. It eliminated the materialization of the pointer in storage, and instead caused the creation of the pointer when the array name is mentioned in an expression. The rule, which survives in today's C, is that values of array type are converted, when they appear in expressions, into pointers to the first of the objects making up the array." (Dennis M. Ritchie on History of C: http://cm.bell-labs.com/cm/cs/who/dmr/chist.html) – Johannes Schaub - litb Jan 15 '10 at 16:52
  • (Your answer sounds like a `char p[1];` needs more than 1 byte of storage, because you say the compiler would store the address of `p` into a "constant pointer", which is wrong indeed). – Johannes Schaub - litb Jan 15 '10 at 17:07
  • You "array is essentially a reference" "constant pointers that cannot be set to null" Wikipedia "Once a reference is created, it cannot be later made to reference another object; it cannot be reseated. This is often done with pointers. References cannot be null, whereas pointers can" ... humour abounds. – Alex Brown Jan 15 '10 at 17:27
0

These definitions look similar, but are in reality quite different.

Assume your arrays are declared inside a function:

void f()
{
    char a[] = "hello";
    char *p = "world";
}

In the first case 'a' decays to a const pointer that points at 6 chars on the STACK. In the second case 'p' is a non-const pointer that points at 6 chars in the CONST region (data segment).

It's quite legal to write:

a[3] = 'L';

but

p[3] = 'L';

looks correct, but is undefined behavior and will typically cause a memory violation, because the array of characters is not on the stack, but in a read-only section.

Furthermore,

a++

is illegal ('a' decays to a const pointer which is an r-value), but

p++

is legal (p is an l-value).

resigned