3

I am taking CS50, a MOOC from Harvard. In one of the first lectures we learned about variables of different data types: int, char, etc.

What I understand is that a statement such as int a = 5 (say, inside the main function) reserves a number of bytes of memory on the stack (4 on most platforms) and puts there a sequence of zeros and ones that represents 5.

The same sequence of zeros and ones also could mean a certain character. So somebody needs to keep track of the fact that the sequence of zeros and ones in the memory place reserved for a is to be read as an integer (and not as a character).

The question is: who keeps track of it? Is it the computer's memory, by sticking a tag on this place in memory saying "hey, whatever you find in these 4 bytes, read it as an integer"? Or is it the C compiler, which knows (from the type int of a) that when my code asks it to do something with the value of a (more precisely, to produce machine code that does something with it), it needs to treat this value as an integer?

I would really appreciate an answer tailored to a C beginner.

zesy
  • 1
    While writing this question and thinking about how I would design a computer, I came to the conclusion that the C compiler would definitely need to keep track of how to interpret the content of a chunk of memory. But I would still like confirmation from somebody who knows for sure. – zesy Dec 24 '17 at 11:08
  • 2
    Nobody keeps track of it. The compiler helps you somewhat by doing type checking, but with a bit of effort the contents of a memory cell can be interpreted any way you like. Hence it is the responsibility of the programmer to choose the correct interpretation. Most of the time the intended interpretation matches the declared data type within your program. – Ronald Dec 24 '17 at 11:10
  • You (the human programmer) keep track of and take care of variables. At runtime, variables don't exist anymore; the process has memory locations (some of them corresponding to variables in the source code). – Basile Starynkevitch Dec 24 '17 at 11:36
  • There *have* been computers with [tagged memory](https://en.wikipedia.org/wiki/Tagged_architecture) where the memory cells stored the type alongside the value. – Bo Persson Dec 24 '17 at 12:18

5 Answers

2

For the main part it's the C compiler that keeps track.

During the compilation process the compiler builds up a large data structure called the parse tree. It also keeps track of all variables, functions, types, ... everything with a name (i.e. identifier); this is called the symbol table.

The nodes of both the parse tree and the symbol table have an entry in which the type is recorded. They keep track of all the types.

With mainly these two data structures in hand, the compiler can check whether your code violates the type rules. This allows the compiler to warn you if you use incompatible values or variable names.

C does allow implicit conversion between types. You can, for example, assign an int to a double. But in memory the same value is stored as completely different bit patterns in the two types.

In earlier (higher abstraction level) phases of the compilation process, the compiler does not deal with bit patterns yet (or too much), and makes conversions and checks at a higher level.

But during assembly code generation, the compiler finally needs to figure it all out in bits. So for an int to double conversion:

int    i = 5;
double d = i; // Conversion.

the compiler will generate code to make this conversion happen.

In C, however, it's very easy to make mistakes and mess things up. This is because C is not a very strongly typed language and is rather flexible, so the programmer also needs to be careful.

Because C does not keep track of types after compilation, a running program can often silently continue with the wrong data after executing one of your mistakes. And if you're 'lucky' enough that the program crashes, the error message you get is not (very) informative.

meaning-matters
2

With the C language, it's the compiler.

At run-time, there's only the 32 bits = 4 bytes on the stack.

You ask "The computer's memory by sticking a tag to this place...": that's impossible (with current computer architectures - thanks for the hint from @Ivan). The memory itself is just 8 bits (each being 0 or 1) per byte. There is no place in memory where a memory cell could be tagged with additional info.

There are other languages (e.g. LISP, and to some degree also Java and C#) that store an integer as a combination of the 32 bits for the number plus a few bits or bytes containing a bit-encoded tag saying that this is an integer. So they need e.g. 6 bytes for a 32-bit integer. But with C, that's not the case. You need knowledge from the source code to correctly interpret the bits found in memory - they don't explain themselves. And there have been special architectures that supported tagging in hardware.

Ralf Kleberhoff
  • 2
    "that's impossible" - it is not; back in the day there were computers with typed or "tagged" memory. See [Tagged_architecture](https://en.wikipedia.org/wiki/Tagged_architecture). In those architectures the processor could check whether, for example, the value used for a conditional jump is actually a boolean. –  Dec 24 '17 at 13:49
  • @Ivan Yes, I worked with a Symbolics Lisp Machine in the 1990s, which had such a tagged architecture, but all current memory subsystems organize the memory as homogeneous arrays of fixed-size words (typically bytes). I'll edit my answer. – Ralf Kleberhoff Dec 24 '17 at 14:18
2

In C, memory is untyped; no information beyond its value is stored there. All type information is computed at compile time from the type of an expression (a variable name, a value computation, a pointer dereference, etc.). This computation depends on the information the programmer provides through declarations (also in headers) or casts. If that information is wrong, e.g. because a function prototype's parameters are declared wrong, all bets are off. The compiler warns about or prevents mis-declarations in the same "translation unit" (file with headers), but between translation units there are no (or not many?) protections. That's one reason why C has headers: they share common type information between translation units.

C++ keeps this idea but additionally offers run time type information (as opposed to compile time type information) for polymorphic types. It's obvious that every polymorphic object must carry extra information somewhere (not necessarily close to the data though). But that is C++, not C.

Peter - Reinstate Monica
1

You have a stack pointer, which holds the absolute address of the topmost stack frame in memory.

For a given scope of execution, the compiler knows where each variable is located relative to this stack pointer and emits accesses to these variables as an offset from the stack pointer. So it is primarily the compiler that maps the variables, but it's the processor that applies this mapping.

You can easily write programs which compute or remember a memory address which used to be valid, or is just outside of a valid region. The compiler doesn't stop you from doing so, only higher level languages with reference counting and strict boundary checks do at runtime.

Ext3h
  • RE "For a given scope of execution, the compiler knows which variable is located relative to this stack pointer": That is so not true for function arguments coming from a different translation unit ;-). The compiler totally depends on what you tell it. *Usually* that information is communicated through a common header for caller and callee which describes the arguments, but that is just a convention. C++ compilers encode the arguments in the function names in order to facilitate function overloading, so you would create undefined symbols. But in C a function name doesn't carry any information. – Peter - Reinstate Monica Dec 24 '17 at 11:17
0

The compiler keeps track of all type information during translation, and it will generate the proper machine code to deal with data of different types or sizes.

Let's take the following code:

#include <stdio.h>

int main( void )
{
  long long x, y, z;

  x = 5;
  y = 6;
  z = x + y;

  printf( "x = %lld, y = %lld, z = %lld\n", x, y, z );
  return 0;
}

After running that through gcc -S, the assignment, addition, and print statements are translated to:

    movq    $5, -24(%rbp)
    movq    $6, -16(%rbp)
    movq    -16(%rbp), %rax
    addq    -24(%rbp), %rax
    movq    %rax, -8(%rbp)
    movq    -8(%rbp), %rcx
    movq    -16(%rbp), %rdx
    movq    -24(%rbp), %rsi
    movl    $.LC0, %edi
    movl    $0, %eax
    call    printf
    movl    $0, %eax
    leave
    ret

movq is the mnemonic for moving values into 64-bit words ("quadwords"). %rax is a general-purpose 64-bit register that's being used as an accumulator. Don't worry too much about the rest of it for now.

Now let's see what happens when we change those long longs to shorts:

#include <stdio.h>

int main( void )
{
  short x, y, z;

  x = 5;
  y = 6;
  z = x + y;

  printf( "x = %hd, y = %hd, z = %hd\n", x, y, z );
  return 0;
}

Again, we run it through gcc -S to generate the assembly code, et voilà:

    movw    $5, -6(%rbp)
    movw    $6, -4(%rbp)
    movzwl  -6(%rbp), %edx
    movzwl  -4(%rbp), %eax
    leal    (%rdx,%rax), %eax
    movw    %ax, -2(%rbp)
    movswl  -2(%rbp),%ecx
    movswl  -4(%rbp),%edx
    movswl  -6(%rbp),%esi
    movl    $.LC0, %edi
    movl    $0, %eax
    call    printf
    movl    $0, %eax
    leave
    ret

Different mnemonics - instead of movq we get movw and movswl, we're using %eax, which is the lower 32 bits of %rax, etc.

Once more, this time with floating-point types:

#include <stdio.h>

int main( void )
{
  double x, y, z;

  x = 5;
  y = 6;
  z = x + y;

  printf( "x = %f, y = %f, z = %f\n", x, y, z );
  return 0;
}

gcc -S again:

    movabsq $4617315517961601024, %rax
    movq    %rax, -24(%rbp)
    movabsq $4618441417868443648, %rax
    movq    %rax, -16(%rbp)
    movsd   -24(%rbp), %xmm0
    addsd   -16(%rbp), %xmm0
    movsd   %xmm0, -8(%rbp)
    movq    -8(%rbp), %rax
    movq    -16(%rbp), %rdx
    movq    -24(%rbp), %rcx
    movq    %rax, -40(%rbp)
    movsd   -40(%rbp), %xmm2
    movq    %rdx, -40(%rbp)
    movsd   -40(%rbp), %xmm1
    movq    %rcx, -40(%rbp)
    movsd   -40(%rbp), %xmm0
    movl    $.LC2, %edi
    movl    $3, %eax
    call    printf
    movl    $0, %eax
    leave
    ret

New mnemonics (movsd), new registers (%xmm0).

So basically, after translation, there's no need to tag the data with type information; that type information is "baked in" to the machine code itself.

John Bode