I'm following Jack Crenshaw's compiler tutorial (If you look at my profile, that's what all my questions are about lol) and it just got to the point where variables are introduced. He comments that the 68k requires everything to be "position-independent" which means it's "PC-relative". I get that PC is the program counter, and on x86 it's EIP. But he uses syntax like MOVE X(PC),D0 where X is a variable name. I've read a little ahead and it says nothing later about declaring a variable in .data. How does this work? To make this work in x86, what would I replace X(PC) with in MOV EAX, X(PC)?

To be honest I'm not even sure this is supposed to output working code yet, but up to this point it has and I've added code to my compiler that adds the appropriate headers etc and a batch file to assemble, link and run the result.

rpatel3001
  • You can't do that on x86 (you could on x64), there is just no way to express an EIP-relative address. – harold Aug 26 '13 at 15:41
  • @harold Well, I'm working on a 64 bit machine so I might as well make it easier for myself. What's the syntax for doing so, and what changes would I have to make to the header? – rpatel3001 Aug 26 '13 at 15:43
  • I'm not sure, I don't really use MASM. Maybe `[rel labelname]`? Maybe `[rip + offset]`? Don't forget the `dword ptr`-nonsense. If you're outputting bytes instead of text, you'd output a ModRM byte where the mod field is zero and the RM field is 5, and the offset is an sdword after that. – harold Aug 26 '13 at 15:50
  • You could also just use absolute addressing of course (optionally with relocation information). – harold Aug 26 '13 at 15:59
  • @harold but how would I access the variable using a name without having declared it first? And doesn't (E/R)IP change based on what instruction is being executed? – rpatel3001 Aug 26 '13 at 16:50
  • RIP does change yes, so really, using RIP-relative addressing is more annoying to manage than using absolute addresses (but you can let the assembler worry about that, if you go through text). Not sure what you mean about using without declaring.. Fundamentally though, a variable is just a location somewhere in RAM. Think of RAM as an array, and a variable as an element (or several adjacent elements) in that array. Names are irrelevant. – harold Aug 26 '13 at 17:47
  • @harold What I mean is that normally you have to say something like `variablename dd 12345` in the .data section to allocate it right? I apologize if I'm wrong, I'm new to the way this works. If I don't do that and just assign some random location in memory, say [rip+64], wouldn't that mess with something else? Or does all of that get taken into account during assembling and linking? And even then how would I later reference the variable without remembering the name somehow. It seems like saying X(PC) is pretty convenient in that regard. – rpatel3001 Aug 26 '13 at 17:53

2 Answers

Here's a short overview of what a statically allocated global variable (which is what this question is about) really is and what to do about them.

What is a variable anyway

To the machine, there is no such thing as a variable. It never hears about them, it never cares about them, it just has no concept of them. They're just a convention to assign a consistent meaning to a particular location in RAM (in the case of virtual memory, a position in your address space).

Where you actually put a variable is sort of up to you, but within reason. If you're going to write to it (and you probably are), it had better be in a writable location, which means: the address of that variable should fall within a memory area that is allocated and writable. The .data section is just another convention for that. You don't have to call it that, you don't even need a separate section (you could make your .text section writable and allocate your globals there, if you really wanted), you could even use OS functions like VirtualAllocEx (or equivalent) to allocate memory at a fixed position and use that (but don't do that). It's up to you. But the .data section is a convenient place to put them.

"Allocating" the variables is just a matter of choosing an address such that the variable doesn't overlap with any other variable. That's not hard, just lay them out sequentially: start a pointer var_ptr at the beginning of wherever you're going to put them (so the VA of your .data section, or 0 if you're using a linker), and then for every variable v:

  • the location l of v is align(var_ptr, round_up_to_power_of_2(sizeof(v)))
  • set var_ptr to l + sizeof(v)
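The two steps above can be sketched in a few lines of Python. This is only an illustration: the function names (layout_variables and the helpers) and the example variables are made up, and sizes are assumed to be in bytes.

```python
def round_up_to_power_of_2(n):
    """Smallest power of two that is >= n (assumes n >= 1)."""
    p = 1
    while p < n:
        p *= 2
    return p

def align(addr, alignment):
    """Round addr up to the next multiple of alignment (a power of two)."""
    return (addr + alignment - 1) & ~(alignment - 1)

def layout_variables(variables, base=0):
    """variables: list of (name, size_in_bytes) pairs, laid out in order.

    Returns a dict of name -> address and the first free address after them.
    """
    offsets = {}
    var_ptr = base
    for name, size in variables:
        location = align(var_ptr, round_up_to_power_of_2(size))
        offsets[name] = location
        var_ptr = location + size
    return offsets, var_ptr

# Hypothetical variables: a byte, a dword, a word and a qword.
offsets, end = layout_variables([("a", 1), ("b", 4), ("c", 2), ("d", 8)])
print(offsets, end)
```

The value of end is how big the section has to be at minimum; with base=0 the addresses are offsets that a linker (or your own code) adds to the section's VA later.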

As a minor variation, you could skip the alignment (most compiler textbooks do that, but in real life you should align). x86 usually lets you get away with that.

As a bigger variation, you could try to "fill the holes" left by the alignments. The simplest way to fill at least most holes is to just sort the variables biggest-first (that fills all holes if all sizes are powers of two). While that may save some space (though not necessarily any, because sections are aligned themselves), it never saves much. Under the usual alignment rules the "just lay them out sequentially"-algorithm will, at worst, waste nearly half the space it uses on holes. The pattern that leads to that is an alternating sequence of the smallest type and the biggest type. And let's be honest, that wouldn't really happen - and even if it did, that's not all that bad.
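To get a feel for the numbers, here is a small Python sketch that measures the alternating worst case against the biggest-first ordering. The sizes (1-byte and 8-byte variables) are an assumption chosen for illustration, and all sizes are powers of two so each variable is aligned to its own size.

```python
def align(addr, alignment):
    """Round addr up to the next multiple of alignment (a power of two)."""
    return (addr + alignment - 1) & ~(alignment - 1)

def total_size(sizes):
    """Bytes used when power-of-two sized variables are laid out in the given order."""
    ptr = 0
    for size in sizes:
        ptr = align(ptr, size) + size
    return ptr

# Alternating smallest and biggest type: byte, qword, byte, qword, ...
alternating = [1, 8] * 4

in_order = total_size(alternating)
biggest_first = total_size(sorted(alternating, reverse=True))
print(in_order, biggest_first)
```

With these sizes the sequential layout uses 64 bytes for 36 bytes of actual data, and sorting biggest-first removes the holes completely.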

Then, you have to make sure that the .data segment is big enough to hold all variables, and that the initial contents match what the variables were initialized with.

But you don't even have to do any of this. You can use variable declarations in the assembly code (you know how to do this), and then the assembler/linker (they typically both play a role in this) will do all of this for you (and, of course, it will also do the replacement of variable names by variable addresses).
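For a compiler that outputs assembly text, that could look roughly like the Python sketch below. It assumes MASM-style syntax and 32-bit variables only, and the underscore-prefix label scheme and the function names are made up for the example; they are not from the tutorial.

```python
def emit_data_section(variables):
    """variables: dict of source-level name -> initial 32-bit value.

    Returns the lines of a .data section with one dd declaration per variable.
    """
    lines = [".data"]
    for name, initial in variables.items():
        lines.append(f"_{name} dd {initial}")
    return lines

def emit_load(name):
    """Text for loading a 32-bit variable into eax via its label."""
    return f"mov eax, dword ptr [_{name}]"

data_lines = emit_data_section({"X": 0, "Y": 12345})
print("\n".join(data_lines))
print(emit_load("X"))
```

The compiler only has to remember which names it has seen; the assembler and linker pick the addresses.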

How to use a variable

It depends. If you're using an assembler/linker, just refer to the label that you gave the variable. The label, of course, does not have to match the name in the source code, it can be any legal unique name (for example, you could use the AST node ID of the declaration with an underscore in front of it).

So loading a variable could look like this:

mov eax, dword ptr [variablelabel]

Or, on x64, perhaps this

mov eax, dword ptr [rel variablelabel]

Which would emit a rip-relative address. If you do that, you don't have to care about the current value of RIP or where the variable is allocated, the assembler/linker will take care of it. On x64, using a RIP-relative address like that is common, for several reasons:

  • it allows the .data segment to be somewhere that isn't the first 4GB (or 2GB) of address space, as long as it's close to the .text segment
  • it's shorter than an instruction with an absolute 64bit address
  • very few instructions even take an absolute 64bit address: only the accumulator forms of mov, such as mov rax,[imm64] and mov [imm64],rax
  • you get relocations for free

If you're not using an assembler and/or linker, it becomes (at least to some extent) your own job to replace variable names with whatever address you allocated for them (if you're using a linker but no assembler, you'd make relocation data but you wouldn't yourself decide on the absolute addresses of variables).

When you're using absolute addresses, you can fill them in while emitting instructions (provided you've already allocated the variables). When you're using RIP-relative addresses, you can only fill them in once you've decided where the code will be: you emit code with the offsets set to 0, do some bookkeeping, decide where the code will go, then go back and replace the 0's with the real offsets. Deciding where the code goes is a non-trivial problem in itself, unless you take the naive approach and don't care about branch-size optimization; in that case you know the address of an instruction at the time you emit it, and therefore the offset of a variable relative to RIP. The RIP-relative offset itself is easy to calculate: subtract the address of the position immediately after the current instruction (the value RIP has while that instruction executes) from the VA (virtual address) of the variable.
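Here is a rough Python sketch of the "emit zeros, then patch" approach for a single instruction, mov eax, [rip+disp32], which encodes as opcode 8B followed by a ModRM byte with mod = 0 and rm = 5 and then a signed 32-bit displacement. The function names, the fixup list and the example addresses (0x401000 for the code, 0x403000 for X) are assumptions made up for illustration.

```python
import struct

# mov eax, [rip+disp32]: opcode 0x8B, ModRM 0x05 (mod=00, reg=000 for eax, rm=101)
MOV_EAX_RIP_REL = bytes([0x8B, 0x05])

def emit_load_placeholder(code, fixups, var_name):
    """Append the instruction with a zero displacement and remember where to patch."""
    code.extend(MOV_EAX_RIP_REL)
    fixups.append((len(code), var_name))  # offset of the disp32 field within code
    code.extend(b"\x00\x00\x00\x00")

def patch_fixups(code, fixups, code_va, var_vas):
    """Fill in every recorded displacement once the code's VA is known."""
    for disp_offset, var_name in fixups:
        # disp32 is the last field of this instruction, so RIP points just past it.
        next_rip = code_va + disp_offset + 4
        disp = var_vas[var_name] - next_rip
        struct.pack_into("<i", code, disp_offset, disp)

code = bytearray()
fixups = []
emit_load_placeholder(code, fixups, "X")
patch_fixups(code, fixups, code_va=0x401000, var_vas={"X": 0x403000})
print(code.hex())
```

Note that this only works as written because the displacement is the last field of the instruction; an instruction with an immediate after the displacement would need the full instruction length in the bookkeeping instead.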

But that's not all

You may want to make some variables non-writable, to the point that any attempt to write to them in "funny ways that the compiler can't detect" will fail. That can be accomplished by putting them in a read-only section, typically called .rdata (but the name is irrelevant really, what matters is whether the "writable" flag of the section is set in the PE header). This isn't done often, though it is sometimes used for string or array constants (which aren't properly variables).

What is done regularly is putting zero-initialized variables in their own section, one that takes no space in the executable file and is instead simply zeroed out when the program is loaded. Putting zero-initialized variables there may save some space in the executable. This section is commonly called .bss (not short for bullsh*t section), but as always, the name is irrelevant.
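As a sketch of how a compiler might decide which section a global goes in, the Python below splits variables by whether their initial value is zero. The function name, the tuple layout and the example variables are hypothetical, and real compilers have more cases (read-only data, for instance).

```python
def split_sections(variables):
    """variables: list of (name, size_in_bytes, initial_value).

    Returns (data, bss): data keeps the initial values that must be stored
    in the file, bss only needs names and sizes.
    """
    data = [(name, size, value) for name, size, value in variables if value != 0]
    bss = [(name, size) for name, size, value in variables if value == 0]
    return data, bss

data, bss = split_sections([("counter", 4, 0), ("limit", 4, 100), ("total", 8, 0)])

# Only the .data variables cost bytes in the executable file.
file_bytes = sum(size for _, size, _ in data)
zeroed_bytes = sum(size for _, size in bss)
print(data, bss, file_bytes, zeroed_bytes)
```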

More

Most compiler textbooks deal with this subject to varying degrees, though usually not in much detail, because when you get right down to it: static variables aren't hard. Certainly not compared to most other aspects of compilation. Also, some aspects are very platform specific, such as the details around the sections and how things actually end up in an executable.

Some sources/useful things (I've found all of these useful while working on compilers):

harold
  • Wow thanks for writing all that up. So, no matter how I access the memory, I have to declare that much space being used in .data (being new to the whole assembly thing I don't quite understand how/where you would set read/writability of a section)? Since there's no way to do it the (easier) 68k way, I'll probably end up adding the variable to .data. – rpatel3001 Aug 27 '13 at 02:19

Many processors support PC-Relative or Absolute addressing.

On 32-bit x86 machines, however, there are the following restrictions:

  • Jumps and Calls are always PC-Relative (unless register-based)
  • Other addresses are always Absolute (unless register-based)

C compilers that can do PC-Relative addressing will implement this the following way:

  CALL x
x:
  ; Now address "x" is on the stack
  POP EDI
  ; Now EDI contains address of "x"
  ; Now we can do (pseudo-)PC-Relative addressing:
  MOV EAX,[EDI+1234]

This is used if the address of the code in memory is not known at compile/link time (e.g. for dynamic libraries (shared objects) under Linux), so the address of a variable (here located at address "x+1234") is not known yet.
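The offset added to EDI (1234 in the example) is simply the distance from the label "x" to the variable, which stays the same wherever the module is loaded. A small Python sketch of that arithmetic, with made-up link-time addresses and load bases:

```python
def pic_offset(var_link_addr, label_link_addr):
    """Distance from label x to the variable, fixed at link time."""
    return var_link_addr - label_link_addr

# Hypothetical link-time addresses, relative to the start of the module.
LABEL_X = 0x1005
VARIABLE = 0x3010

offset = pic_offset(VARIABLE, LABEL_X)

# Whatever base the loader picks, EDI + offset lands on the variable.
results = []
for load_base in (0x10000000, 0x7F000000):
    edi = load_base + LABEL_X          # what POP EDI yields at run time
    results.append(edi + offset == load_base + VARIABLE)
print(hex(offset), results)
```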

Martin Rosenau
  • So to use variables by name, I'd have to add a label corresponding to the name of each, and do the call/pop/mov everytime I want to access it? And what determines the offset from EDI to use? – rpatel3001 Aug 26 '13 at 16:54
  • The x86 doesn't have relative addressing in the 32-bit mode, but it does in 64-bit mode. See http://wiki.osdev.org/X86-64_Instruction_Encoding#RIP.2FEIP-relative_addressing – ataylor Sep 01 '16 at 21:59