PC-relative addressing on an assembly-like language compiler

Question

I'm currently writing a compiler for a custom asm-like programming language and I'm really confused about how to do proper PC-relative addressing for data labels.

main    LDA RA hello
        IPT #32
        HLT

hello   .STR "Hello, world!"

The pseudo-code above, after compilation, results in the following hex:

31 80 F0 20 F0 0C 48 65 6C 6C 6F 2C 20 77 6F 72 6C 64 21 00

3180, F020 and F00C are the LDA, IPT and HLT instructions.

As seen in the code, the LDA instruction takes the label hello as an argument, which, when compiled, becomes the value 02, meaning "incremented PC + 0x02" (if you look at the code, that's the location of the "Hello, world!" line relative to the LDA call). The thing is: .STR is not an instruction; it only tells the compiler to add a (0-terminated) string at the end of the executable, so, were there other instructions after the hello label declaration, that offset would be wrong.

But I can't find a way to calculate the right offset, other than having the compiler be able to travel through time. Do I have to "compile" it two times? First for the data labels, then for the actual instructions?

The offset to the declaration of `hello` would not be changed by more instructions after the `.STR` variable. – kdopen Aug 26 '16 at 20:27

score 2 · Accepted Answer · answered Aug 26 '16 at 20:24

Yes, most assemblers are (at least) two-pass - precisely because of forward references like these. Adding macro capabilities can add more passes.

Look at an assembly listing, not just the op-codes. As you said the actual offset is "2", I'm assuming memory is word-addressed.

0000 3180   main    LDA RA hello
0001 F020           IPT #32
0002 F00C           HLT

0003 4865   hello   .STR "Hello, world!"

The first two columns are the PC and opcode. I'm not sure how the LDA instruction has been encoded (where is the +2 offset in there?)

In the first pass, assuming all addressing is relative, the assembler would emit the fixed part of the op-code (covering the LDA RA part) along with a marker to show it needs to patch up the instruction with the address of hello in the second pass.

At this point it knows the size, but not the complete value, of the final machine language.

It then continues on, working out the address of each instruction and building its symbol table.

In the second pass, now knowing the above information, it patches each instruction by calculating relative offsets etc. It also often regenerates the entire output (including PC values).
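The two passes described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the asker's compiler: memory is taken to be word-addressed with 16-bit words, every instruction is assumed to occupy one word, `.STR` is assumed to pack two characters per word plus a 0 terminator, and the names `size_in_words`, `first_pass` and `second_pass` are hypothetical. The actual bit layout of `LDA` is left out because the thread does not pin it down.

```python
# Hypothetical two-pass label resolution; sizes and syntax are assumptions.

def size_in_words(op, operand):
    """Words occupied by one source line (assumed: 16-bit words)."""
    if op == ".STR":
        text = operand.strip('"')
        # Characters plus a 0 terminator, packed two bytes per word.
        return (len(text) + 1 + 1) // 2
    return 1  # assumption: every instruction is exactly one word


def first_pass(lines):
    """Assign an address to every line and record each label's address."""
    symbols, placed, pc = {}, [], 0
    for label, op, operand in lines:
        if label:
            symbols[label] = pc
        placed.append((pc, op, operand))
        pc += size_in_words(op, operand)
    return symbols, placed


def second_pass(symbols, placed):
    """Turn label operands into offsets from the incremented PC."""
    fixups = {}
    for pc, op, operand in placed:
        if operand in symbols:
            fixups[pc] = symbols[operand] - (pc + 1)
    return fixups


program = [
    ("main", "LDA", "hello"),  # register operand omitted for brevity
    (None, "IPT", "#32"),
    (None, "HLT", None),
    ("hello", ".STR", '"Hello, world!"'),
]
symbols, placed = first_pass(program)
fixups = second_pass(symbols, placed)
print(symbols)  # {'main': 0, 'hello': 3}
print(fixups)   # {0: 2}
```

The offset of 2 matches the value reported in the question. Appending more lines after `hello` leaves it unchanged, while inserting lines between `main` and `hello` changes it, which is why the addresses have to be settled before any offset is computed.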

Occasionally, something will be detected in the second pass which prevents it continuing. For example, perhaps you can only reference objects within 256 words (-128 through +127, the range of an 8-bit signed offset), but the label hello turns out to be further away than that. This means it should have used a two-word instruction (with an absolute address), which changes everything it learnt during the first pass.

This is often referred to as a 'fix up' error. The same thing can happen during the link phase.
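A range check of this kind might look like the sketch below. The function names `fits_signed` and `encode_offset` are hypothetical, and the field width is a parameter because the thread mentions both an 8-bit example and a 9-bit field; the only assumption is that offsets are stored in two's complement.

```python
def fits_signed(value, bits):
    """True if value fits in a two's-complement field of `bits` bits."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return lo <= value <= hi


def encode_offset(value, bits):
    """Return the raw field bits for value, or fail with a fix-up error."""
    if not fits_signed(value, bits):
        raise ValueError(
            f"fix-up error: offset {value} does not fit in {bits} bits"
        )
    # Masking keeps the low `bits` bits, which is the two's-complement form.
    return value & ((1 << bits) - 1)


print(fits_signed(127, 8))         # True
print(fits_signed(128, 8))         # False
print(hex(encode_offset(-1, 9)))   # 0x1ff
```

When `encode_offset` fails, an assembler has to either report the error or pick a longer encoding and redo the address assignment, since every later address may shift.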

Single pass assemblers are only possible if you insist on 'define before use'. In which case, your code would report hello as an undefined symbol.
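For comparison, here is a rough sketch of the strict 'define before use' case, again with hypothetical names and a simplified input format: each line is reduced to a tuple of its label, its size in words, and the label it references (if any). Sizes are supplied by hand here, with 7 words assumed for the string.

```python
def single_pass(lines):
    """lines: (label or None, size in words, referenced label or None)."""
    symbols, offsets, pc = {}, {}, 0
    for label, size, ref in lines:
        if label:
            symbols[label] = pc
        if ref is not None:
            if ref not in symbols:
                # A forward reference cannot be resolved in a single pass.
                raise NameError(f"undefined symbol: {ref}")
            offsets[pc] = symbols[ref] - (pc + 1)
        pc += size
    return offsets


# Data placed first: the reference is backward, so one pass is enough.
data_first = [
    ("hello", 7, None),
    ("main", 1, "hello"),
    (None, 1, None),
    (None, 1, None),
]
print(single_pass(data_first))  # {7: -8}

# Original order: hello is used before it is defined.
code_first = [
    ("main", 1, "hello"),
    (None, 1, None),
    (None, 1, None),
    ("hello", 7, None),
]
try:
    single_pass(code_first)
except NameError as err:
    print(err)  # undefined symbol: hello
```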

You also need to read up on "program sections". Whilst .STR is not an executable instruction, it is a directive to the assembler to place the binary representation of the string into the CODE section of the image (vs DATA).

answered Aug 26 '16 at 20:24

kdopen

More likely PC-relative offsets are scaled by 2 in the machine encoding, not that memory is only word addressable. Also, it's normal to put read-only data like strings in the code section of an executable. (e.g. `.section .rodata` in gas for Unix platforms.) – Peter Cordes Aug 26 '16 at 20:32
Also possible, but it still implies there are no 8 or 24 bit instructions. – kdopen Aug 26 '16 at 20:34
Or that you need at most one byte of padding for anything you want to address with a PC-relative addressing mode. Seems worth it, since -128 .. +127B would be a pretty small range, esp. since you do usually want to separate code and data a bit. – Peter Cordes Aug 26 '16 at 20:35
Yeah, lots of simplifying assumptions made here :) – kdopen Aug 26 '16 at 20:36
BTW, 6800 was -127/+128 bytes for all conditional branches. Was fairly common in 8 bit machines. Actually, even the `BRA` (unconditional branch) was relative within an 8-bit signed offset. If you weren't certain, you used `JMP` – kdopen Aug 26 '16 at 20:38
The `LDA` instruction is encoded as 16-bit `0011 TR FS`, `TR` being 3 bits determining the register to use, and `FS` being the 9-bit signed offset. The `hello .STR "Hello, world!"` stores the string "Hello, world!" plus a 0x0 at the end of the compiled result. Let's say the program has 27 instructions and three strings are defined, `hello` being the last one. The first string data will start at address 0x1B, so the address of `hello` depends on the length of the previously stored data, but there are instructions looking for their offset relative to `hello` before it has been created. – William Fernandes Aug 26 '16 at 21:08
Hence the two passes – kdopen Aug 26 '16 at 23:45
I think I got it now... Thank you guys! – William Fernandes Aug 27 '16 at 00:01
