
Consider an architecture (namely, the Infocom Z-machine) with a large, read-only memory region (called "high memory") that is only intended to store strings (and machine code, but that doesn't pose a problem). This region can only be accessed by certain instructions that display text. Of course, this means that pointers to high memory can't be dereferenced.

I'd like to write an LLVM backend for this architecture. In order to do this, I need a way to tell the backend to store certain strings in high memory, and to obtain the "packed addresses" of said strings (and also to convert the strings to the Z-machine string encoding, but that's not the point).

Ideally, I'd be able to define a C function-like macro HIGHMEM_STRING which would take a string literal and expand to an integer constant. Supposing there's a function void print_paddr(uint16_t paddr), I'd like to be able to do:

print_paddr(HIGHMEM_STRING("It is pitch black. You are likely to be eaten by a grue."));

And then the backend would know to place the string in high memory and pass its packed address to print_paddr as a parameter.
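For concreteness, a packed address is the string's byte address divided by a version-dependent factor (2 in Z-machine versions 1 to 3, 4 in versions 4 and 5, 8 in version 8), so the string also has to be aligned to that factor. Below is a minimal sketch of that arithmetic. The names `packed_scale` and `to_packed_addr` are hypothetical, and versions 6 and 7, which add a separate offset, are deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: ratio between a byte address and a packed address.
   Returns 0 for versions this sketch does not handle (6 and 7). */
static unsigned packed_scale(int version) {
    if (version >= 1 && version <= 3) return 2;
    if (version == 4 || version == 5) return 4;
    if (version == 8) return 8;
    return 0;
}

/* Hypothetical helper: convert a byte address in high memory to the packed
   address that print_paddr would receive. The address must be aligned to
   the scale factor, otherwise it has no packed representation. */
static uint16_t to_packed_addr(uint32_t byte_addr, int version) {
    unsigned scale = packed_scale(version);
    assert(scale != 0);
    assert(byte_addr % scale == 0);
    return (uint16_t)(byte_addr / scale);
}
```

So a string at byte address 0x1000 in a version 3 story file would have packed address 0x0800, and the backend (or linker) is the only place where that final byte address is known.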

My question has three parts:

  1. Can I implement such a macro using LLVM intrinsics, or an asm block with a special directive for the backend, or some other similar way without having to fork Clang? Otherwise, what would I have to change in Clang?
  2. How can I annotate the LLVM IR to convey to the backend that a string should be placed in high memory and replaced with its packed address?
  3. If HIGHMEM_STRING is too hard, or impossible, to implement as a macro, what are the alternatives?

1 Answer


The Hexagon backend does something similar: it stores data in a special section whose base address is loaded into the GP register, and the referencing instruction carries an offset into that section. Look for CONST64 in the Hexagon backend to get an idea of how these are processed.

Basically, when we identify data in the LLVM IR that we want to put in this special section, we create a pseudo-instruction carrying the data. When we are writing out the ELF file, we switch to the GP-relative section, emit the data, then switch back to the text section and emit the instruction that dereferences the symbol.
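The same "put it in a dedicated section" idea can be approximated at the source level without touching Clang. The following is only a sketch under stated assumptions, not what Hexagon does: it relies on GCC/Clang extensions (the `section` attribute and statement expressions), assumes an ELF-style target, and uses a made-up section name, `highmem`. Clang records the section name on the global in the IR (as `section "highmem"`), which a backend could key on. `highmem_first_char` is a hypothetical host-only check.

```c
#include <assert.h>
#include <stdint.h>

/* "highmem" is a hypothetical section name; a Z-machine backend would have
   to recognise it and lay the section out in high memory. */
#define HIGHMEM_SECTION "highmem"

/* GNU statement expression (GCC/Clang extension): defines a static array in
   the named section and yields its address as an integer. On a real
   Z-machine target the backend would substitute the packed address. */
#define HIGHMEM_STRING(s)                                             \
    (__extension__({                                                  \
        static const char highmem_str_[]                              \
            __attribute__((section(HIGHMEM_SECTION), used)) = s;      \
        (uintptr_t)highmem_str_;                                      \
    }))

/* Host-only sanity check: on an ordinary target the integer is just a
   pointer, so the first character can be read back. This would not be
   possible on the Z-machine, where high memory cannot be dereferenced. */
static char highmem_first_char(uintptr_t addr) {
    assert(addr != 0);
    return *(const char *)addr;
}
```

Note that the result is not an integer constant expression, so it cannot be used in a static initialiser or a `case` label; it only works as an ordinary function argument, as in the `print_paddr` call from the question.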

It'll probably be easier if you can identify these strings based on their contents rather than having the user specify them in the program text.

Colin LeMahieu