
I'm building a small bytecode VM that will run on a variety of platforms including exotic embedded and microcontroller environments.

Each opcode in my VM is variable length (no more than 4 bytes, no less than 1 byte). When interpreting the opcodes, I want to create a tiny "cache" for the current opcode. However, because it has to run on so many different platforms, this is hard to do.

So, here are a few examples of the expected behavior:

  1. On an 8-bit microcontroller with an 8-bit memory bus, I'd want it to load only 1 byte, because loading any more would take multiple (slow) memory operations, and in theory the current opcode might only need 1 byte to execute
  2. On an 8086 (16-bit), I'd want to load 2 bytes, because loading only 1 byte would basically throw away useful data that would just have to be read again later, but I don't want to load more than 2 bytes because that would take multiple operations
  3. On a 32-bit ARM processor, I'd want to load 4 bytes, because otherwise we're either throwing away data that might have to be read again, or doing multiple operations

I would say this could be handled easily by just assuming that `unsigned int` is good enough, but on 8-bit AVR microcontrollers `int` is defined as 16 bits while the memory data bus is only 8 bits wide, so 2 memory load operations would be required.

Anyway, current ideas:

Using `uint_fast16_t` seems to work as expected on most platforms (32 bits on ARM, 16 bits on 8086, 64 bits on x86-64). However, it clearly still leaves out AVR and other 8-bit microcontrollers.
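
For illustration, this is roughly how I imagined the cache working with `uint_fast16_t` (the `opcache_t` name and `refill_cache` helper are just placeholders; bounds checking and endianness are ignored here):

```c
#include <stdint.h>
#include <string.h>

/* Placeholder: the opcode cache is one "fast" 16-bit-or-wider integer. */
typedef uint_fast16_t opcache_t;

/* Fill the cache with as many opcode bytes as the cache word can hold
   (2, 4, or 8 bytes depending on how wide uint_fast16_t is on the target).
   Reading past the end of the bytecode is ignored in this sketch. */
static opcache_t refill_cache(const uint8_t *pc)
{
    opcache_t cache = 0;
    memcpy(&cache, pc, sizeof cache);
    return cache;
}
```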

I thought using `uint_fast8_t` might work, but it appears that on most platforms it's defined as `unsigned char`, which definitely isn't optimal.

Also, there is another problem that must be solved: unaligned memory access. On x86 this probably isn't a problem (in theory it takes 2 memory operations, but it's probably hidden by the hardware), but on ARM I know that an unaligned 32-bit access can cost as much as 3 times a single aligned 32-bit load. If the address is unaligned, I want to do the aligned load instead and keep as much of the data as possible, but at all costs avoid another memory operation.
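
To make that concrete, here's the kind of thing I mean for the ARM case (just a sketch; I'm assuming little-endian and ignoring strict-aliasing and end-of-buffer issues):

```c
#include <stdint.h>

/* Round the program counter down to a 4-byte boundary, do one aligned
   32-bit load, shift the bytes at and after pc down to the low end, and
   report how many of the loaded bytes are actually usable. */
static uint32_t load_aligned_window(const uint8_t *pc, unsigned *usable_bytes)
{
    uintptr_t addr    = (uintptr_t)pc;
    uintptr_t aligned = addr & ~(uintptr_t)3;        /* 4-byte aligned base */
    unsigned  offset  = (unsigned)(addr - aligned);  /* 0..3 */

    uint32_t word = *(const uint32_t *)aligned;      /* single aligned load */
    *usable_bytes = 4 - offset;
    return word >> (offset * 8);                     /* pc's byte ends up lowest */
}
```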

Is there a way to do this with some magical preprocessor includes or some such, or does it just require manually defining the optimum cache size before compiling for each platform?
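
In other words, I'd like to avoid having to hand-maintain something like this for every target (the macros checked below are only the ones I know of; the list is certainly incomplete):

```c
#include <stdint.h>

/* Hand-picked cache word per target family. */
#if defined(__AVR__)              /* 8-bit data bus: cache 1 byte          */
typedef uint8_t  opcache_t;
#elif defined(__MSP430__)         /* 16-bit targets: cache 2 bytes         */
typedef uint16_t opcache_t;
#else                             /* assume 32-bit or wider: cache 4 bytes */
typedef uint32_t opcache_t;
#endif
```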

Earlz
  • Your build procedure (e.g. your `Makefile`) could run some configuration command to e.g. generate a header file containing the appropriate `#define`s... But I am not sure your "caching" idea will make your VM run faster... – Basile Starynkevitch Apr 06 '13 at 06:05
  • @BasileStarynkevitch why would 4 discrete memory loads ever be faster than a single discrete load? Especially when you're throwing away data at the assembly level. Also, I'd really strongly like to avoid having to use any code-generating configure stuff (I like my simple build process :) ) – Earlz Apr 06 '13 at 06:06
  • Because the compiler may optimize them, and because perhaps most of the VM interpretation cost is elsewhere... If you don't want to generate configurable stuff, you cannot solve your issue... Also, your 4-byte load will usually be unaligned, and that has a significant performance cost. You are trying to micro-optimize manually, which is almost always a bad thing to do (especially if you have not benchmarked it). Leave micro-optimizations to the compiler. – Basile Starynkevitch Apr 06 '13 at 06:07
  • @BasileStarynkevitch I thought about this, but it didn't make it into the question; I've added it as a criterion now. In my experience with making VMs in the past, this is a critical part though. I also don't see how any amount of static analysis could let the compiler figure this out, especially because if the compiler did such an optimization without my control, it'd break my code (in certain cases I have to be able to invalidate this opcode cache) – Earlz Apr 06 '13 at 06:15
  • @BasileStarynkevitch you make some good points though. I think I'm going to build it both with and without this opcache and see which one is faster – Earlz Apr 06 '13 at 06:26
  • And we really can't help much, because you are not showing any source code (and then, your question becomes off-topic here). – Basile Starynkevitch Apr 06 '13 at 06:31
  • So, what you need to know at build time are actually two things: 1. the width of the memory bus (or maybe of the native CPU registers), and 2. the optimum alignment of data accesses in the host's memory. - Are you going pure C, or are you ready to use some (inline) assembler too? Which compiler are you targeting? - Maybe [this](http://gcc.gnu.org/onlinedocs/gcc/Alignment.html) can help in some way. – JimmyB Apr 08 '13 at 13:41

1 Answer


There is no automatic way to do this using the types or information provided by standard C (in headers such as `<stdint.h>` and so on).

Problems such as this are sometimes handled by executing and measuring sample code on the target platform and using the results to decide which code to use in practice. The samples might be executed during the build and the result compiled into the final code, or they might be executed at the start of each program run and the result used for the duration of that run.
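
For example, a crude start-of-execution measurement might look something like this sketch (the two fetch routines, the iteration count, and the use of `clock` are placeholders for whatever the VM actually measures):

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Two candidate fetch strategies (illustrative only). */
static uint32_t fetch_bytewise(const uint8_t *pc)
{
    return pc[0];                                  /* one narrow load */
}

static uint32_t fetch_wordwise(const uint8_t *pc)
{
    return (uint32_t)pc[0] | ((uint32_t)pc[1] << 8)
         | ((uint32_t)pc[2] << 16) | ((uint32_t)pc[3] << 24);
}

/* Function pointer the interpreter would call through; chosen once at startup. */
static uint32_t (*fetch_opcode)(const uint8_t *) = fetch_bytewise;

/* Time both strategies over a sample buffer (len must be at least 4)
   and keep whichever finished faster. */
static void pick_fetch_strategy(const uint8_t *sample, size_t len)
{
    volatile uint32_t sink = 0;
    clock_t t1, t2;
    size_t i;

    t1 = clock();
    for (i = 0; i < 1000000; i++)
        sink += fetch_bytewise(sample + (i % (len - 3)));
    t1 = clock() - t1;

    t2 = clock();
    for (i = 0; i < 1000000; i++)
        sink += fetch_wordwise(sample + (i % (len - 3)));
    t2 = clock() - t2;

    fetch_opcode = (t2 < t1) ? fetch_wordwise : fetch_bytewise;
    (void)sink;
}
```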

Eric Postpischil