Big Picture
I'm writing some code to simulate a computer at the transistor level. The emulator boils down to a graph where each node is a transistor, and each edge is a wire connecting any two transistor nodes on the graph. This graph is cyclic, and transistor nodes may be connected to themselves.
To run a single "step" of the emulator, two separate functions are run:
- Each wire edge is processed, setting the input of its target node from the output of its source node. Each wire is visited exactly once per step, but a transistor may be visited multiple times.
- Each transistor node output state is updated from its input's states (how is outside the scope of this question, I'm pretty sure I'm doing it efficiently).
I believe I have the second step optimised, but I need help making the first step more efficient.
Implementation
The code looks roughly like this:
type InputBit = usize;
type OutputBit = usize;

struct Emulator {
    inputs: Vec<u64>,
    outputs: Vec<u64>,
    wires: Vec<(InputBit, OutputBit)>,
}

impl Emulator {
    fn step(&mut self) {
        self.step_wires();
        self.step_transistors();
    }

    fn step_wires(&mut self) {
        for &(input, output) in &self.wires {
            // NB omitted bit-twiddling to get bit indices
            self.inputs[input] = self.outputs[output];
        }
    }

    fn step_transistors(&mut self) {
        // ... omitted for brevity ...
    }
}
Each transistor node N is composed of two input bits, at bits 2N and 2N+1 in self.inputs, and two output bits, at bits 2N and 2N+1 in self.outputs.
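To make the packed layout concrete, here is a minimal sketch of the kind of bit-twiddling the code above omits. The helper names (get_bit, set_bit, node_inputs) are hypothetical, not from the actual emulator; they just assume the 2N / 2N+1 layout described above, with bits packed 64 per u64 word.

```rust
// Hypothetical helpers illustrating the packed-bit layout.
// Bit B lives in word B / 64, at position B % 64 within that word.
fn get_bit(words: &[u64], bit: usize) -> bool {
    (words[bit / 64] >> (bit % 64)) & 1 == 1
}

fn set_bit(words: &mut [u64], bit: usize, value: bool) {
    let mask = 1u64 << (bit % 64);
    if value {
        words[bit / 64] |= mask;
    } else {
        words[bit / 64] &= !mask;
    }
}

// The two input bits of transistor node n sit at positions 2n and 2n + 1.
fn node_inputs(inputs: &[u64], n: usize) -> (bool, bool) {
    (get_bit(inputs, 2 * n), get_bit(inputs, 2 * n + 1))
}
```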
The problem, as I see it, is that my list of wires (and the transistors) is in arbitrary order, which is very cache-inefficient. For example, imagine this set of wires (input node bit, output node bit):
[
(0, 1000),
(1000, 2000),
(1, 1001),
(1001, 2001),
]
If my cache line is smaller than 1000 bits, most of these reads and writes will miss the cache. If the wires were reorganised into:
[
(0, 1000),
(1, 1001),
(1000, 2000),
(1001, 2001),
]
Then there are fewer cache misses. Equally, I could also "move" the transistor nodes to give the following equivalent graph:
[
(0, 2),
(1, 3),
(2, 4),
(3, 5),
]
Which now only uses one cache line! Much faster. (Note this example is slightly misleading: the node indices will be densely packed, i.e. there won't be any "empty" node indices that are unused, but it is fine to "swap" nodes.)
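The "move the nodes" idea above can be sketched as a relabelling pass. One simple heuristic is to assign new node indices in BFS visit order, so that nodes joined by wires tend to end up with nearby indices. This is a sketch under the stated layout (node index = bit index / 2); the function name and shape are mine, not from the emulator:

```rust
use std::collections::VecDeque;

// Relabel nodes in BFS order over the (undirected) wire graph.
// `wires` holds (input bit, output bit) pairs; bit / 2 recovers the
// owning node index. Returns new_index[old_node] = new node index.
fn bfs_relabel(num_nodes: usize, wires: &[(usize, usize)]) -> Vec<usize> {
    // Build an undirected adjacency list over node indices.
    let mut adj = vec![Vec::new(); num_nodes];
    for &(a, b) in wires {
        let (na, nb) = (a / 2, b / 2);
        adj[na].push(nb);
        adj[nb].push(na);
    }
    let mut new_index = vec![usize::MAX; num_nodes];
    let mut next = 0;
    let mut queue = VecDeque::new();
    // Loop over all nodes so disconnected components are covered too.
    for start in 0..num_nodes {
        if new_index[start] != usize::MAX {
            continue;
        }
        new_index[start] = next;
        next += 1;
        queue.push_back(start);
        while let Some(node) = queue.pop_front() {
            for &nbr in &adj[node] {
                if new_index[nbr] == usize::MAX {
                    new_index[nbr] = next;
                    next += 1;
                    queue.push_back(nbr);
                }
            }
        }
    }
    new_index
}
```

After computing the permutation you would rewrite every wire's bit indices (new bit = 2 * new_index[old bit / 2] + old bit % 2) and permute the transistor state accordingly. BFS is only a starting heuristic; bandwidth-reduction orderings such as (reverse) Cuthill-McKee follow the same structure but pick neighbours in degree order.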
The Question
What algorithm can I use to choose the order in which I visit the wire edges, and/or to reorder the transistor node indices, so that I get the minimum number of cache misses while traversing the graph?
I think something that reduces the total "distance" across all the edges would be a good start. Then perhaps something that sorts the edges so that those lying entirely within a single cache line are visited first, in cache-line order, followed by the edges that span different cache lines in some sensible order?
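The edge-sorting half of that idea is easy to prototype. Below is a minimal sketch that groups wires by the cache line of each endpoint, with same-line edges first; CACHE_LINE_BITS and the function name are my own assumptions (512 bits = a 64-byte line), not anything from the question:

```rust
// Assumed: a 64-byte cache line, i.e. 512 bits of packed state per line.
const CACHE_LINE_BITS: usize = 512;

// Sort wires so edges touching the same cache lines are processed
// consecutively: edges contained in one line sort before cross-line
// edges, then everything groups by (input line, output line).
fn sort_wires_by_line(wires: &mut [(usize, usize)]) {
    wires.sort_by_key(|&(input, output)| {
        let in_line = input / CACHE_LINE_BITS;
        let out_line = output / CACHE_LINE_BITS;
        (in_line != out_line, in_line, out_line)
    });
}
```

Since the wire list never changes between steps, this sort (like the relabelling) is a one-time preprocessing cost amortised over every subsequent step.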