The obvious approach is to run it on fully randomized test inputs and compare the results against a simple known-good implementation fed the same data (e.g. written in C or your favourite high-level language, possibly just running on the host CPU rather than inside the simulator). A simple implementation running inside your simulator would be good to have as well, or instead if that's easier.
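As a rough sketch of generating that shared random input: a tiny self-contained PRNG (xorshift32 here, my choice rather than anything the question requires) produces bit-identical data on the host and inside the simulator from the same seed, which library rand() doesn't guarantee across platforms. The names fill_random and xorshift32, the float element type, and the [-1, 1) range are all assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* One step of Marsaglia's xorshift32. The state must be nonzero. */
uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Hypothetical helper: fill buf with n reproducible pseudo-random floats
   in [-1, 1). The same seed always yields the same data, so the code under
   test and the reference implementation can be fed identical input. */
void fill_random(float *buf, size_t n, uint32_t seed)
{
    uint32_t state = seed ? seed : 1u;   /* avoid the all-zero state */
    for (size_t i = 0; i < n; i++) {
        uint32_t r = xorshift32(&state);
        /* Top 24 bits -> [0, 1), exactly representable in a float. */
        float unit = (float)(r >> 8) * (1.0f / 16777216.0f);
        buf[i] = unit * 2.0f - 1.0f;
    }
}
```

Logging the seed with each run means any failing case can be reproduced exactly later.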
When you compare results, you may need to allow some wiggle room for FP rounding error if your simple implementation uses a different order of operations. A pretty standard approach is to check that the absolute differences are all within something like 1e-7, or to check relative differences. (Relative error can be large for numbers near zero that resulted from subtraction, though; catastrophic cancellation is a known problem for FP.)
(See also https://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/ and the rest of Bruce's series of FP articles if you're not already aware of these issues.)
It's perhaps worth having a reference implementation that computes in double precision, so you have a better idea of what the actual correct answers are when evaluating a computation with rounding errors.
Debugging when data doesn't match the reference:
Test again with very simple input data, like all 0.0 except a 1.0 in one element. That might highlight a wrong array indexing problem. Or all 1.0, or all -2.0.
Or some input that should produce a very simple output, for the known algorithm you're trying to implement. e.g. if most outputs are supposed to be 0.0, seeing which ones aren't, or what value they have, could be a big hint.
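The simple-input idea can be sketched in C like this: one helper builds an "impulse" array (all 0.0 except a single 1.0), and another lists which output elements came back non-zero. Both function names are hypothetical, and the float element type is an assumption.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: all zeros except a single 1.0 at index hot. */
void fill_impulse(float *buf, size_t n, size_t hot)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = 0.0f;
    if (hot < n)
        buf[hot] = 1.0f;
}

/* Print the index and value of every non-zero element and return how
   many there were. With an impulse input, where the non-zero outputs
   land shows which input elements each output actually read. */
size_t report_nonzero(const float *out, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (out[i] != 0.0f) {
            printf("out[%zu] = %.9g\n", i, (double)out[i]);
            count++;
        }
    }
    return count;
}
```

Moving the 1.0 through each input position in turn and recording where it shows up in the output gives a map of the indexing your code really performs, which you can compare with the map from the reference implementation.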
Also note that most real-world CPUs have some kind of instruction cache, so it's usually worth a tiny bit of loop overhead (e.g. a large unrolled loop) to reuse a loop body that fits in cache, instead of fully unrolling / peeling a loop into a huge block of straight-line code. (90k instructions sounds like too much.) But if there really isn't any simple repetition whose loop overhead could be amortized by unrolling, fully straight-line code is worth considering.