I'm looking for a SIMD library focused on small (4x4) matrix operations for graphics. There are lots of single-precision ones out there, but I need to support both single and double precision.
I've looked at Intel's IPP MX library, but I'd prefer something with source. I'm very interested in SSE3+ implementations of these particular operations:
- Mat4 * Mat4
- Mat4 * Vec4
- Mat4 * Array of Mat4
- Mat4 * Array of Vec4
- Mat4 inversion (nice to have)
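
To make the double-precision case concrete, here is a rough sketch of the kind of routine I mean. It is only an illustration under stated assumptions: matrices are column-major (`m[4*col + row]`), loads are unaligned, and it uses only SSE2 intrinsics (no SSE3+ instructions) so it builds on any x86-64 compiler. The function names are hypothetical and not taken from any existing library.

```c
#include <stddef.h>     /* size_t */
#include <emmintrin.h>  /* SSE2 intrinsics: __m128d, _mm_mul_pd, ... */

/* Hypothetical helper: r = M * v for a 4x4 double matrix.
   Assumes column-major storage: element (row, col) is m[4*col + row]. */
static void mat4d_mul_vec4d(const double *m, const double *v, double *r)
{
    __m128d lo = _mm_setzero_pd();  /* accumulates result rows 0..1 */
    __m128d hi = _mm_setzero_pd();  /* accumulates result rows 2..3 */
    for (int c = 0; c < 4; ++c) {
        __m128d s = _mm_set1_pd(v[c]);  /* broadcast v[c] to both lanes */
        lo = _mm_add_pd(lo, _mm_mul_pd(_mm_loadu_pd(m + 4 * c), s));
        hi = _mm_add_pd(hi, _mm_mul_pd(_mm_loadu_pd(m + 4 * c + 2), s));
    }
    _mm_storeu_pd(r, lo);
    _mm_storeu_pd(r + 2, hi);
}

/* Hypothetical helper: R = A * B. Each column of R is A times the
   matching column of B. R must not alias A. */
static void mat4d_mul_mat4d(const double *a, const double *b, double *r)
{
    for (int c = 0; c < 4; ++c)
        mat4d_mul_vec4d(a, b + 4 * c, r + 4 * c);
}

/* Hypothetical helper: transform n packed vec4s (4 doubles each) by M. */
static void mat4d_mul_vec4d_array(const double *m, const double *v,
                                  double *r, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        mat4d_mul_vec4d(m, v + 4 * i, r + 4 * i);
}
```

With 128-bit registers a double-precision column needs two registers instead of one, which is why the single-precision libraries don't carry over directly.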
EDIT: No "premature optimization" answers please. Anyone who has worked with small matrices knows GCC does not vectorize these as well as hand-optimized intrinsics or ASM. And in this case it matters, or I wouldn't be asking.