
I'm looking for a SIMD library focused on small (4x4) matrix operations for graphics. There are lots of single-precision ones out there, but I need to support both single and double precision.

I've looked at Intel's IPP MX library, but I'd prefer something with source. I'm very interested in SSE3+ implementations of these particular operations:

  1. Mat4 * Mat4
  2. Mat4 * Vec4
  3. Mat4 * Array of Mat4
  4. Mat4 * Array of Vec4
  5. Mat4 inversion (nice to have)

EDIT: No "premature optimization" answers, please. Anyone who has worked with small matrices knows GCC does not vectorize these as well as hand-optimized intrinsics or ASM. And in this case it matters, or I wouldn't be asking.
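To be concrete, this is the sort of hand-written kernel I mean: a minimal SSE2 sketch of Mat4 * Vec4 in double precision, assuming column-major storage and 16-byte-aligned data (the function name is only illustrative):

    // Minimal SSE2 sketch: 4x4 double-precision matrix * vector.
    // Assumes column-major storage and 16-byte-aligned pointers.
    #include <emmintrin.h>

    void mat4d_mul_vec4d(const double m[16], const double v[4], double out[4])
    {
        __m128d lo = _mm_setzero_pd();   // accumulates out[0..1]
        __m128d hi = _mm_setzero_pd();   // accumulates out[2..3]
        for (int c = 0; c < 4; ++c) {
            __m128d s = _mm_load1_pd(v + c);  // broadcast v[c] to both lanes
            lo = _mm_add_pd(lo, _mm_mul_pd(_mm_load_pd(m + 4*c),     s));
            hi = _mm_add_pd(hi, _mm_mul_pd(_mm_load_pd(m + 4*c + 2), s));
        }
        _mm_store_pd(out,     lo);
        _mm_store_pd(out + 2, hi);
    }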

– Justicle (edited by David Heffernan)
  • Why all the down-votes ? Seems like a perfectly good question to me... – Paul R Apr 26 '11 at 15:50
  • The OP initially rejected two reasonable answers, then edited the question to justify one of the rejections, and eventually relented on the other rejection. The question is fine, but the asker's etiquette needs improvement. – user57368 Apr 27 '11 at 09:01
  • @user57368 Retaliatory downvotes, eh? You asked why you got downvoted, and you got your answer. How can that be construed as somehow misleading? The edit is clearly marked. – Justicle Apr 27 '11 at 23:42

5 Answers


Maybe the Eigen library?

It supports the SSE2/3/4, ARM NEON, and AltiVec instruction sets.
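For reference, here is a minimal sketch of the requested operations using Eigen's fixed-size types (Matrix4f/Vector4f work the same way for single precision):

    // Minimal Eigen sketch covering the operations from the question.
    #include <Eigen/Dense>

    int main()
    {
        Eigen::Matrix4d A = Eigen::Matrix4d::Random();
        Eigen::Matrix4d B = Eigen::Matrix4d::Random();
        Eigen::Vector4d v = Eigen::Vector4d::Random();

        Eigen::Matrix4d C    = A * B;          // Mat4 * Mat4
        Eigen::Vector4d w    = A * v;          // Mat4 * Vec4
        Eigen::Matrix4d Ainv = A.inverse();    // Mat4 inversion
        return 0;
    }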

– Karl von Moor

Eigen supports fixed-size matrices. Small fixed-size matrices can be allocated on the stack for better performance. 4x4 is a good fit for SSE, since the SSE vector size is 128 bits: a row or column of four double-precision numbers fits exactly into two 128-bit SSE vectors. This makes a SIMD implementation straightforward.

Another option is to code it yourself. Since your matrices are small and fit into the L1 cache, you don't have to bother with the memory tiling needed for large matrices. You could use AVX for even better performance; newer versions of GCC and Visual C++ 2010 support AVX intrinsics. An AVX vector is 256 bits wide and holds exactly four double-precision numbers.
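As a hedged sketch of that idea: one 256-bit AVX register per column of doubles, assuming column-major storage and 32-byte-aligned data (the function name is mine, not from any library):

    // AVX sketch: 4x4 double-precision matrix * vector, one __m256d per column.
    // Requires 32-byte-aligned data for _mm256_load_pd/_mm256_store_pd.
    #include <immintrin.h>

    void mat4d_mul_vec4d_avx(const double m[16], const double v[4], double out[4])
    {
        __m256d r = _mm256_mul_pd(_mm256_load_pd(m + 0),
                                  _mm256_broadcast_sd(v + 0));
        r = _mm256_add_pd(r, _mm256_mul_pd(_mm256_load_pd(m + 4),
                                           _mm256_broadcast_sd(v + 1)));
        r = _mm256_add_pd(r, _mm256_mul_pd(_mm256_load_pd(m + 8),
                                           _mm256_broadcast_sd(v + 2)));
        r = _mm256_add_pd(r, _mm256_mul_pd(_mm256_load_pd(m + 12),
                                           _mm256_broadcast_sd(v + 3)));
        _mm256_store_pd(out, r);
    }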

– pic11

It's not fully complete yet, but I wanted to pitch my own library, glsl-sse2.

– LiraNuna

There's a 4x4 AVX implementation here. It's written as an example application, but it shouldn't be too hard to extract the interesting parts into a shared library. I thought I'd post this despite the age of the original question, for anyone landing here in the future.

– Michael_73

If you're using a modern compiler, you probably don't need to bother. Automatic vectorization in most compilers should be able to easily transform for loops with fixed bounds into SIMD code. GCC has had this for quite a while, and it is one of the main selling points of Intel's compiler (though you should be careful about using Intel's compiler if you might want to use AMD chips).
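As an illustration, this is the kind of fixed-bound loop the answer has in mind; whether a given compiler (e.g. gcc -O3) actually vectorizes it well is exactly what the comments below dispute:

    // Plain 4x4 matrix * vector with fixed loop bounds, column-major storage,
    // written so an auto-vectorizer can fully unroll and vectorize it.
    void mat4_mul_vec4(const float m[16], const float v[4], float out[4])
    {
        for (int r = 0; r < 4; ++r) {
            float sum = 0.0f;
            for (int c = 0; c < 4; ++c)
                sum += m[4*c + r] * v[c];
            out[r] = sum;
        }
    }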

– user57368
  • Do you have examples of where you may wish to be careful if using an Intel compiler with an AMD chip? – rcollyer Apr 21 '11 at 20:01
  • Intel's gotten in legal trouble in the past for ICC checking the vendor string returned by the CPUID instruction instead of relying only on checks for SSEx support, which meant that ICC-generated code wouldn't use SSEx code-paths on non-Intel machines. A quick look at the current documentation shows that you can force the use of up to SSSE3 for non-Intel CPUs, but if you want to use run-time code-path selection, it will still use the slowest option on non-Intel CPUs. – user57368 Apr 21 '11 at 20:24
  • Did not know that. I'll have to compare ifort vs. open64 (and others) on my Opteron system. – rcollyer Apr 21 '11 at 20:36
  • Care to provide a bit more information as to why? (And presumably why the downvote?) – user57368 Apr 21 '11 at 22:12
    Downvoted because even now in 2016, modern compilers suck at automatic vectorization. Manually written assembly or intrinsics code tends to be 2-3 times faster. – Soonts Feb 28 '16 at 12:22