Is XMMATRIX efficient for 2D transformations or should I make a custom 3x3 matrix suite?

Question

I'm building a high-performance UI layout engine on top of Direct3D 11. The application is being developed using Visual Studio 2013, targeting x64 and is intended for Windows 7 (with Platform Update) and up.

I need to do matrix transformations on 2D elements in the visual tree, and I am wondering whether DirectXMath's built-in (SIMD-optimized) XMMATRIX and its related functions are efficient for 2D use (which only requires a 3x3 matrix, while XMMATRIX et al. are 4x4), or whether I should roll my own matrix class / functions (probably without any SIMD-specific code, though).

It seems to me that a 4x4 matrix throughout would mean a lot of redundant calculations being performed, but then again that might be offset by SIMD instructions when compared to non-SIMD 3x3 matrix work.

Edit: Comments about how "premature optimization is the root of all evil" (and derivatives thereof) are superfluous here (and ironically premature, since you know nothing about the project - or me). The question sums up what I would like viewpoints on and want to know more about.

Why not start with the 4x4, see if it creates a bottleneck, and then rewrite it if it turns out to be the most significant issue in performance analysis? Rule #1 of optimization: *don't do it*. Rule #2 of optimization (experts only!) *don't do it...yet*. — HostileFork says dont trust SE, Oct 01 '14 at 12:49
The 4x4 matrix multiplications aren't a bottleneck in AAA games handling many, many, many objects composed of many, many, many triangles. I think it's safe to assume you will do just fine. In fact, I wouldn't be surprised if your hand-rolled matrix math was less efficient than the library one. — , Oct 01 '14 at 12:51
Those games are performing 3D transformations, so they need a 4x4 matrix. Since I am doing 2D transformations I get by with a 3x3 matrix. A 3x3 matrix multiplication involves 27 internal multiplications. A 4x4 matrix does 64. That is more than twice as many. In any case - I am interested in the concepts and principles that govern the answer to this as much as "doing fine". If one method is better than another, there is nothing "premature" about employing one over the other from the start. — d7samurai, Oct 01 '14 at 13:04
A 4x4 matrix can be optimized more easily through `NEON` and `SSE/AVX`. It's much nicer aligned and in size for almost all CPU architectures your code will potentially run on. — JustSid, Oct 01 '14 at 13:14
@d7samurai In addition to what JustSid wrote, those k multiplications can map to wildly different instruction sequences with *very* different performance behavior. Memory layout, vectorization, data dependencies, etc. are all reasons why an optimized 4x4 multiplication might be competitive with a simple 3x3 multiplication in performance. Also, I disagree with your interpretation of "premature"; if it never brings any tangible benefit, any additional effort put into it is premature and wasted. — , Oct 01 '14 at 13:46
@d7samurai Under that definition, you can never discuss an optimization as a hypothetical, since it may turn out to not improve performance. That seems unsatisfying. Also, it's possible that a change has a *measurable but irrelevant* performance benefit. Anyway, we've fallen to the level of nitpicking. — , Oct 01 '14 at 14:06
@delnan Let's put it this way: Sometimes a question is an invitation to a _benefit_ analysis - not a _cost/benefit_ analysis (i.e. the assessment of whether the potential benefit is worth the effort required to reach it is irrelevant). — d7samurai, Oct 01 '14 at 14:24
@d7samurai "If it never does, it isn't an optimization." Not necessarily. Consider a video transcoder that operates at 100FPS. If you optimize it to run at up to 120FPS, all other things being equal, you've made an optimization. But if the transcoder is only ever used to transcode live video up to 60FPS, under no circumstances does the optimization provide any actual benefit. — MooseBoys, Oct 01 '14 at 17:31
@MooseBoys I see your point, but if it is optimized to perform the same task faster - all other things being equal - it will tax the system less, which is an actual benefit. — d7samurai, Oct 01 '14 at 17:37
Without going into detail, suffice to say that in this particular case, the layout engine is running in tandem with another, integrated engine, visualizing what is happening in the latter graphically. The less the layout engine is stealing processor cycles from / holding back the "main" engine, the better. — d7samurai, Oct 01 '14 at 17:44

score 1 · Answer 1 · answered Oct 01 '14 at 16:32

Layout engines tend to have a lot of chained transformations, so keeping your data in SSE registers for the duration of the chain is likely to improve performance (even more so than in typical game scenarios, which usually only have a handful of chained transformations). If you are specifically not going to use SSE in your custom class, then XMMATRIX will probably be faster. The extra column shouldn't really matter much, since each four-float row fits in one SSE register, but the extra row will mean an extra load. Still, the benefit of SSE is probably worth it.

That said, many modern compilers auto-vectorize now, so a custom class you write in vanilla C++ might end up getting SSE-optimized behind the scenes anyway.
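To make the comparison concrete, here is a minimal sketch of what such a vanilla C++ class could look like, next to a plain 4x4 multiply. The names `Mat3`, `Mat4`, `multiply`, `translation`, `scaling` and `embed` are hypothetical and not part of DirectXMath; the row-vector convention (v' = v * M) is an assumption chosen to mirror how XMMATRIX is normally used. Whether the compiler actually vectorizes these loops depends on the compiler and its flags, so treat this as something to measure rather than a result.

```cpp
#include <array>

// Hypothetical row-major matrices; not DirectXMath types.
struct Mat3 { std::array<float, 9> m; };
struct Mat4 { std::array<float, 16> m; };

// 3x3 product: 27 scalar multiplications.
inline Mat3 multiply(const Mat3& a, const Mat3& b) {
    Mat3 r{};
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < 3; ++k)
                sum += a.m[i * 3 + k] * b.m[k * 3 + j];
            r.m[i * 3 + j] = sum;
        }
    return r;
}

// 4x4 product: 64 scalar multiplications when written naively.
inline Mat4 multiply(const Mat4& a, const Mat4& b) {
    Mat4 r{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k)
                sum += a.m[i * 4 + k] * b.m[k * 4 + j];
            r.m[i * 4 + j] = sum;
        }
    return r;
}

// 2D translation; with row vectors the offset lives in the last row.
inline Mat3 translation(float tx, float ty) {
    return Mat3{{1.0f, 0.0f, 0.0f,
                 0.0f, 1.0f, 0.0f,
                 tx,   ty,   1.0f}};
}

// 2D scale about the origin.
inline Mat3 scaling(float sx, float sy) {
    return Mat3{{sx,   0.0f, 0.0f,
                 0.0f, sy,   0.0f,
                 0.0f, 0.0f, 1.0f}};
}

// Embed a 2D transform in a 4x4 matrix: z is left untouched and the
// 3x3 row/column index 2 moves to 4x4 row/column index 3.
inline Mat4 embed(const Mat3& a) {
    return Mat4{{a.m[0], a.m[1], 0.0f, a.m[2],
                 a.m[3], a.m[4], 0.0f, a.m[5],
                 0.0f,   0.0f,   1.0f, 0.0f,
                 a.m[6], a.m[7], 0.0f, a.m[8]}};
}
```

Multiplying two embedded matrices gives the same 2D result as multiplying the 3x3 originals; the extra work in the 4x4 case is the terms involving the unused z row and column. Whether that extra work costs anything once SIMD is involved is exactly what a benchmark on the target machine would have to show.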

Either way, you probably won't see any difference in the performance if you haven't already optimized your engine for caching behavior. For example, if your engine represents the hierarchy using pointers, and you just allocate new elements on the heap whenever you need them, you'll thrash the cache and have plenty of time to calculate transformations while you wait for memory, SSE or not.
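As an illustration of the caching point, one possible layout stores the visual tree in flat arrays with parent indices instead of heap-allocated nodes linked by pointers. This is only a sketch under stated assumptions: `Affine2D`, `concat` and `VisualTree` are hypothetical names, and it assumes every parent is added before its children so that a single forward pass can resolve all world transforms.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical 2D affine transform:
//   x' = a*x + c*y + tx
//   y' = b*x + d*y + ty
struct Affine2D {
    float a, b, c, d, tx, ty;
};

// Returns the transform that applies `first`, then `second`.
inline Affine2D concat(const Affine2D& first, const Affine2D& second) {
    return Affine2D{
        second.a * first.a + second.c * first.b,
        second.b * first.a + second.d * first.b,
        second.a * first.c + second.c * first.d,
        second.b * first.c + second.d * first.d,
        second.a * first.tx + second.c * first.ty + second.tx,
        second.b * first.tx + second.d * first.ty + second.ty};
}

// Flat, index-based tree: each array is contiguous in memory.
struct VisualTree {
    std::vector<int> parent;      // parent index, or -1 for a root
    std::vector<Affine2D> local;  // transform relative to the parent
    std::vector<Affine2D> world;  // resolved by updateWorld()

    // Assumption: parentIndex refers to an element that was already added.
    int add(int parentIndex, const Affine2D& localTransform) {
        parent.push_back(parentIndex);
        local.push_back(localTransform);
        world.push_back(localTransform);
        return static_cast<int>(parent.size()) - 1;
    }

    // One linear pass; parents precede children, so world[parent[i]]
    // is already up to date when element i is processed.
    void updateWorld() {
        for (std::size_t i = 0; i < parent.size(); ++i) {
            world[i] = parent[i] < 0
                           ? local[i]
                           : concat(local[i], world[parent[i]]);
        }
    }
};
```

With this layout the update walks memory sequentially, which tends to be friendlier to the cache and the prefetcher than chasing pointers to separately allocated nodes. The same structure works regardless of whether the per-element transform is a custom 3x3 type or an XMMATRIX.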
