I'm writing some audio processing software and I need to know how to do saturated arithmetic with SSE2 double-precision instructions. My values need to be normalized between -1 and 1. Is there a clever way to do this with SSE2 intrinsics, or do I need two sets of if/else statements (one for each value)?
-
Why are you even using double precision for audio? Anyway, you don't really need to saturate until you eventually convert back to whatever audio format you are using, at which point you can either use saturating pack instructions (if it's an integer format) or max/min instructions if you want to do it explicitly (see the sketch after these comments for the saturating-pack route). – Paul R Jul 06 '15 at 08:46
-
Well, the audio format can be processed as int32, int64, float32, or float64. I just happen to be doing the float64 part right now. – Caleb Merchant Jul 06 '15 at 12:35
-
OK - just use max/min operations then - see answer below... – Paul R Jul 06 '15 at 13:16
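To illustrate the saturating-pack route Paul R mentions in the first comment, here is a minimal sketch for the case where the eventual target is int16 samples. The helper name, the 32767 scale factor, and the multiple-of-4 length requirement are my assumptions, not from the thread; the point is that _mm_packs_epi32 (packssdw) saturates to [-32768, 32767] for free, so no explicit clamp is needed on the integer path:
#include <emmintrin.h>
#include <stdint.h>

/* Hypothetical helper: convert N doubles (N a multiple of 4) to int16
   samples; packssdw's signed saturation clips out-of-range values. */
void doubles_to_int16(const double *in, int16_t *out, int N)
{
    const __m128d scale = _mm_set1_pd(32767.0);
    for (int i = 0; i < N; i += 4)
    {
        /* scale two pairs of doubles and convert each pair to int32 */
        __m128i lo = _mm_cvtpd_epi32(_mm_mul_pd(_mm_loadu_pd(&in[i]), scale));
        __m128i hi = _mm_cvtpd_epi32(_mm_mul_pd(_mm_loadu_pd(&in[i + 2]), scale));
        /* combine the two int32 pairs into one vector of four int32 */
        __m128i v32 = _mm_unpacklo_epi64(lo, hi);
        /* pack to int16 with signed saturation, store four samples */
        _mm_storel_epi64((__m128i *)&out[i], _mm_packs_epi32(v32, v32));
    }
}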
1 Answer
To clip double precision values to a range of -1.0 to +1.0 you can use max/min operations. E.g. if you have a buffer, buff, of N double values:
#include <emmintrin.h>  // SSE2 intrinsics

const __m128d kMax = _mm_set1_pd(1.0);
const __m128d kMin = _mm_set1_pd(-1.0);
// process two doubles per iteration (assumes N is even)
for (int i = 0; i < N; i += 2)
{
    __m128d v = _mm_loadu_pd(&buff[i]);  // unaligned load of two doubles
    v = _mm_max_pd(v, kMin);             // clamp below at -1.0
    v = _mm_min_pd(v, kMax);             // clamp above at +1.0
    _mm_storeu_pd(&buff[i], v);          // store the clipped pair
}
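A self-contained usage sketch of the loop above (the buffer contents and size are my own example values, not part of the original answer; N must be even for this simple loop):
#include <emmintrin.h>
#include <stdio.h>

int main(void)
{
    double buff[4] = { 0.5, -1.7, 2.3, -0.25 };
    const int N = 4;

    const __m128d kMax = _mm_set1_pd(1.0);
    const __m128d kMin = _mm_set1_pd(-1.0);
    for (int i = 0; i < N; i += 2)
    {
        __m128d v = _mm_loadu_pd(&buff[i]);
        v = _mm_max_pd(v, kMin);
        v = _mm_min_pd(v, kMax);
        _mm_storeu_pd(&buff[i], v);
    }

    for (int i = 0; i < N; ++i)
        printf("%f\n", buff[i]);   /* prints 0.5 -1.0 1.0 -0.25 */
    return 0;
}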

Paul R
-
Wow... I just found something really interesting. All the intrinsic functions made it slower, and the more intrinsics I used, the slower it got. Using only primitive types (doubles) I did 500000 addition operations in 1738 nanoseconds. Using SSE2 only for the addition I got 5198 nanoseconds. Using your answer above I got 31888 nanoseconds. That makes no sense to me. Looking at the disassembly, though, they used the xmm registers. Could it be that the compiler knows how to optimize it better when it does everything itself? – Caleb Merchant Jul 06 '15 at 23:52
-
Two possible explanations: (1) you're using a debug build with no optimisation (i.e. `-O0`) for timing rather than a release build (`-O3`), and/or (2) your compiler is already vectorising the scalar code. – Paul R Jul 07 '15 at 05:50
-
I'm using Visual Studio and I compiled for Maximize Speed. And I believe it is vectorizing the code. – Caleb Merchant Jul 07 '15 at 05:52
-
That is indeed possible - you could always post a new question with your actual code (both the scalar code and the vectorised code) and see if anything can be improved upon. – Paul R Jul 07 '15 at 05:54
-
If you're doing a separate pass over the buffer to min/max it, instead of doing it at the end of your regular computation routine when the values are already in a vector register, that will be slower (especially if you work in chunks larger than L2 cache). My guess is you're probably using intrinsics in a way that's forcing the compiler to store to memory, then load, or something like that. Often just letting the auto-vectorizer do a good job is your best bet. Using intrinsics yourself is more often needed when you need clever shuffles. – Peter Cordes Jul 08 '15 at 20:49
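To make the last comment concrete, a minimal sketch of fusing the clamp into the regular processing pass and leaving vectorisation to the compiler; the gain step and the function name are assumptions for illustration (fmin/fmax are standard C99):
#include <math.h>

/* A plain scalar loop like this is a good candidate for auto-vectorisation
   at /O2 or -O3, and the clamp happens in the same pass as the computation,
   so the data doesn't make an extra trip through memory. */
void process_and_clip(double *buff, int N, double gain)
{
    for (int i = 0; i < N; ++i)
    {
        double v = buff[i] * gain;            /* regular computation */
        buff[i] = fmin(fmax(v, -1.0), 1.0);   /* clamp to [-1, 1] */
    }
}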