
I multiply and round four 32-bit floats, then convert them to four 16-bit integers with SSE intrinsics. I'd like to store the four integer results to an array. With floats it's easy: _mm_store_ps(float_ptr, m128value). However, I haven't found any instruction to do this with 16-bit (__m64) integers.

void process(float *fptr, int16_t *sptr, __m128 factor)
{
  __m128 a = _mm_load_ps(fptr);
  __m128 b = _mm_mul_ps(a, factor);
  __m128 c = _mm_round_ps(b, _MM_FROUND_TO_NEAREST_INT);
  __m64 s = _mm_cvtps_pi16(c);
  // now store the values to sptr
}

Any help would be appreciated.

plasmacel

3 Answers


Personally I would avoid using MMX. Also, I would use an explicit store rather than an implicit one, which often only works on certain compilers. The following code works fine in MSVC 2012 with SSE 4.1.

Note that fptr needs to be 16-byte aligned for _mm_load_ps. This is usually not a problem in 64-bit mode, but in 32-bit mode you should make sure it's aligned.

#include <stdio.h>
#include <stdint.h>
#include <smmintrin.h>

void process(float *fptr, int16_t *sptr, __m128 factor)
{
  __m128 a = _mm_load_ps(fptr);
  __m128 b = _mm_mul_ps(a, factor);
  __m128i c = _mm_cvtps_epi32(b); // rounds to nearest (MXCSR default), like the question's _mm_round_ps
  __m128i d = _mm_packs_epi32(c,c);
  _mm_storel_epi64((__m128i*)sptr, d);
}

int main() {
    float x[] = {1.0, 2.0, 3.0, 4.0};
    int16_t y[4];
    __m128 factor = _mm_set1_ps(3.14159f);
    process(x, y, factor);
    printf("%d %d %d %d\n", y[0], y[1], y[2], y[3]);
}
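If you cannot guarantee that fptr is 16-byte aligned, the same approach works with an unaligned load (a sketch; process_unaligned is a hypothetical name — the MOVQ store used here has no alignment requirement either way):

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 is enough for this variant */

/* Variant for when fptr may not be 16-byte aligned:
   _mm_loadu_ps accepts any alignment. */
void process_unaligned(float *fptr, int16_t *sptr, __m128 factor)
{
  __m128  a = _mm_loadu_ps(fptr);      /* unaligned load */
  __m128  b = _mm_mul_ps(a, factor);
  __m128i c = _mm_cvtps_epi32(b);      /* round to nearest (MXCSR default) */
  __m128i d = _mm_packs_epi32(c, c);   /* signed saturation down to int16 */
  _mm_storel_epi64((__m128i*)sptr, d); /* MOVQ: low 64 bits = 4 x int16 */
}
```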

Note that _mm_cvtps_pi16 is not a simple intrinsic; the Intel Intrinsics Guide says: "This intrinsic creates a sequence of two or more instructions, and may perform worse than a native instruction. Consider the performance impact of this intrinsic."

Here is the assembly output using the MMX version

mulps   (%rdi), %xmm0
roundps $0, %xmm0, %xmm0
movaps  %xmm0, %xmm1
cvtps2pi    %xmm0, %mm0
movhlps %xmm0, %xmm1
cvtps2pi    %xmm1, %mm1
packssdw    %mm1, %mm0
movq    %mm0, (%rsi)
ret

Here is the assembly output using the SSE-only version

mulps   (%rdi), %xmm0
cvtps2dq    %xmm0, %xmm0
packssdw    %xmm0, %xmm0
movq    %xmm0, (%rsi)
ret
Z boson
  • This is exactly what I want! However _mm_packs_epi32 should be used instead of _mm_packus_epi32 to keep signed values, or am I wrong? – plasmacel Feb 26 '14 at 16:01
  • Besides the fact that SSE is faster, using MMX opens you to the possible bug of omitting EMMS (as in this example), which is a serious bug (and seriously hard to diagnose when some apparently unrelated FP computation millions of cycles later starts misbehaving). Just say no to MMX. – Stephen Canon Feb 26 '14 at 16:21

With __m64 types, you can just cast the destination pointer appropriately:

void process(float *fptr, int16_t *sptr, __m128 factor)
{
  __m128 a = _mm_load_ps(fptr);
  __m128 b = _mm_mul_ps(a, factor);
  __m128 c = _mm_round_ps(b, _MM_FROUND_TO_NEAREST_INT);
  __m64 s = _mm_cvtps_pi16(c);
  *((__m64 *) sptr) = s;
}

There is no distinction between aligned and unaligned stores with MMX instructions like there is with SSE/AVX; therefore, you don't need the intrinsics to perform a store.
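If you do take the MMX route, remember to issue EMMS afterwards, so the shared MMX/x87 register state is handed back before any later x87 floating-point code runs. A minimal sketch (process_mmx is a hypothetical name; an unaligned load is used to keep the example safe):

```c
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_cvtps_pi16 and _mm_empty */

/* _mm_cvtps_pi16 rounds per MXCSR (nearest by default) and packs the
   four floats into four int16 lanes of one __m64. */
void process_mmx(float *fptr, int16_t *sptr, __m128 factor)
{
  __m128 a = _mm_loadu_ps(fptr);  /* unaligned load */
  __m128 b = _mm_mul_ps(a, factor);
  __m64  s = _mm_cvtps_pi16(b);
  *(__m64 *)sptr = s;
  _mm_empty();  /* EMMS: release the MMX state so x87 FP works again */
}
```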

Jason R
  • MSDN says __m64 types are not supported on x64 processors. What does it exactly mean? According to http://msdn.microsoft.com/en-us/library/08x3t697.aspx – plasmacel Feb 26 '14 at 13:50
  • @plasmacel: I believe that is just a limitation of Visual Studio's 64-bit compiler (not sure if it's any kind of Windows limitation). I have production code in use now that uses MMX instructions on x86-64 architecture machines (on Linux, built using gcc or Intel C++). – Jason R Feb 26 '14 at 13:58
  • Rather than using `__m64`, you can simply stick with `__m128` and use `_mm_storel_epi64` (`MOVQ`) to store the low 64 bits. There isn't any really good reason to use MMX today. – Stephen Canon Feb 26 '14 at 14:44
  • @StephenCanon: That will store two 32bit values (from the low 64bit), rather than all the four values with 16bit precision. – plasmacel Feb 26 '14 at 15:16
  • @plasmacel: It stores the low 64 bits. It doesn’t care if they’re two 32-bit integers or four 16-bit integers. – Stephen Canon Feb 26 '14 at 16:17
  • @StephenCanon: For the above application, wouldn't you need to first move the 64-bit result back to an SSE register? That might add additional overhead that isn't really buying you anything. I do agree with you that MMX has few applications today, though. – Jason R Feb 26 '14 at 16:35
  • You would never pass the data through MMX at all (by using `_mm_cvtps_epi32` + `_mm_packs_epi32` instead of `_mm_cvtps_pi16` as shown in Z Boson's answer; despite needing two intrinsics instead of one, this is actually more efficient). – Stephen Canon Feb 26 '14 at 16:43

I think you're safe moving that to a general 64-bit register (long long is 64 bits under both Linux's LP64 and Windows' LLP64 data models) and copying it yourself.

From what I read in xmmintrin.h, gcc will handle the cast perfectly fine from __m64 to a long long. To be sure, you can use _mm_cvtsi64_si64x.

short *f = sptr;  // the destination array from the question's process()
long long b = _mm_cvtsi64_si64x(s);
f[0] = b & 0xFFFF;          // lane 0 sits in the low 16 bits on x86
f[1] = (b >> 16) & 0xFFFF;
f[2] = (b >> 32) & 0xFFFF;
f[3] = (b >> 48) & 0xFFFF;

You could type-pun that with a union to make it look better; union-based type punning is actually well defined in C99 and later, though it is undefined behavior in C++.
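Alternatively, memcpy is the portable way to reinterpret the bytes in both C and C++; compilers turn it into a plain 64-bit move. A sketch (store4 is a hypothetical name; the lane order shown assumes a little-endian target such as x86):

```c
#include <stdint.h>
#include <string.h>

/* Copy the 64-bit value's bytes into four int16_t. memcpy-based type
   punning never violates aliasing rules, unlike pointer casts. */
void store4(int16_t *f, long long b)
{
  memcpy(f, &b, sizeof b);  /* f[0] gets the low 16 bits on little-endian */
}
```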

NewbiZ
  • I haven't found any reference about _mm_cvtsi64_si64x. Nor on http://software.intel.com/sites/landingpage/IntrinsicsGuide – plasmacel Feb 26 '14 at 15:31
  • As I see in a custom header file it's simply implemented as a cast: _mm_cvtsi64_si64x(__m64 __i) { return (long long)__i; } – plasmacel Feb 26 '14 at 15:40