I have some fairly fast AVX code, but it is just one single function using AVX; the rest of the huge project is built for SSE2, so I do NOT want to set the target architecture to AVX. At the end of each iteration I need to convert the 4 doubles in one YMM register to 4 floats and store them like this:
__m256d y = ......;
_mm_storeu_ps((float*)dst + i, _mm256_cvtpd_ps(y));
But MSVC generates SSE2 code for this, using movups (without the "v" prefix). Is there a way to force it to just use one AVX instruction? It seems quite ridiculous to me that the only known way is to set AVX as the target. I want to take advantage of AVX in just a single loop. The Intel compiler apparently understands this, and since I'm using AVX auto-dispatch it works well there, but in general the Intel compiler doesn't seem to be the way to go right now: it's slow and its code is worse than MSVC's, well, except for this...
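For reference, here is a minimal self-contained sketch of the kind of loop I mean. The function name `convert_pd_to_ps`, the `TARGET_AVX` macro and the scalar tail handling are made up for illustration and are not my real code; the per-function target attribute only applies to GCC/Clang and is there just so the snippet builds without a global AVX switch on those compilers. It does not change what MSVC emits for the store.

```cpp
#include <immintrin.h>
#include <cstddef>

// Assumption: GCC/Clang need a per-function target attribute to accept AVX
// intrinsics without -mavx. MSVC accepts the intrinsics regardless of /arch.
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

// Hypothetical helper: converts n doubles to floats, 4 per iteration.
TARGET_AVX
void convert_pd_to_ps(const double* src, float* dst, std::size_t n)
{
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d y = _mm256_loadu_pd(src + i);        // 4 doubles in one YMM register
        _mm_storeu_ps(dst + i, _mm256_cvtpd_ps(y));  // 4 floats via a 128-bit store
    }
    // Clear the upper YMM halves before returning to non-VEX (SSE2) code.
    _mm256_zeroupper();
    // Scalar tail for the remaining 0..3 elements.
    for (; i < n; ++i)
        dst[i] = static_cast<float>(src[i]);
}
```

Calling `convert_pd_to_ps(src, dst, n)` on a CPU with AVX support should fill `dst` with the rounded values; the question is only about which encoding the compiler picks for the 128-bit store.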