AVX equivalent for _mm_storeu_ps?

Question

I have some quite fast AVX code, but it is just one single function; the rest of the huge project is SSE2, so I do NOT want to set the target architecture to AVX. At the end of each iteration I need to convert the 4 doubles in one YMM register to 4 floats and store them like this:

__m256d y = ......;
_mm_storeu_ps((float*)dst + i, _mm256_cvtpd_ps(y));

But MSVC is generating SSE2 code using movups (without the "v" prefix). Is there a way to force it to use just one AVX instruction? It seems quite ridiculous to me that the only known way is to use AVX as the target. I want to take advantage of AVX for just a single loop. The Intel compiler apparently understands this, and since I'm using AVX autodispatch it works well there, but generally the Intel compiler doesn't seem to be the way to go right now: it's slow and the code is worse than MSVC's, well, except for this...

Mercutio Calviary · Answer 1 · 2015-04-01T03:54:34.190

The AVX equivalent of _mm_storeu_ps(float*, __m128) is _mm256_storeu_ps(float*, __m256). To start off with, AVX is an instruction set. That means a CPU has to physically have the register space and FPU front end to be able to use it. In other words, AVX code given to a non-AVX processor won't run. You have to compile with /arch:AVX because it is the latest and it is backwards compatible with SSE, SSE2, SSE3, SSSE3, SSE4.1 and SSE4.2, but the inverse is not true. An AVX CPU implements all the SSE versions, but AVX is an instruction set foreign to SSE-only CPUs.
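As a rough illustration of the two points above (the 256-bit store, and the need for the CPU to actually support AVX), here is a minimal sketch in C. The names has_avx, cvt8_avx, cvt8_scalar and cvt8 are hypothetical, not from the question. It assumes GCC or Clang on x86, where a per-function target attribute allows AVX intrinsics without -mavx; the MSVC branch of the detection relies on __cpuid and _xgetbv and is untested here.

```c
#include <assert.h>
#include <stddef.h>
#include <immintrin.h>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

/* GCC/Clang need a per-function target to use AVX without -mavx;
   MSVC accepts AVX intrinsics regardless of /arch. */
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

/* Hypothetical helper: nonzero when both the CPU and the OS support AVX. */
static int has_avx(void)
{
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_cpu_supports("avx");
#else
    int info[4];
    __cpuid(info, 1);
    /* ECX bit 27 = OSXSAVE, ECX bit 28 = AVX */
    if ((info[2] & (1 << 27)) == 0 || (info[2] & (1 << 28)) == 0)
        return 0;
    /* XCR0 bits 1 and 2: the OS saves XMM and YMM state */
    return (_xgetbv(0) & 6) == 6;
#endif
}

/* Hypothetical kernel: converts 8 doubles to 8 floats, one 256-bit store. */
TARGET_AVX
static void cvt8_avx(float *dst, const double *src)
{
    __m128 lo = _mm256_cvtpd_ps(_mm256_loadu_pd(src));
    __m128 hi = _mm256_cvtpd_ps(_mm256_loadu_pd(src + 4));
    __m256 v  = _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
    _mm256_storeu_ps(dst, v);   /* 256-bit counterpart of _mm_storeu_ps */
    _mm256_zeroupper();         /* leave the upper YMM halves clean */
}

/* Scalar fallback for CPUs without AVX. */
static void cvt8_scalar(float *dst, const double *src)
{
    for (int i = 0; i < 8; ++i)
        dst[i] = (float)src[i];
}

/* Dispatcher: picks the AVX kernel only when it can actually run. */
void cvt8(float *dst, const double *src)
{
    assert(dst != NULL && src != NULL);
    if (has_avx())
        cvt8_avx(dst, src);
    else
        cvt8_scalar(dst, src);
}
```

Calling cvt8(dst, src) with arrays of 8 elements fills dst with the converted values on either path. In a real project the detection result would normally be cached rather than queried on every call.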

The memory operations are not particularly optimizable on a large scale due to CPU<-->RAM bandwidth shortcomings. If you are using very large arrays then you will not see the slightest difference between an SSE4 loadu and an AVX loadu: a 6 lane highway will only ever have the capacity to handle X many cars, no matter how many cars you ask from it at one time. This tends not to be the case with smaller arrays, where you can hide the latency of loading them behind other work. You should not mix AVX and SSE instructions, not only because of design complications but also because of internal CPU complications.

Furthermore, you do not want to go around changing the FPU state from AVX to SSE, because that in itself causes a performance overhead that most people do not consider. Either have large homogeneous sections of code in one state (SSE or AVX), or have everything in SSE, or everything in AVX.
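To make the "one homogeneous section" idea concrete, here is a hedged sketch of an AVX-only kernel that clears the upper YMM halves with _mm256_zeroupper() before control returns to SSE code. The function name scale_to_float_avx and the scale factor k are invented for illustration, and the caller is assumed to have already verified AVX support. With GCC or Clang, the target attribute makes the compiler use VEX encoding for the whole function, including the 128-bit store. I am not aware of a per-function equivalent in MSVC, where a common workaround is to put such a kernel in its own source file compiled with /arch:AVX.

```c
#include <stddef.h>
#include <immintrin.h>

#if defined(__GNUC__) || defined(__clang__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

/* Hypothetical kernel: dst[i] = (float)(src[i] * k).
   The caller must already have verified that AVX is available. */
TARGET_AVX
void scale_to_float_avx(float *dst, const double *src, double k, size_t n)
{
    const __m256d vk = _mm256_set1_pd(k);
    size_t i = 0;

    /* Main loop: 4 doubles in, 4 floats out per iteration. */
    for (; i + 4 <= n; i += 4) {
        __m256d y = _mm256_mul_pd(_mm256_loadu_pd(src + i), vk);
        _mm_storeu_ps(dst + i, _mm256_cvtpd_ps(y)); /* 128-bit store of 4 floats */
    }

    /* Clear the upper YMM halves before any legacy SSE code runs again. */
    _mm256_zeroupper();

    /* Scalar tail for the remaining 0..3 elements. */
    for (; i < n; ++i)
        dst[i] = (float)(src[i] * k);
}
```

The point of the layout is that all 256-bit work sits in one block, followed by a single _mm256_zeroupper(), so the surrounding SSE2 code never runs with dirty upper register halves.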

Grammar warning I can't Grammar

I know, I'm performing my own detection of whether AVX is available. My question was whether there is a way to make the compiler generate an AVX-prefixed instruction that uses an XMM register, not YMM! I want to write 4 floats, not 8. And this is all exactly to avoid the AVX->SSE switch penalty. — mrzacek mrzacek, Apr 01 '15 at 11:42
If you want to write 4 floats and the main body of your code is SSE code, then I do not understand why you want this bit to be AVX. If you only want to write 4 floats, use a masked store operation, or just overwrite some of the information later. But on another note, on processors with AVX the XMM registers are simply the low 128 bits of the YMM registers (XMM0-XMM7 alias the low halves of YMM0-YMM7). So if you see the disassembly playing around with those registers, it is effectively the same thing. — Mercutio Calviary, Apr 01 '15 at 17:24
Masked store, interesting, I'll check it out, though I heard these masked instructions are very very slow. — mrzacek mrzacek, Apr 02 '15 at 21:27
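
For reference, the masked store mentioned in the comments could look roughly like the sketch below. It uses _mm256_maskstore_ps, which writes only the lanes whose mask element has its top bit set. The helper name store_low4_masked is hypothetical, the caller is assumed to have verified AVX support, and nothing here is a claim about how fast the instruction is on any given CPU.

```c
#include <immintrin.h>

#if defined(__GNUC__) || defined(__clang__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

/* Hypothetical helper: loads 8 floats from src8 but writes only the
   low 4 of them to dst; dst[4..7] is left untouched. */
TARGET_AVX
void store_low4_masked(float *dst, const float *src8)
{
    /* A lane is stored when the top bit of its mask element is set. */
    const __m256i mask = _mm256_setr_epi32(-1, -1, -1, -1, 0, 0, 0, 0);
    __m256 v = _mm256_loadu_ps(src8);
    _mm256_maskstore_ps(dst, mask, v);
    _mm256_zeroupper();
}
```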
