I have a legacy Windows DLL (written in C++) for which I need to maintain a 32-bit version along with the 64-bit version. I'm updating the heavy math code to use SIMD via Agner Fog's vector class library (VCL), and I'm seeing little or no speed improvement in the 32-bit version when compiling with AVX as compared to SSE4.2. I'm aware that 32-bit code only ever has 8 vector registers available, but I'm not clear (after much searching) on exactly what this means when compiling with AVX, AVX2 or AVX512. Are there compiler options (Microsoft or Clang) that will give me some worthwhile speed improvement over SSE4.2 (for simple loops of floating-point operations), or should I just save myself some trouble and compile the 32-bit version with SSE4.2?
-
If you don't plan to invest much time in hand optimizations of your code (e.g. to rewrite the hot code with intrinsics) then you already have your answer - you benchmark the code and keep whatever provides any meaningful benefit. – Andrey Semashev Dec 16 '21 at 18:00
-
Well, I'm not opposed to using intrinsics for this job, since it's pretty small and simple ... are you suggesting that I can (probably? maybe?) get a worthwhile speed boost with proper intrinsics coding that Agner's library doesn't provide? – dts Dec 16 '21 at 18:09
-
I have no idea, as I haven't seen the code in question. The fact that you are using a high performance library does not necessarily mean that there's no room for improvement. – Andrey Semashev Dec 16 '21 at 19:46
-
Also, I should note that the amount of speedup you can get depends on the hardware that will be running your software. For example, Zen 1 won't benefit much from AVX2 since its execution units are 128-bit internally, so 256-bit operations will have twice the latency compared to 128-bit ones. Other CPUs will benefit more. – Andrey Semashev Dec 16 '21 at 19:53
-
It can be assumed that the code is just looping through two 100-element arrays of doubles and multiplying them element-wise. It can be assumed that the hardware is anything "most likely to succeed" (say, for definiteness, with AVX but not AVX2), since my question is generic on the topic of a 32-bit compile versus a 64-bit compile of the same code (with MSVC /arch:AVX the 64-bit speed gain from vectorization is on the order of 3x, but the 32-bit speed gain is on the order of 2x, very nearly the same as the speed gain for the 32-bit build compiled with /arch:SSE2). – dts Dec 16 '21 at 20:19
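The kernel described above is essentially the following (a minimal sketch with placeholder names, not the actual DLL code):

    // Scalar baseline of the element-wise product described above (illustrative names).
    void multiply_elementwise(const double* up, const double* dn, double* tr, int n)
    {
        for (int k = 0; k < n; k++)
            tr[k] = up[k] * dn[k];   // one independent multiply per element
    }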
-
Looping through an array performing trivial operations on the data - that is the problem right there. You aren't making full use of the execution ports on the CPU, so you are memory bandwidth limited. Executing the code in 128, 256, or 512 bit chunks isn't going to make a difference to you in this case. Also be aware that if you aren't using the results anywhere, the compiler may have optimised out the entire loop... – robthebloke Dec 16 '21 at 22:59
-
Thanks for the comment. Not sure exactly where you are coming from though ... everything is working great for /arch:SSE2, the code runs almost twice as fast as it does with no vectorization. But for the 32-bit version I don't get anything more with /arch:AVX ... is your comment addressing this? I should repeat, I am getting the expected AVX speed increase with the 64-bit compile ... – dts Dec 16 '21 at 23:14
-
Please [edit] your question and add all relevant information you gave in the comments. Also, ideally provide something like a [mre] and describe how you benchmarked it (it does not have to be exactly the code of your library, but it should illustrate the problem). – chtz Dec 17 '21 at 00:08
-
@dts Did you actually inspect the generated assembler code? `/arch:AVX` does not mean that the compiler will generate 256-bit vector instructions. And as a general comment, don't expect much from compiler auto-vectorization. It is good as a free added bonus, but if you actually care about performance of a certain piece of code, you should manually vectorize it. – Andrey Semashev Dec 17 '21 at 15:35
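To illustrate the manual-vectorization suggestion above, a hand-written AVX version of the element-wise product might look roughly like this (a sketch using standard AVX intrinsics, not code from the question; the function name, unaligned loads, and scalar tail handling are assumptions):

    #include <immintrin.h>

    // Sketch of manual AVX vectorization of the element-wise product.
    // Processes 4 doubles per iteration; the tail loop handles n not divisible by 4.
    void multiply_avx(const double* up, const double* dn, double* tr, size_t n)
    {
        size_t k = 0;
        for (; k + 4 <= n; k += 4)
        {
            __m256d vu = _mm256_loadu_pd(up + k);   // load 4 doubles (unaligned OK)
            __m256d vd = _mm256_loadu_pd(dn + k);
            _mm256_storeu_pd(tr + k, _mm256_mul_pd(vu, vd));
        }
        for (; k < n; k++)                           // scalar remainder
            tr[k] = up[k] * dn[k];
    }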
-
I don't expect anything for auto-vectorization from the MS compiler ... when I've dumped the logs for what it does, it's pitiful. And the particular thing I'm vectorizing for this project (binomial trees for option pricing) I don't think any compiler will auto-handle in the usual form the code takes. – dts Dec 17 '21 at 22:58
1 Answer
I'm answering this question myself even though the question should arguably just be deleted ... maybe it will help someone, sometime.
By the time I got my SIMD code punched up (aligning the memory made a big difference) and had fiddled around with the MSVC compiler options, my 32-bit build started behaving exactly as expected when comparing no SIMD to SSE4.2, AVX and AVX512. Benchmarking the sample code below showed speed improvements of 48%, 22% and 10% for SSE4.2, AVX and AVX512, respectively, for the 32-bit build.
Oddly, the 64-bit build runs much faster with no SIMD but slightly SLOWER than the 32-bit build for all three SIMD options (a good subject for a new question).
I compiled the code without the /Qpar switch and with /Qvec-report:2 /Qpar-report:2 to verify, as far as possible, that no auto-vectorization or auto-parallelization was going on.
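The listing below does not show the headers or the SIMD_SIZE_SPN macro it relies on; a plausible preamble would be the following (deriving the macro from the compilers' predefined feature macros is my assumption, not necessarily how the real project defines it):

    #include "vectorclass.h"   // Agner Fog's VCL: Vec2d, Vec4d, Vec8d
    #include <cstdint>         // uintptr_t for the alignment check

    // Hypothetical definition of SIMD_SIZE_SPN from predefined feature macros
    // (set by MSVC /arch:AVX512 and /arch:AVX, and by Clang -mavx512f / -mavx).
    #if defined(__AVX512F__)
    #define SIMD_SIZE_SPN 8    // Vec8d: 8 doubles per 512-bit vector
    #elif defined(__AVX__)
    #define SIMD_SIZE_SPN 4    // Vec4d: 4 doubles per 256-bit vector
    #else
    #define SIMD_SIZE_SPN 2    // Vec2d: 2 doubles per 128-bit vector
    #endif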
int Simd_debug(int idebug_branch, int iters, int asize)
{
    int j, k, iret = -3;
    double u, d;
    double* TR = 0;
    double* UP = 0;
    double* DN = 0;
    char* TR_unaligned = 0;
    char* UP_unaligned = 0;
    char* DN_unaligned = 0;
    const int vectorsize = SIMD_SIZE_SPN; //8, 4, 2 = AVX512, AVX, SSE size
#if SIMD_SIZE_SPN == 8
    Vec8d vec_up, vec_dn, vec_tree;
#elif SIMD_SIZE_SPN == 4
    Vec4d vec_up, vec_dn, vec_tree;
#else
    Vec2d vec_up, vec_dn, vec_tree;
#endif
    const bool go_align_mem = true;
    bool go_simd = (idebug_branch != 2);
    bool go_intrinsic = (idebug_branch == 1); //set but not used in this sample
    int alignby = sizeof(double) * vectorsize;
    int datasize = asize;
    int arraysize = (datasize + vectorsize - 1) & (-vectorsize); //round up to a multiple of vectorsize
    int regularpart = arraysize & (-vectorsize);                 //portion that fits in whole vectors

    if (go_simd)
    {
        if (go_align_mem)
        {
            //over-allocate, then round each pointer up to the next alignby boundary
            TR_unaligned = new char[arraysize * sizeof(double) + alignby];
            char* TR_aligned = (char*)(((size_t)TR_unaligned + alignby - 1) & (-alignby));
            TR = (double*)TR_aligned;

            UP_unaligned = new char[arraysize * sizeof(double) + alignby];
            char* UP_aligned = (char*)(((size_t)UP_unaligned + alignby - 1) & (-alignby));
            UP = (double*)UP_aligned;

            DN_unaligned = new char[arraysize * sizeof(double) + alignby];
            char* DN_aligned = (char*)(((size_t)DN_unaligned + alignby - 1) & (-alignby));
            DN = (double*)DN_aligned;

            //debug check alignment
            if ((((uintptr_t)TR & (alignby - 1)) != 0) || (((uintptr_t)UP & (alignby - 1)) != 0) || (((uintptr_t)DN & (alignby - 1)) != 0))
            {
                iret = -703;
                goto bail_out;
            }
        }
        else
        {
            TR = new double[arraysize];
            UP = new double[arraysize];
            DN = new double[arraysize];
        }//if (go_align_mem)
    }
    else
    {
        TR = new double[arraysize];
        UP = new double[arraysize];
        DN = new double[arraysize];
    }//if (go_simd)

    //fill UP and DN with the up/down factors u^k and d^k of a binomial tree
    u = 1.01;
    d = 0.99;
    UP[0] = u;
    DN[0] = d;
    for (k = 1; k < arraysize; k++)
    {
        UP[k] = u * UP[k - 1];
        DN[k] = d * DN[k - 1];
    }

    for (j = 0; j < iters; j++)
    {
        if (go_simd)
        {
            for (k = 0; k < regularpart; k += vectorsize)
            {
                vec_up.load(UP + k);
                vec_dn.load(DN + k);
                vec_tree = vec_up * vec_dn;
                vec_tree.store(TR + k);
            }
        }
        else
        {
#pragma loop(no_vector) //don't need this, according to /Qvec-report:2 ...
            for (k = 0; k < arraysize; k++)
            {
                TR[k] = UP[k] * DN[k];
            }
        }//if (go_simd)
    }

    iret = 10000 * idebug_branch + arraysize;

bail_out:
    if (go_simd && go_align_mem)
    {
        delete[] TR_unaligned;
        delete[] UP_unaligned;
        delete[] DN_unaligned;
    }
    else
    {
        delete[] TR;
        delete[] UP;
        delete[] DN;
    }
    return iret;
}
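The timing harness isn't shown either; a minimal sketch of how Simd_debug might be benchmarked follows (the iteration counts and std::chrono timing are illustrative, not the setup used for the numbers quoted above):

    #include <chrono>
    #include <cstdio>

    int main()
    {
        const int iters = 100000, asize = 100;        //assumed values; the question mentions 100-element arrays
        for (int branch = 0; branch <= 2; branch++)   //0 and 1 take the SIMD path, 2 takes the scalar loop
        {
            auto t0 = std::chrono::steady_clock::now();
            int ret = Simd_debug(branch, iters, asize);
            auto t1 = std::chrono::steady_clock::now();
            double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
            printf("branch %d: ret=%d, %.2f ms\n", branch, ret, ms);
        }
        return 0;
    }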

-
`int` instead of `size_t` loop counters could possibly be redoing sign-extension inside the loop. That might explain why 32-bit is faster; in that case `int` is already pointer-width. Or maybe some coincidence of code alignment, e.g. possibly the JCC erratum on a Skylake-family CPU. – Peter Cordes Dec 28 '21 at 22:24
-
Thanks, that's a big help. I'm testing on a Rocket Lake CPU not Skylake, but you nailed it with the size_t ... changing to size_t increased the 64-bit speed by about 14%, 10% and 27% for SSE4.2, AVX and AVX512, respectively, and made the 64-bit a chunk faster than the 32-bit (but not nearly as much faster as the non-simd difference). I'm responsible for my own code and in this case I was being lazy by just copying from Agner's doc since it's the first time I've used his VCL ... but why would Agner, the king of optimization, write all of his loops with ints (section 9.4 of his VCL Manual)? – dts Dec 29 '21 at 00:40
-
Ok, so probably just MSVC wasting extra instructions on sign-extension, not the JCC erratum workaround which doesn't apply to Rocket Lake. Good compilers (e.g. GCC and clang) can optimize loops with pointers and `int` indexes in many cases (including this one: https://godbolt.org/z/x1Ybb9Yxq), because signed-overflow is UB. http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html. MSVC sometimes doesn't, even with `-O2` optimization enabled. (GCC / clang aren't perfect either, but this case is quite simple for the inner-most loop that doesn't have an `if` inside it.) – Peter Cordes Dec 29 '21 at 03:59
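To make the fix concrete, the change that produced the speedup discussed above amounts to using a pointer-width index in the hot loop of the answer's code, roughly like this (a sketch of the idea, not the exact edit):

    //size_t (pointer-width) loop counter avoids MSVC re-doing sign-extension of an int index;
    //the loop body is otherwise unchanged from the listing above
    for (size_t k = 0; k < (size_t)regularpart; k += vectorsize)
    {
        vec_up.load(UP + k);
        vec_dn.load(DN + k);
        vec_tree = vec_up * vec_dn;
        vec_tree.store(TR + k);
    }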
-
Thanks for the additional insights, very helpful. I've been using /O2 and /Ot but not "whole program optimization". Planning on moving this project to Clang after the MSVC version is stable. – dts Dec 30 '21 at 12:21