
I'm learning OpenMP and, with my limited knowledge, have parallelised my code. I'm now trying to improve this code using OpenMP vectorisation techniques, but while going through the relevant reading material (link), I found that it is not possible to perform vectorised operations on the long double data type. Can someone explain why this is so, and suggest a solution other than reducing the precision?

The content in the link is as follows: "Avoid operations not supported in SIMD hardware. Arithmetic with (80 bit) long doubles on Linux, and the remainder operator “%” are examples of operations not supported in SIMD hardware. "

P.S. I'm using the Intel C++ compiler 16.0.2 and an Intel Xeon processor with 128-bit vector registers, on Linux. My data types are mostly long double.
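To illustrate, here is a minimal sketch of the kind of loop I mean (the names are made up for this example, not from my actual solver):

```cpp
#include <cstddef>

// Illustrative only: a loop of this shape vectorises fine with double,
// but the compiler reports it cannot vectorise it with long double.
void axpy(long double *y, const long double *x, long double a, std::size_t n) {
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```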

  • I think you are confusing parallelism with vector intrinsics. I suspect it is possible to parallelize these operations, even if they don't map to Intel's ISA. Anyway, you need to post a complete, runnable example. – Mikhail May 09 '16 at 07:40
  • Because the x86 SIMD hardware (SSE through AVX512) only supports 32-bit and 64-bit float operations and does not have integer divide instructions. – Z boson May 09 '16 at 09:26
  • Why are you using long double? – Z boson May 09 '16 at 09:29
  • @Zboson I'm developing a CFD solver. To ensure numerical stability of the scheme we need good precision. That's why. – prasanna May 09 '16 at 10:06
  • @prasanna: There are very few algorithms that become stable only with 54 bits of precision, and those few are trivially converted to equivalent algorithms that are stable with fewer bits. You might find that iterative solutions converge in fewer steps if you have higher precision; that's the real benefit, but you lose that again when each higher-precision step is slower. So 64-bit double is likely fastest. – MSalters May 10 '16 at 07:25

1 Answer


The SIMD instructions of the x86 instruction set only support 32-bit and 64-bit floating point operations (with some limited support for 16-bit floats). Additionally, even though there are 64-bit times 64-bit to 128-bit scalar integer instructions (e.g. mulx) there are no corresponding SIMD instructions. Many people have tried and failed to implement efficient 128-bit integer x86 SIMD arithmetic (there are some exceptions for multiplication and maybe addition). There are no general x86 SIMD integer division instructions.
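For a sense of the scalar/SIMD gap on the integer side: compilers will lower the widening multiply below to a single mul/mulx instruction, but there is no SIMD counterpart to vectorise it. This sketch assumes the unsigned __int128 extension available in GCC, Clang, and ICC:

```cpp
#include <cstdint>

// 64-bit x 64-bit -> 128-bit product: one scalar instruction (mul/mulx),
// but no equivalent exists among the x86 SIMD instructions.
void widen_mul(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo) {
    unsigned __int128 p = (unsigned __int128)a * b;
    *lo = (uint64_t)p;
    *hi = (uint64_t)(p >> 64);
}
```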

However, for floating point, people have had more success with higher-precision SIMD operations using double-double arithmetic. Double-double has 106 bits of precision, compared with the 64 bits of an 80-bit long double. But not every C++ compiler uses an 80-bit long double: some just use plain double (e.g. MSVC), which has only 53 bits of precision, some use 128-bit quad precision, which has 113 bits, and Wikipedia even claims that with some compilers long double is implemented as double-double.

I described some details of double-double here. Note that double-double is not an IEEE floating-point type and it has some unusual properties. Also, the range of double-double is the same as double, so it only improves the precision, not the range.
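As a rough sketch of the core building blocks (Knuth's TwoSum and an FMA-based TwoProd; the struct and function names here are mine, not from any particular library):

```cpp
#include <cmath>

struct dd { double hi, lo; };  // value = hi + lo, with |lo| <= 0.5 ulp(hi)

// Knuth's TwoSum: s + e == a + b exactly.
static inline dd two_sum(double a, double b) {
    double s  = a + b;
    double bb = s - a;
    double e  = (a - (s - bb)) + (b - bb);
    return {s, e};
}

// TwoProd via fused multiply-add: p + e == a * b exactly.
static inline dd two_prod(double a, double b) {
    double p = a * b;
    double e = std::fma(a, b, -p);
    return {p, e};
}

// Double-double addition (simplified; see the link above for a full version).
static inline dd dd_add(dd a, dd b) {
    dd s = two_sum(a.hi, b.hi);
    s.lo += a.lo + b.lo;
    return two_sum(s.hi, s.lo);  // renormalise
}
```

Note that code like this has to be compiled with strict floating-point semantics (e.g. icc's -fp-model precise); otherwise the compiler may algebraically simplify the error terms away.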

How fast is double-double compared to long double? I have never tested this. But I found double-double to be about 10 times slower than double operations when doing a somewhat balanced mix of multiplication and addition. And long double is certainly slower than double (except where it is implemented as double). But since you can use SIMD with double-double, and not with the built-in long double, the speed improves in proportion to the SIMD width: 2 double-double operations at a time with SSE2, 4 with AVX, and 8 with AVX-512.
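For instance, Knuth's TwoSum transliterates directly to intrinsics, so one instruction sequence handles four lanes of double-double components at once with AVX (a sketch, not a tuned implementation):

```cpp
#include <immintrin.h>

// Knuth's TwoSum on four lanes at once: *s + *e == a + b exactly, per lane.
static inline void two_sum_avx(__m256d a, __m256d b, __m256d *s, __m256d *e) {
    __m256d sum = _mm256_add_pd(a, b);
    __m256d bb  = _mm256_sub_pd(sum, a);
    *e = _mm256_add_pd(_mm256_sub_pd(a, _mm256_sub_pd(sum, bb)),
                       _mm256_sub_pd(b, bb));
    *s = sum;
}
```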

Don't expect OpenMP's simd construct to implement double-double for you, though. You will need to implement it yourself or find a library.
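For example, a compensated summation loop built from the sketched dd type above; this is the kind of code the OpenMP simd construct will not generate on its own from a long double accumulator:

```cpp
// Sum an array of doubles into a double-double accumulator:
// roughly twice the precision of double, without long double.
double dd_sum(const double *x, int n) {
    dd acc = {0.0, 0.0};
    for (int i = 0; i < n; ++i)
        acc = dd_add(acc, dd{x[i], 0.0});  // promote x[i] to dd, then add
    return acc.hi + acc.lo;
}
```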
