17

The __fp16 floating-point data type is a well-known extension to the C standard, used notably on ARM processors. I would like to use the IEEE version of it on my x86_64 processor. While I know x86_64 typically does not support it natively, I would be fine with emulating it with "unsigned short" storage (same alignment requirement and storage size) and (hardware) float arithmetic, roughly as sketched below.

Is there a way to request that in gcc?

I assume the rounding might be slightly "incorrect", but that is fine with me.

If this were to work in C++ too that would be ideal.
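To illustrate, this is roughly the emulation I have in mind, hand-written with the scalar F16C conversion intrinsics (the type alias and helper names exist only for this sketch; it needs -mf16c):

#include <immintrin.h>   // _cvtsh_ss / _cvtss_sh, requires -mf16c

using half_storage = unsigned short;   // same size and alignment as __fp16

inline float to_float(half_storage h) { return _cvtsh_ss(h); }                            // half -> float
inline half_storage to_half(float f)  { return _cvtss_sh(f, _MM_FROUND_TO_NEAREST_INT); } // float -> half

inline half_storage half_mul(half_storage a, half_storage b)
{
    return to_half(to_float(a) * to_float(b));   // the actual arithmetic is done in hardware float
}

Of course I do not want to hand-write wrappers like this everywhere; I want the compiler to generate the conversions for __fp16 itself.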

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
Nonyme
  • 1,220
  • 1
  • 11
  • 22
  • 1
    I don't think it has this for x86 targets. If it did, it would be *very* slow, because it would all have to be run in software emulation, rather than using FP hardware. Why would you want to do this? – Cody Gray - on strike Jul 14 '17 at 18:39
  • 8
    @CodyGray: half-precision floats are natively supported by reasonably recent (Intel since Ivy Bridge, AMD since Piledriver) x86 CPUs (as a storage format only, conversion to single precision is required to do actual computation). – Fanael Jul 14 '17 at 18:41
  • 3
    Ah yes, so they are, @Fanael. Thanks for pointing that out. I had missed their introduction. So what you would use would be `_mm256_cvtph_ps` as the "load" (convert half-float to float), and `_mm256_cvtps_ph` as the "store" (convert float to half-float). It turns out this is reasonably fast, and is actually useful in situations where you're memory-constrained. Would it be acceptable, Nonyme, to implement this using intrinsics in something like a platform abstraction library? Or are you dead-set on having the compiler generate this code implicitly? (A sketch of this load/compute/store pattern follows these comments.) – Cody Gray - on strike Jul 14 '17 at 18:50
  • Gcc doesn't properly model a 16-bit float type on x86. An instruction like vcvtph2ps is currently represented as a black box that takes a vector of short and returns a vector of float. So hand-writing the code with intrinsics is your best option. – Marc Glisse Jul 15 '17 at 08:35
  • 5
    The goal is to run a huge code-base designed for ARM on an x86_64 server farm. If the "platform abstraction library" does not require any modification of the code, then that is OK, but I doubt that is doable. Note: I managed to trick Clang into doing just that, by getting the semantic parser to define __fp16 and accept it as a function argument/return type on x86_64. It then used the aforementioned intrinsics to do the conversions and compute using floats instead. – Nonyme Jul 16 '17 at 18:53
  • Can you explain how you "tricked" clang into doing this? I'm having the exact same problem with gcc. – underpickled Jul 20 '18 at 17:34
  • 2
    I edited clang source code to add the __fp16 built-in type on X86 targets (by default it is only enabled on ARM). Then the rest of the compiler dealt with it by itself. – Nonyme Jul 20 '18 at 18:49
  • You can try `clang -cc1 -fnative-half-type -fallow-half-arguments-and-returns` – Nonyme Jul 20 '18 at 19:07
  • 1
    @CodyGray, there are more reasons to do things besides performance. I'm currently dealing with upwards of 8 GB of floating-point data in memory, 16 bits is more than enough precision, and the cost of converting to/from _Float16 would not be a performance bottleneck. – Zendel Apr 05 '19 at 12:49
  • @Zendel But...once you're doing software emulation, you can build an even better custom (domain-specific) solution that will likely have better precision *and* be faster. The edge-cases of IEEE floating point arithmetic are ugly, and often not needed. I'm also pretty skeptical of the claim that 16 bits is "more than enough precision". It's certainly not if you're doing arithmetic on those values, thanks to issues like catastrophic cancellation. So, to me, 16-bit FP is really only useful as a storage/representational format. Why do all those values need to be in memory at once? – Cody Gray - on strike Apr 06 '19 at 06:32
  • 1
    @CodyGray, how can you seriously know if 16 bits is enough precision for other people! 16 bit floating point is used extensively in machine learning algorithms on video cards. And yes, they do do arithmetic. Also, consider LAB format images. You convert an RGB colorspace (8 bits per channel) to an LAB color space (32 bits per channel). Clearly 16 bits of floating point precision is going to be more than enough. In OpenCV, they went so far as to introduce their own custom packing for 16 bit floating point values for a *single* algorithm: stereobm. Maybe it'd be nice to avoid such hacks. – Zendel Apr 07 '19 at 15:48
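For reference, the load/compute/store pattern described in the comments above looks roughly like this with the packed F16C intrinsics (function and variable names are only illustrative; compile with -mf16c -mavx):

#include <immintrin.h>
#include <cstddef>

// Scale an array of IEEE half-precision values stored as unsigned short.
void scale_halves(unsigned short *halves, std::size_t n, float factor)
{
    const __m256 vfactor = _mm256_set1_ps(factor);
    for (std::size_t i = 0; i + 8 <= n; i += 8) {
        __m128i h = _mm_loadu_si128(reinterpret_cast<const __m128i *>(halves + i));
        __m256  f = _mm256_cvtph_ps(h);                             // the "load": 8 halfs -> 8 floats
        f = _mm256_mul_ps(f, vfactor);                              // the real work happens in float
        __m128i r = _mm256_cvtps_ph(f, _MM_FROUND_TO_NEAREST_INT);  // the "store": 8 floats -> 8 halfs
        _mm_storeu_si128(reinterpret_cast<__m128i *>(halves + i), r);
    }
    // A scalar tail (e.g. with _cvtsh_ss / _cvtss_sh) would handle the remaining n % 8 elements.
}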

3 Answers

6

I did not find a way to do so in gcc (as of gcc 8.2.0).

As for clang, in 6.0.0 the following options showed some success:

clang -cc1 -fnative-half-type -fallow-half-arguments-and-returns

The option -fnative-half-type enables the use of the __fp16 type (instead of promoting it to float), while the option -fallow-half-arguments-and-returns allows passing __fp16 by value. The ABI is non-standard, so be careful not to mix code built with different compilers.
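For illustration, the kind of code these flags accept looks roughly like this (the blend function is just a hypothetical example):

__fp16 blend(__fp16 a, __fp16 b, float t)
{
    // The compiler inserts the half <-> float conversions; the arithmetic itself
    // is done in float, and the result is converted back to __fp16 on return.
    return a + (b - a) * t;
}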

That being said, it does not provide math functions using __fp16 types (it will promote them to/from float or double).

It was sufficient for my use case.

Nonyme
  • 1,220
  • 1
  • 11
  • 22
  • 2
    There's good reason for the lack of `__fp16` math functions: x86 support for half-precision is limited to conversion to `float` ([`vcvtph2ps`](http://felixcloutier.com/x86/VCVTPH2PS.html) and the reverse, and only for SIMD vectors, not scalar). So it's useful only for reducing the cache footprint of an array at the cost of an ALU conversion when loading and storing. Even conversion to `double` takes 2 steps. You definitely don't want to be passing around `__fp16` data in registers on x86 because every computation would have to convert to float and back. – Peter Cordes Oct 31 '18 at 22:34
  • 1
    (Update: Sapphire Rapids has full scalar and SIMD support for [AVX-512 FP16](https://en.wikipedia.org/wiki/AVX-512#FP16) math instructions, as well as BF16 which appeared in some earlier CPUs too. [Half-precision floating-point arithmetic on Intel chips](https://stackoverflow.com/q/49995594)) – Peter Cordes Feb 07 '23 at 02:48
3

C++23 introduces std::float16_t

#include <stdfloat> // C++23 header providing the fixed-width floating-point types

int main()
{
    // std::float16_t (and its f16/F16 literal suffix) is only available when the
    // implementation supports it, signalled by the __STDCPP_FLOAT16_T__ macro.
    std::float16_t f = 0.1F16;
}
jpr42
  • 718
  • 3
  • 14
0

_Float16 is the type you should be looking for now in recent versions of clang and gcc.

At least in the compilers I've worked with, __fp16 was a limited type that you could only convert to/from binary32 (using hardware where supported), while _Float16 behaves more like a "real" arithmetic type; not that you should attempt too much in such limited precision.
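A minimal sketch of what that looks like (assuming a recent gcc or clang targeting x86-64 where _Float16 is available; axpy is just an example name):

#include <cstdio>

_Float16 axpy(_Float16 a, _Float16 x, _Float16 y)
{
    return a * x + y;   // each operation is rounded back to half precision
}

int main()
{
    _Float16 r = axpy((_Float16)0.5f, (_Float16)2.0f, (_Float16)1.0f);
    std::printf("%f\n", (double)r);   // printf has no half-precision conversion, so widen first
}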

MDH
  • 21
  • 2