List of ARM instructions implementing half-precision floating-point arithmetic

Question

Arm Architecture Reference Manual for A-profile architecture (emphasis added):

FPHP, bits [27:24]

0b0011 As for 0b0010, and adds support for half-precision floating-point arithmetic.
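
A minimal sketch of how the quoted field could be checked in C. It assumes the register value has already been read into a 32-bit integer (how to read it is out of scope here), and the function names `fphp_field` and `has_fp16_arithmetic` are hypothetical. It only tests for the single value quoted above.

```c
#include <stdint.h>

/* Extract the FPHP field, bits [27:24], from a feature register value.
   The caller is assumed to have obtained `reg` by some other means. */
unsigned fphp_field(uint32_t reg)
{
    return (reg >> 24) & 0xFu;
}

/* Returns 1 if FPHP == 0b0011, the value quoted above as adding
   half-precision floating-point arithmetic; otherwise returns 0. */
int has_fp16_arithmetic(uint32_t reg)
{
    return fphp_field(reg) == 0x3u;
}
```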

A simple question: where can I find a list of ARM instructions implementing half-precision floating-point arithmetic?

Update: Per the Clang for Arm (armclang) documentation:

The __fp16 data type is not an arithmetic data type. The __fp16 data type is for storage and conversion only.
The _Float16 data type is an arithmetic data type. Operations on _Float16 values use half-precision arithmetic.

Hence, when using Clang for Arm I need to use `_Float16` (not `__fp16`).

Per GCC for Arm documentation:

The __fp16 type may only be used as an argument to intrinsics defined in <arm_fp16.h>, or as a storage format. For purposes of arithmetic and other operations, __fp16 values in C or C++ expressions are automatically promoted to float. It is recommended that portable code use the _Float16 type defined by ISO/IEC TS 18661-3:2015.

Hence, when using GCC for Arm I need to use `_Float16` (not `__fp16`).

However, why does GCC for Arm generate `vmul.f16` in this example from Nate Eldredge, instead of half<->float conversions followed by `vmul.f32`? Per the quote above, `__fp16` values in C or C++ expressions are automatically promoted to `float`. Why are they not promoted to `float` in this case?
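
For reference, a minimal sketch of the kind of functions being compared. The linked example is not reproduced here, so the function and type names are hypothetical. The `float` fallbacks are an assumption added only so that the file also builds on targets that lack `__fp16` or `_Float16`.

```c
/* Storage-only half type: values are promoted to float in expressions. */
#if defined(__ARM_FP16_FORMAT_IEEE)
typedef __fp16 storage_half;
#else
typedef float storage_half;   /* stand-in for targets without __fp16 */
#endif

/* Arithmetic half type: expressions are evaluated in half precision. */
#if defined(__FLT16_MANT_DIG__)
typedef _Float16 arith_half;
#else
typedef float arith_half;     /* stand-in for targets without _Float16 */
#endif

/* Half in, half out: the product is converted back to half either way. */
storage_half mul_storage(storage_half a, storage_half b) { return a * b; }
arith_half   mul_arith(arith_half a, arith_half b)       { return a * b; }

/* Half in, float out: here the type of the multiplication itself matters. */
float mul_storage_wide(storage_half a, storage_half b) { return a * b; }
float mul_arith_wide(arith_half a, arith_half b)       { return a * b; }
```

Comparing the generated assembly for the four functions, with the compiler options that come up in the comments below, is one way to see where the compiler keeps or drops the conversions.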

@artlessnoise Arm Architecture Reference Manual for A-profile architecture says "This document defines the Armv8-A and Armv9-A architecture profiles". Hence, I think that it is ARMv8. — pmor, May 15 '23 at 16:10
@artlessnoise I've already [tried](https://godbolt.org/z/xzKvGrqfx) to make GCC generate these instructions. There, instead of half<->float conversions followed by `vmul.f32`, I expect to see a `vmul.f16` (or whatever it is called) without any conversions. Any ideas? — pmor, May 15 '23 at 16:13
@artlessnoise [Here](https://github.com/microsoft/llvm/blob/master/test/MC/ARM/fullfp16-neon.s) I see the `.f16` suffixes. Hence, it seems that instructions implementing half-precision floating-point arithmetic have `.f16` suffixes. — pmor, May 15 '23 at 16:22
@pmor passing `-mfpu=neon-fp16` seems to get it further but not much — Sam Mason, May 15 '23 at 18:52
[This example](https://godbolt.org/z/6MdGxj5Mf) shows the tooling can understand `vmul.f16`. It is just the compiler sees better instructions and decides not to use them. — artless noise, May 15 '23 at 19:25
@artlessnoise: It's not clear to me why three extra conversion instructions would be "better". In ARM64 [gcc seems quite happy to use fmul directly](https://godbolt.org/z/oMGK1ehd7) so I wonder if there is something else going on. — Nate Eldredge, May 15 '23 at 19:42
@artlessnoise: Oh, I see - we still didn't have the right compiler switches to say that FP16 is supported. The conversion instructions to and from f16 are part of the base ARMv8 floating-point support, but the actual arithmetic is a separate feature. Using `-O3 -mfpu=fp-armv8 -march=armv8.2-a+fp16` we get `vmul.f16` as desired. https://godbolt.org/z/c44q54e9v — Nate Eldredge, May 15 '23 at 19:57
In the example code, conceptually we are taking two `__fp16` arguments, promoting to `float`, multiplying, and converting the result back to `__fp16`. I think it's a valid optimization to do a half-precision multiply and save all the extra conversions. You will see a difference if you change the return type to `float`: https://godbolt.org/z/Yzq58da4Y. With `__fp16` arguments we must widen to 32 bits and do a 32-bit multiply. With `_Float16` we do a 16-bit multiply and then widen the result. This all seems consistent with the documentation you quoted. — Nate Eldredge, Jun 09 '23 at 16:36
@NateEldredge Re: "valid optimization to do a half-precision multiply and save all the extra conversions": indeed, thanks! A year has passed since my last experience with FP, so I got things confused. — pmor, Jun 15 '23 at 08:44

Nate Eldredge · Accepted Answer · 2023-05-15T20:03:47.383

It's not really a separate list. When this feature is present, basically all the floating-point instructions that already exist gain support for half-precision.

In AArch64 state, you use the same floating-point instruction mnemonics, using `h` registers or vector element sizes to specify a half-precision operation. For example, `fadd h0, h1, h2` does a half-precision floating-point add (scalar), and `fadd v0.8h, v1.8h, v2.8h` does eight such adds in parallel (vector).

In AArch32 state, you use a `.f16` suffix on the mnemonic. For example, `vadd.f16 s0, s1, s2` does a scalar add (in 32-bit state the `h` register names are not used, and the result is zero-extended into the 32-bit `s` register). Or (untested) `vadd.f16 d0, d1, d2` for a four-element vector add, or `vadd.f16 q0, q2, q4` for eight elements.
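
From C, the eight-element add can be sketched with the `vaddq_f16` intrinsic from `<arm_neon.h>`, which is available when the ACLE macro `__ARM_FEATURE_FP16_VECTOR_ARITHMETIC` is defined. The function name `add8`, the type name `half_elem`, and the plain-`float` fallback for other targets are assumptions made for illustration. Whether the compiler emits exactly the vector forms shown above depends on the target options.

```c
#include <stddef.h>

#if defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)
#include <arm_neon.h>
typedef float16_t half_elem;   /* half-precision element from <arm_neon.h> */
#define HAVE_FP16_VECTOR 1
#else
typedef float half_elem;       /* stand-in type when FP16 vectors are not enabled */
#define HAVE_FP16_VECTOR 0
#endif

/* Computes dst[i] = a[i] + b[i] for i = 0..7. */
void add8(half_elem *dst, const half_elem *a, const half_elem *b)
{
#if HAVE_FP16_VECTOR
    /* Load eight half-precision lanes, add them, store the result. */
    float16x8_t va = vld1q_f16(a);
    float16x8_t vb = vld1q_f16(b);
    vst1q_f16(dst, vaddq_f16(va, vb));
#else
    /* Fallback: scalar loop on the stand-in element type. */
    for (size_t i = 0; i < 8; ++i)
        dst[i] = a[i] + b[i];
#endif
}
```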

If you really want a list of all the instruction forms added by the FP16 feature, you can skim the tables in the Instruction Set Encoding chapters of the Architecture Reference Manual and look for `FP16` in the Feature column. Or search for `(FEAT_FP16)` in the instruction descriptions chapter.

edited May 15 '23 at 20:03

answered May 15 '23 at 19:34

Nate Eldredge

Consider [this](https://godbolt.org/z/jvP4vzvfs). A simple question: why doesn't GCC generate `vfma.f16`? Is it because there is no `vfma.f16`, or because GCC does not yet have the logic to generate `vfma.f16`? – pmor May 19 '23 at 15:18
@pmor: Good question, I don't know. The docs seem pretty clear that it does exist, and is part of FP16, but I can't get either gcc or clang to generate it for either AArch32 or AArch64. – Nate Eldredge May 19 '23 at 15:41
Reported: [1](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110105), [2](https://github.com/llvm/llvm-project/issues/63090). – pmor Jun 03 '23 at 16:55
Observation: using ARM64 GCC 13.1.0 with `-mfpu=fp-armv8` leads to `error: unrecognized command-line option '-mfpu=fp-armv8'` ([demo](https://godbolt.org/z/oarh7xnc8)). – pmor Jul 04 '23 at 09:18
@pmor: Indeed, `-mfpu` is specific to ARM32, which the gcc manual makes clear. It's not needed for the ARM64 compiler because the ARM64 architecture only has one type of FPU. – Nate Eldredge Jul 04 '23 at 15:48
