Questions tagged [half-precision-float]

half-precision 16-bit floating point

Most uses of 16-bit floating point are of the IEEE 754 binary16 (aka half-precision) format, but other formats with different splits between exponent and significand bits are possible, such as bfloat16.

(However, related formats like posit, which have similar uses but a different binary representation, are not covered by this tag.)

The tag wiki has links to more info and lists related tags. (This tag was temporarily made a synonym of another tag, but should stay separate because half-precision is less widely implemented than float / binary32 and double / binary64.)


16-bit floating point has less precision (fewer significand aka mantissa bits) and less range (fewer exponent bits) than the widely used 32-bit single-precision IEEE 754 binary32 float or the 64-bit binary64 double. But it takes less space, reducing memory bandwidth requirements, and on some GPUs it has better throughput.
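
For reference, binary16 packs 1 sign bit, 5 exponent bits (bias 15), and 10 explicit significand bits into 16 bits, versus 8 exponent and 23 significand bits for binary32. A minimal sketch of pulling those fields apart from a raw bit pattern (the constants are the standard binary16 layout; the function name is just illustrative):

```cpp
#include <cstdint>
#include <cstdio>

// Split a binary16 bit pattern into its fields.
// Layout: bit 15 = sign, bits 14..10 = biased exponent (bias 15), bits 9..0 = significand.
void dump_half_fields(uint16_t h) {
    unsigned sign = h >> 15;
    unsigned exp  = (h >> 10) & 0x1F;  // 5 bits  -> normal range roughly 6e-5 .. 65504
    unsigned sig  = h & 0x3FF;         // 10 bits -> about 3 significant decimal digits
    std::printf("sign=%u exponent=%u significand=0x%03X\n", sign, exp, sig);
}

int main() {
    dump_half_fields(0x3C00);  // 1.0 in binary16
    dump_half_fields(0x7BFF);  // 65504, the largest finite binary16 value
}
```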

It's fairly widely supported on GPUs, but on x86 CPUs at least, support is limited to conversion to/from float (and only on CPUs that support AVX and the F16C extension, e.g. Intel starting with Ivy Bridge).
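
Those conversions are exposed as intrinsics in <immintrin.h>. A minimal sketch, assuming a GCC/Clang-style compiler and an F16C-capable CPU (build with something like -mavx -mf16c or -march=native):

```cpp
#include <immintrin.h>
#include <cstdio>

int main() {
    float in[8] = {1.0f, 0.5f, 65504.0f, 3.14159f, -2.0f, 1e-4f, 0.0f, -0.0f};

    // float -> half: 8 floats packed into 8 16-bit halves, rounding to nearest even
    __m256  f = _mm256_loadu_ps(in);
    __m128i h = _mm256_cvtps_ph(f, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);

    // half -> float: widen the 8 halves back to floats
    __m256 back = _mm256_cvtph_ps(h);

    float out[8];
    _mm256_storeu_ps(out, back);
    for (int i = 0; i < 8; ++i)
        std::printf("%g -> %g\n", in[i], out[i]);  // pi and 1e-4 come back slightly rounded
}
```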

If a CPU SIMD extension supported math on half-precision directly, it could fit twice as many elements per SIMD vector and thus roughly double the throughput of float for vectorizable tasks. But as of 2020 such support is rare on mainstream CPUs: ARMv8.2-A optionally adds half-precision arithmetic instructions, while x86 only has the conversions.

70 questions
2
votes
1 answer

Mixed precision training reports RET_CHECK failure, ShapeUtil::Equal(first_reduce->shape(), inst->shape())

New setup: 2x2080ti Nvidia driver: 430 Cuda 10.0 Cudnn 7.6 Tensorflow 1.13.1 Old setup: 2x1080ti Nvidia driver:410 Cuda 9.0 Tensorflow 1.10 I implemented a model for segmentation, it can be trained under FP32 or mixed precision (following…
Andcircle
  • 21
  • 3
2
votes
1 answer

__fp16 type undefined in GNU ARM C++

I'm trying to use the type __fp16 (half-precision float) in a program compiled with the GNU ARM C++ compiler, but whenever I try to declare a variable of this type I get an error message that __fp16 is not declared. I assume that it's caused by the fact that I…
Axel
  • 61
  • 1
  • 8
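
For what it's worth, on 32-bit ARM targets GCC only makes __fp16 available when a half-precision format is selected on the command line (e.g. -mfp16-format=ieee); on AArch64 it is available by default. A minimal sketch, assuming an arm-none-eabi-g++ style toolchain:

```cpp
// Build (32-bit ARM): arm-none-eabi-g++ -mfp16-format=ieee -c fp16_demo.cpp
// __fp16 is primarily a storage format: values are promoted to float for arithmetic.
__fp16 scale = (__fp16)0.5f;

float apply_scale(float x) {
    return x * scale;   // 'scale' is widened to float before the multiply
}
```
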
1
vote
0 answers

Using bfloat16 with C++23 on x86 CPUs with g++13

I'm trying to use bfloat16 as a format for an application for work on HPC clusters. For this I've installed g++13, which supposedly supports the bfloat16 format, but this hasn't been working consistently for me. On my local machine it works and…
Vistemboir
  • 11
  • 1
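
For context, GCC 13's libstdc++ exposes the C++23 extended floating-point types through <stdfloat>, and whether std::bfloat16_t is actually defined depends on the target. A minimal sketch of what it looks like when it is available (compile with something like g++-13 -std=c++23):

```cpp
#include <stdfloat>   // C++23 extended floating-point types, when the target supports them
#include <iostream>

int main() {
#if defined(__STDCPP_BFLOAT16_T__)
    std::bfloat16_t x = 1.5bf16;         // bf16 literal suffix
    std::bfloat16_t y = x * 2.0bf16;     // arithmetic is defined for the standard type
    std::cout << static_cast<float>(y) << '\n';   // prints 3
#else
    std::cout << "std::bfloat16_t is not available on this target\n";
#endif
}
```
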
1
vote
1 answer

Can language model inference on a CPU save memory by quantizing?

For example, according to https://cocktailpeanut.github.io/dalai/#/ the relevant figures for LLaMA-65B are: Full: The model takes up 432.64GB Quantized: 5.11GB * 8 = 40.88GB The full model won't fit in memory on even a high-end desktop…
rwallace
  • 31,405
  • 40
  • 123
  • 242
1
vote
1 answer

How to enable mixed precision training

I'm trying to train a deep learning model in VS Code, so I would like to use the GPU for that. I have CUDA 11.6, an Nvidia GeForce GTX 1650, tensorflow-gpu==2.5.0 and pip version 21.2.3 on Windows 10. The problem is whenever I run this part of code…
1
vote
1 answer

Using Half Precision Floating Point on x86 CPUs

I intend to use half-precision floating-point in my code but I am not able to figure out how to declare them. For example, I want to do something like the following: fp16 a_fp16; bfloat a_bfloat; However, the compiler does not seem to know these…
Atharva Dubey
  • 832
  • 1
  • 8
  • 25
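
With recent GCC/Clang on x86-64 the usual spellings are the ISO C extension type _Float16 and the __bf16 type (or the C++23 std::float16_t / std::bfloat16_t shown above). A minimal sketch, assuming GCC 13+ or a recent Clang on x86-64, where the arithmetic is emulated by widening to float unless native FP16 hardware (AVX512-FP16) is targeted:

```cpp
#include <cstdio>

int main() {
    _Float16 a_fp16   = (_Float16)1.25f;  // IEEE binary16: 5 exponent bits, 10 significand bits
    __bf16   a_bfloat = (__bf16)1.25f;    // bfloat16:      8 exponent bits,  7 significand bits

    // Without native FP16 hardware, the compiler widens to float, computes, and narrows back.
    _Float16 sum = a_fp16 + a_fp16;

    std::printf("%g %g\n", (double)(float)sum, (double)(float)a_bfloat);
}
```
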
1
vote
1 answer

Reading a binary structure in JavaScript

I have a table that I am trying to read in JavaScript, with data that is large enough that I would like to have it in binary format to save space. Most of the table is either numbers or enums, but there is some data that is strings. I'm trying to…
PearsonArtPhoto
  • 38,970
  • 17
  • 111
  • 142
1
vote
1 answer

Why does converting from np.float16 to np.float32 modify the value?

When converting a number from half to single floating representation I see a change in the numeric value. Here I have 65500 stored as a half precision float, but upgrading to single precision changes the underlying value to 65504, which is many…
Mikhail
  • 7,749
  • 11
  • 62
  • 136
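
The behaviour in the question above follows directly from the binary16 format: the largest finite half is 65504 = (2 − 2⁻¹⁰) · 2¹⁵, and in the top binade [32768, 65536) representable values are 32 apart, so 65500 is rounded to the nearest step, 65504, already when it is first stored as a half; converting up to float32 merely makes that visible. A small sketch of the arithmetic (no float16 type needed):

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // binary16: 10 explicit significand bits, largest finite binade starts at 2^15
    double max_half = (2.0 - std::pow(2.0, -10)) * 32768.0;  // 65504
    double spacing  = std::pow(2.0, 15 - 10);                // 32, the ULP in [32768, 65536)

    double x = 65500.0;
    double as_half = std::round(x / spacing) * spacing;      // round to the nearest representable half
    std::printf("max half = %g, spacing = %g, 65500 stored as half = %g\n",
                max_half, spacing, as_half);
}
```
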
1
vote
0 answers

Is there an implementation of the Keras Adam optimizer that supports float16?

I am currently working on deploying tiny-yolov3 with the OpenVINO toolkit and for that I need to convert my model to float16. But for that I need an optimizer that supports FP16. I tried modifying SGD to support fp16 but its accuracy is too low. So, I…
1
vote
0 answers

Failing to use TensorCore from the TensorFlow Mixed Precision Tutorial

I have followed the Mixed precision tutorial from Tensorflow: https://www.tensorflow.org/guide/keras/mixed_precision but apparently I fail to use TensorCore. My setup: Windows 10 Nvidia driver: 441.87 python: 3.7 Cuda: 10.2 Tensorflow:…
1
vote
1 answer

How do I pass half to a vertex shader?

The D3D11 Input Element Description has a field that specifies the format. How can I pass halves (e.g. DXGI_FORMAT_R16_FLOAT) to the Input Assembler when we only have float (i.e. 32-bit fp) on the CPU side?
Raildex
  • 3,406
  • 1
  • 18
  • 42
1
vote
1 answer

Encoding Numbers into IEEE754 half precision

I have a quick question about a problem I'm trying to solve. For this problem, I have to convert (0.0A)₁₆ into the IEEE 754 half precision floating point standard. I converted it to binary (0000.0000 1010), normalized it (1.010 * 2^5), encoded the…
asavyy
  • 31
  • 1
  • 9
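
As a worked example of the encoding the question above is after: (0.0A)₁₆ = 10/256 = 0.0390625 = 1.01₂ × 2⁻⁵, so the sign bit is 0, the biased exponent is −5 + 15 = 10 = 01010₂, and the significand field is 0100000000₂, giving the half-precision pattern 0x2900. A short sketch that assembles and re-checks those bits:

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>

int main() {
    // (0.0A)_16 = 0.0390625 = 1.01b * 2^-5
    uint16_t sign        = 0;
    uint16_t exponent    = (uint16_t)(-5 + 15);   // bias 15 -> 10 = 0b01010
    uint16_t significand = 0b0100000000;          // fraction bits after the implicit leading 1

    uint16_t half = (uint16_t)((sign << 15) | (exponent << 10) | significand);
    std::printf("encoded: 0x%04X\n", half);       // 0x2900

    // decode again as a sanity check
    double value = (1.0 + significand / 1024.0) * std::ldexp(1.0, exponent - 15);
    std::printf("decoded: %.7f\n", value);        // 0.0390625
}
```
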
1
vote
0 answers

Conversion Precision Error when converting IEEE Half Precision Floating Point to Decimal

I have some precision error during the conversion from the 16-bit half-precision floating point format to decimal. It accurately converts certain numbers but not others. The code was originally designed to be…
Kai
  • 31
  • 4
1
vote
1 answer

Is TensorRT "floating-point 16" precision mode non-deterministic on Jetson TX2?

I'm using TensorRT FP16 precision mode to optimize my deep learning model, and I use this optimised model on a Jetson TX2. While testing the model, I have observed that the TensorRT inference engine is not deterministic. In other words, my optimized model…
0
votes
3 answers

How to convert a float to a half type and the other way around in C

How can I convert a float (float32) to a half (float16) and the other way around in C, while accounting for edge cases like NaN, Infinity, etc.? I don't need arithmetic because I just need the types in order to fulfill the requirement of supporting…
juffma
  • 39
  • 5
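
A portable sketch for one direction of the question above (decoding a binary16 bit pattern to float, covering ±0, subnormals, Inf and NaN); the float-to-half direction is easiest via hardware conversions such as F16C's _cvtss_sh or a compiler-provided _Float16 where available. Written as plain C-style C++ under those assumptions:

```cpp
#include <cstdint>
#include <cstring>

// Decode an IEEE 754 binary16 bit pattern into a float.
// Every half value is exactly representable as a float, so no rounding is needed.
float half_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h >> 15) << 31;
    uint32_t exp  = (h >> 10) & 0x1F;
    uint32_t man  = h & 0x3FF;
    uint32_t bits;

    if (exp == 0) {
        if (man == 0) {
            bits = sign;                                           // +-0
        } else {                                                   // subnormal: normalize it
            int shift = 0;
            while ((man & 0x400) == 0) { man <<= 1; ++shift; }
            man &= 0x3FF;                                          // drop the now-explicit leading 1
            bits = sign | ((uint32_t)(113 - shift) << 23) | (man << 13);
        }
    } else if (exp == 31) {                                        // Inf or NaN (payload preserved)
        bits = sign | (0xFFu << 23) | (man << 13);
    } else {                                                       // normal: rebias 15 -> 127
        bits = sign | ((exp + 112) << 23) | (man << 13);
    }

    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```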