Quadruple-precision floating-point format

In computing, quadruple precision (or quad precision) is a binary floating-point–based computer number format that occupies 16 bytes (128 bits) and offers a precision at least twice that of 53-bit double precision.

This 128-bit quadruple precision is designed not only for applications requiring results in higher than double precision, but also, as a primary function, to allow the computation of double-precision results more reliably and accurately by minimising overflow and round-off errors in intermediate calculations and scratch variables. William Kahan, primary architect of the original IEEE 754 floating-point standard, noted, "For now the 10-byte Extended format is a tolerable compromise between the value of extra-precise arithmetic and the price of implementing it to run fast; very soon two more bytes of precision will become tolerable, and ultimately a 16-byte format ... That kind of gradual evolution towards wider precision was already in view when IEEE Standard 754 for Floating-Point Arithmetic was framed."
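The benefit of wider intermediates can be illustrated with a small sketch. Python has no built-in quadruple-precision type, so the code below emulates rounding to a chosen number of significand bits with exact rationals. The helper names round_to_bits and sum_with_precision are hypothetical, the 113-bit figure is the binary128 significand precision from IEEE 754 rather than from the text above, and the exponent range is deliberately ignored, so this models round-off only, not overflow.

```python
from fractions import Fraction


def round_to_bits(x: Fraction, bits: int) -> Fraction:
    """Round x to `bits` significant binary digits, ties to even.

    Illustrative helper: the exponent range is unbounded, so overflow
    and subnormal numbers are not modelled.
    """
    if x == 0:
        return Fraction(0)
    sign = -1 if x < 0 else 1
    mag = abs(x)
    # Find e with 2**e <= mag < 2**(e + 1).
    e = mag.numerator.bit_length() - mag.denominator.bit_length()
    if Fraction(2) ** e > mag:
        e -= 1
    # Scale so the significand becomes an integer of `bits` bits,
    # then round it (round() on a Fraction rounds half to even).
    scale = Fraction(2) ** (bits - 1 - e)
    return sign * Fraction(round(mag * scale)) / scale


def sum_with_precision(values, bits: int) -> float:
    """Sum doubles left to right, rounding every partial sum to `bits`
    significand bits, then round the final result to a double."""
    acc = Fraction(0)
    for v in values:
        acc = round_to_bits(acc + Fraction(v), bits)
    return float(acc)


values = [1e16, 1.0, -1e16]
double_result = sum_with_precision(values, 53)   # double-precision intermediates
quad_result = sum_with_precision(values, 113)    # quad-precision intermediates
print(double_result, quad_result)
```

With 53-bit intermediates the 1.0 is absorbed when it is added to 1e16, so the final double is 0.0; with 113-bit intermediates the partial sum is held exactly and the final double is 1.0.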

In IEEE 754-2008, the 128-bit base-2 format is officially referred to as binary128.
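As a rough illustration of how a binary128 bit pattern maps to a value, the following sketch decodes a 128-bit integer into an exact rational. The field widths (1 sign bit, 15 exponent bits, 112 stored fraction bits) and the exponent bias of 16383 are taken from the IEEE 754 binary128 definition and are not stated in the text above; the function name decode_binary128 is hypothetical, and infinities and NaNs are rejected rather than represented.

```python
from fractions import Fraction

EXP_BITS = 15      # exponent field width (IEEE 754 binary128)
FRAC_BITS = 112    # stored fraction bits; the leading 1 is implicit
BIAS = 16383       # exponent bias, 2**(EXP_BITS - 1) - 1


def decode_binary128(word: int) -> Fraction:
    """Decode a 128-bit pattern, given as an int, into its exact value.

    Illustrative helper: raises ValueError for infinities and NaNs.
    """
    if not 0 <= word < (1 << 128):
        raise ValueError("expected a 128-bit unsigned integer")
    sign = word >> 127
    exponent = (word >> FRAC_BITS) & ((1 << EXP_BITS) - 1)
    fraction = word & ((1 << FRAC_BITS) - 1)
    if exponent == (1 << EXP_BITS) - 1:
        raise ValueError("infinity or NaN")
    if exponent == 0:
        # Zero or subnormal: no implicit leading 1, fixed exponent.
        magnitude = Fraction(fraction) * Fraction(2) ** (1 - BIAS - FRAC_BITS)
    else:
        # Normal number: restore the implicit leading 1.
        significand = (1 << FRAC_BITS) | fraction
        magnitude = Fraction(significand) * Fraction(2) ** (exponent - BIAS - FRAC_BITS)
    return -magnitude if sign else magnitude


one = decode_binary128(0x3FFF << 112)            # exponent field equals the bias
minus_two = decode_binary128(0xC000 << 112)      # sign bit set, exponent bias + 1
next_after_one = decode_binary128((0x3FFF << 112) | 1)
```

The last value differs from 1 by 2^-112, the spacing of binary128 numbers just above 1.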

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.

