Questions tagged [ieee-754]

IEEE 754 is the most widely used floating-point standard. Its best-known formats are single-precision binary32 (aka float) and double-precision binary64 (aka double).

IEEE 754 is the Institute of Electrical and Electronics Engineers (IEEE) standard for floating-point arithmetic, and the one that most hardware and languages implement.

Wikipedia on IEEE 754 (2008)
ieee.org documentation
https://en.wikipedia.org/wiki/Single-precision_floating-point_format aka binary32, usually called float or real4. It has nice diagrams of the bit pattern, the range over which the format can represent every integer exactly, and so on.
https://en.wikipedia.org/wiki/Double-precision_floating-point_format usually called double or real8
Algorithm to convert an IEEE 754 double to a string? (the answers include the recent Ryū: fast float-to-string conversion)
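To make the binary32 layout described on the linked pages concrete, here is a minimal Python sketch. The helper names float32_fields and to_float32 are made up for this illustration; only the standard struct module is used. It splits a value into its sign, biased exponent, and fraction fields, and shows that 2^24 + 1 is the first positive integer that binary32 cannot represent exactly.

```python
import struct

def float32_fields(x):
    """Split a value, rounded to binary32, into (sign, biased exponent, fraction)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # 8 exponent bits, stored with a bias of 127
    fraction = bits & 0x7FFFFF       # 23 explicitly stored significand bits
    return sign, exponent, fraction

def to_float32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

print(float32_fields(1.0))    # (0, 127, 0)
print(float32_fields(-2.5))   # (1, 128, 2097152)

# Every integer up to 2**24 survives the round trip; 2**24 + 1 does not.
print(to_float32(float(2**24)) == 2**24)          # True
print(to_float32(float(2**24 + 1)) == 2**24 + 1)  # False
```

The 23 stored bits plus the implicit leading 1 give 24 bits of precision, which is why the exact-integer range ends at 2^24.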

Besides the formats, IEEE 754 also defines the basic operations (+, -, *, / and sqrt) as producing correctly rounded results (error <= 0.5 ulp in the default round-to-nearest mode). Other functions like pow and sin are not required to be as accurate; how accurate they are is an implementation trade-off between precision and performance.

This is why many CPU instruction sets provide only the basic operations (including sqrt) as instructions.
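The correct-rounding guarantee for the basic operations can be checked with exact rational arithmetic. The following is a rough Python sketch, not a conformance test: within_half_ulp is a hypothetical helper, math.ulp needs Python 3.9 or later, and the code assumes Python floats are binary64 evaluated in round-to-nearest mode, as on mainstream platforms.

```python
import math
from fractions import Fraction

def within_half_ulp(result, exact):
    """True if the float `result` is within 0.5 ulp of the exact rational value."""
    error = abs(Fraction(result) - exact)
    return error <= Fraction(math.ulp(result)) / 2

a, b = 0.1, 0.3
fa, fb = Fraction(a), Fraction(b)   # exact values of the two doubles

checks = {
    "+": within_half_ulp(a + b, fa + fb),
    "-": within_half_ulp(a - b, fa - fb),
    "*": within_half_ulp(a * b, fa * fb),
    "/": within_half_ulp(a / b, fa / fb),
}

# sqrt is irrational in general, so compare squares instead:
# r is within 0.5 ulp of sqrt(x) iff (r - h)^2 <= x <= (r + h)^2.
x = 2.0
r = math.sqrt(x)
h = Fraction(math.ulp(r)) / 2
checks["sqrt"] = (Fraction(r) - h) ** 2 <= Fraction(x) <= (Fraction(r) + h) ** 2

print(checks)
```

On a conforming platform every entry should be True. The same check applied to functions such as math.sin or math.pow is not guaranteed to pass, since the standard does not require them to be correctly rounded.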

1447 questions

1 answer

Interpreting a 32bit unsigned long as Single Precision IEEE-754 Float in C

I am using the XC32 compiler from Microchip, which is based on the standard C compiler. I am reading a 32-bit value from a device on an RS485 network and storing this in an unsigned long that I have typedef'ed as DWORD. i.e. typedef DWORD unsigned…

c precision point ieee-754

asked Nov 10 '15 at 15:09

Dino Alves

2 answers

Does any floating point-intensive code produce bit-exact results in any x86-based architecture?

I would like to know if any code in C or C++ using floating point arithmetic would produce bit exact results in any x86 based architecture, regardless of the complexity of the code. To my knowledge, any x86 architecture since the Intel 8087 uses a…

c x86 ieee-754 fast-math

asked Nov 26 '14 at 12:59

Samuel Navarro Lou

1,168
6
17

1 answer

How are IEEE-754 single and double precision formats determined?

I'm interested in how these are determined: single precision has 8 bits for e and the rest (23 bits) are mantissa; double precision has 11 bits for e and the rest (52 bits) are mantissa; of course there is 1 bit for sign. So how is it determined what number of bits…

precision ieee-754

asked Apr 14 '14 at 16:06

guber90

3 answers

Why is Number.MAX_VALUE 1.7976931348623157e+308 instead of 9007199254740992e+1024?

JavaScript uses IEEE 754 for storing numbers, for both integer and floating point values, where 53 bits are used for representing the mantissa and 11 bits are used for representing the exponent. The maximum value representable with a signed 53 bit…

javascript ieee-754

asked Mar 14 '14 at 10:48

Alex Mathew

3,925
5
21
25

3 answers

Convert a string with a hex representation of an IEEE-754 double into JavaScript numeric variable

Suppose I have a hex number "4072508200000000" and I want the floating point number that it represents (293.03173828125000) in IEEE-754 double format to be put into a JavaScript variable. I can think of a way that uses some masking and a call to…

javascript double hex ieee-754

asked Oct 20 '09 at 22:37

Nosredna

83,000
15
95
122

4 answers

VC++ optimisations break comparisons with NaN?

IEEE754 requires NaNs to be unordered; less than, greater than, equal etc. should all return false when one or both operands are NaN. The sample below yields the correct F F F F F T as expected when compiled using g++ at all optimisation levels, and…

c++ visual-studio-2008 visual-c++ ieee-754

asked Apr 03 '13 at 11:52

moonshadow

86,889
7
82
122

3 answers

C++ Portable Floating-Point Bit Representation?

Is there a C++ Standards compliant way to determine the structure of a 'float', 'double', and 'long double' at compile-time (or run-time, as an alternative)? If I assume std::numeric_limits< T >::is_iec559 == true and std::numeric_limits< T…

c++ floating-point portability ieee-754 bit-representation

asked Mar 08 '13 at 18:44

Charles L Wilcox

1,126
8
18

3 answers

Floating point addition: loss-of-precision issues

In short: how can I execute a+b such that any loss-of-precision due to truncation is away from zero rather than toward zero? The Long Story I'm computing the sum of a long series of floating point values for the purpose of computing the sample mean…

c# c++ floating-point ieee-754

asked Aug 10 '09 at 07:58

Eamon Nerbonne

47,023
20
101
166

6 answers

How do I save a floating-point number in 2 bytes?

Yes I'm aware of the IEEE-754 half-precision standard, and yes I'm aware of the work done in the field. Put very simply, I'm trying to save a simple floating point number (like 52.1, or 1.25) in just 2 bytes. I've tried some implementations in Java…

c# binary floating-point ieee-754 numerical

asked May 02 '12 at 13:35

Robin Rodricks

110,798
141
398
607

2 answers

addition instead of subtraction in Kahan algorithm

This is the Kahan summation algorithm from Wikipedia:

    function KahanSum(input)
        var sum = 0.0
        var c = 0.0
        for i = 1 to input.length do
            y = input[i] - c    // why subtraction?
            t = sum + y
            c = (t - sum) - y
            …

c++ floating-point sum ieee-754 rounding-error

asked Dec 09 '11 at 14:22

fredoverflow

256,549
94
388
662

2 answers

Are all single-precision numbers representable in the double-precision format?

Given an arbitrary number represented in the IEEE-754 single-precision format (commonly known as float in some languages/platforms) can I be certain that number can be represented exactly in the double-precision format as well? If so, is that…

floating-point ieee-754

asked Oct 05 '11 at 10:59

R. Martinho Fernandes

228,013
71
433
510

1 answer

Should two programs compiled with -O0 and -O2 each produce identical floating point results?

Short example: #include #include #include #define PRINTVAR(x) printVar(#x, (x) ) void printVar( const std::string_view name, const float value ) { std::cout << std::setw( 16 ) << name …

c++ gcc floating-point precision ieee-754

asked Oct 21 '22 at 08:19

gbg

2 answers

Is 128-bit "long-float" useful?

I realized the other day that most Common Lisp implementations had 128-bit "long-floats". As a result, the most positive long float is 8.8080652584198167656 * 10^646456992, while the most positive double float is 1.7976931348623157 * 10^308, which is pretty big…

floating-point precision ieee-754 128-bit

asked Jul 25 '11 at 09:13

Thaddee Tyl

1,126
1
12
17

2 answers

Some C Floating Point Constants Don't Make Sense

The constants in for Apple clang version 12.0.0 (clang-1200.0.32.2) don't seem to make sense. DBL_MIN_EXP is -1021 and DBL_MAX_EXP is 1024. However, that doesn't match what wikipedia says, "exponents range from −1022 to +1023, ..." Also…

c floating-point constants ieee-754

asked Oct 16 '20 at 20:23

Raymond Hettinger

216,523
63
388
485

1 answer

Why can a naive abs implementation not be optimized well in C++?

I was looking at how a naive implementation of abs(float) would compile and was quite surprised by the result: float abs(float x) { return x < 0 ? -x : x; } With clang 10.1 at -O3, this results in: .LCPI0_0: .long 2147483648 …

c++ optimization floating-point compiler-optimization ieee-754

asked Aug 10 '20 at 20:08

Jan Schultke

17,446
6
47
96
