Questions tagged [ieee-754]

IEEE 754 is the most widely used floating-point standard. It defines, notably, the single-precision binary32 (aka float) and double-precision binary64 (aka double) formats.

IEEE 754 is the Institute of Electrical and Electronics Engineers (IEEE) standard for floating-point arithmetic, and it is the most widely used such standard.

Wikipedia on IEEE 754 (2008)
ieee.org documentation
https://en.wikipedia.org/wiki/Single-precision_floating-point_format aka binary32, usually called float or real4. It has nice diagrams of the bit pattern, the range over which it can represent every integer exactly, and so on.
https://en.wikipedia.org/wiki/Double-precision_floating-point_format usually called double or real8
Algorithm to convert an IEEE 754 double to a string? including the recent Ryū: fast float-to-string conversion
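
As a rough illustration of the binary32 bit pattern and the exact-integer range mentioned above, here is a minimal Python sketch. The helper names float32_fields and to_float32 are made up for this example; only the standard struct module is used, and it is assumed that the platform's float is IEEE 754 binary64 (true on essentially all current systems).

```python
import struct

def float32_fields(x):
    """Round x to binary32 and split it into (sign, biased exponent, fraction)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", float(x)))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, stored with a bias of 127
    fraction = bits & 0x7FFFFF       # 23 explicitly stored significand bits
    return sign, exponent, fraction

def to_float32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack(">f", struct.pack(">f", float(x)))[0]

print(float32_fields(1.0))                  # (0, 127, 0)
print(to_float32(2**24) == 2**24)           # True
print(to_float32(2**24 + 1) == 2**24 + 1)   # False
```

With 24 significand bits (23 stored plus one implicit), every integer up to 2^24 is representable in binary32, while 2^24 + 1 is not and gets rounded.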

Besides formats, IEEE 754 defines the basic operations +, -, *, / and sqrt as producing correctly rounded results (error <= 0.5 ulp in the default round-to-nearest mode). Other functions like pow and sin are not required to be as accurate; that is an implementation choice between precision and performance.

This is why many CPU instruction sets only include the basic operations (including sqrt).
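
The 0.5 ulp bound for the basic operations can be checked empirically. The following is a sketch, not a proof: it compares each binary64 result against the exact rational result using fractions.Fraction and math.ulp (Python 3.9+). The function names are hypothetical, and sqrt is left out because its exact result is usually irrational.

```python
from fractions import Fraction
import math

def error_in_ulps(computed, exact):
    """Return |computed - exact| in units of ulp(computed), as an exact Fraction."""
    return abs(Fraction(computed) - exact) / Fraction(math.ulp(computed))

def max_error_in_ulps(a, b):
    """Largest rounding error over a+b, a-b, a*b and a/b (b must be nonzero)."""
    fa, fb = Fraction(a), Fraction(b)
    cases = [
        (a + b, fa + fb),
        (a - b, fa - fb),
        (a * b, fa * fb),
        (a / b, fa / fb),
    ]
    return max(error_in_ulps(computed, exact) for computed, exact in cases)

# Each result should be within half an ulp of the exact value.
print(float(max_error_in_ulps(0.1, 0.2)))
```

A value of at most 0.5 is what correct rounding predicts for inputs that do not overflow or underflow. A library function such as sin or pow carries no such guarantee.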

1447 questions

4 answers

In C, is specifying 2.0f the same as 2.000000f?

Are these lines the same? float a = 2.0f; and float a = 2.000000f;

c ieee-754

asked Jan 05 '13 at 03:03

Hải Phong

3 answers

For a floating point value a: Does a*0.0 == 0.0 always evaluate true for finite values of a?

I was always assuming that the following test will always succeed for finite values (no INF, no NAN) of somefloat: assert(somefloat*0.0==0.0); In Multiply by 0 optimization it was stated that double a=0.0 and double a=-0.0 are not strictly speaking…

c++ c floating-point ieee-754

asked Dec 20 '12 at 12:31

Martin

5 answers

Why float variable saves value by cutting digits after point in a weird way?

I have this simple line of code: float val = 123456.123456; When I print val or inspect it in scope, it shows the value 123456.13. OK, that's fine, it can't store all those digits after the point in just 4 bytes, but why does it show 13 after the point?…

c++ floating-point ieee-754 floating-point-precision digits

asked Feb 17 '12 at 13:11

Kosmo零

1 answer

Why does the IEEE 754 standard use a 127 bias?

When working with the excess representation of integers, I use a bias of 2^(n-1). However, the IEEE 754 standard instead uses 2^(n-1) - 1. The only benefit that I can think of is a bigger positive range. Are there any other reasons as to why that decision…

floating-point ieee-754

asked Jan 18 '12 at 12:07

james_dean

4 answers

what languages expose IEEE 754 traps to the developer?

I'd like to play with those traps for educational purposes. A common problem with the default behavior in numerical calculus is that we "miss" the NaN (or +-inf) that appeared in a wrong operation. Default behavior is propagation through the…

floating-point ieee-754 floating-point-exceptions

asked Mar 30 '09 at 21:19

nraynaud

1 answer

Is there any definition how floating-point values evaluated at compile-time are rounded?

Is there any definition of how floating-point values evaluated at compile time are rounded in C or C++? For example, when I have double d = 1.0 / 3.0;? I.e. what kind of rounding is done at compile time. And is there a definition of what's the…

c++ c floating-point ieee-754

asked Nov 07 '21 at 17:45

Bonita Montero

2 answers

How many different sums can we get from very few floats?

Someone just asked why sum(myfloats) differed from sum(reversed(myfloats)). Quickly got duped to Is floating point math broken? and deleted. But it made me curious: How many different sums can we get from very few floats, just by summing them in…

math floating-point ieee-754 floating-accuracy

asked Sep 15 '21 at 12:17

no comment

1 answer

Why is 5726718050568503296 truncated in JS

As per the standard ES implements numbers as IEEE754 doubles. And per https://www.binaryconvert.com/result_double.html?decimal=053055050054055049056048053048053054056053048051050057054 and other programming languages…

javascript floating-point ieee-754

asked Apr 22 '21 at 23:44

zerkms

2 answers

Standard for the sine of very large numbers

I am writing an (almost) IEEE 854 compliant floating point implementation in TeX (which only has support for 32-bit integers). This standard only specifies the result of +, -, *, /, comparison, remainder, and sqrt: for those operations, the result…

floating-point decimal ieee-754

asked Jul 12 '11 at 14:02

Bruno Le Floch

2 answers

Why does isnan(x) exist if x != x gives the same result?

It is well known that for any variable of floating-point type x != x iff (if and only if) x is NaN (not-a-number). Or inverse version: x == x iff x is not NaN. Then why did WG14 decide to define isnan(x) (math.h) if the same result can be obtained…

c nan ieee-754

asked Jan 29 '21 at 21:17

pmor

1 answer

Understanding compilation result for std::isnan

I always assumed that there is practically no difference between testing for NaN via x!=x or std::isnan(x). However, gcc generates different assembly for the two versions (live on godbolt.org): ;x!=x: ucomisd %xmm0, %xmm0 movl $1, %edx …

c++ gcc optimization x86 ieee-754

asked Jul 11 '18 at 21:07

ead

1 answer

Does a floating-point reciprocal always round-trip?

For IEEE-754 arithmetic, is there a guarantee of 0 or 1 units in the last place accuracy for reciprocals? From that, is there a guaranteed error-bound on the reciprocal of a reciprocal?

floating-point precision floating-accuracy ieee-754

asked Jun 19 '17 at 06:11

Raymond Hettinger

5 answers

How to alter double by its smallest increment

Is something broken, or do I fail to understand what is happening? static String getRealBinary(double val) { long tmp = Double.doubleToLongBits(val); StringBuilder sb = new StringBuilder(); for (long n = 64; --n > 0; tmp >>= 1) if…

java double ieee-754

asked Oct 10 '10 at 13:16

Margus

1 answer

Questions regarding operations on NaN

My SSE-FPU generates the following NaNs: When I do any basic two-operand operation like ADDSD, SUBSD, MULSD or DIVSD and one of the two operands is a NaN, the result has the sign of the NaN operand and the lower 51 bits of the mantissa of the result are…

floating-point x86 sse nan ieee-754

asked Jun 18 '16 at 10:33

Bonita Montero

3 answers

Encoding and decoding IEEE 754 floats in JavaScript

I need to encode and decode IEEE 754 floats and doubles from binary in node.js to parse a network protocol. Are there any existing libraries that do this, or do I have to read the spec and implement it myself? Or should I write a C module to do it?

javascript floating-point node.js ieee-754

asked Sep 18 '10 at 08:10

nornagon

