Questions tagged [ieee-754]

IEEE 754 is the most widely used floating-point standard; its best-known formats are single-precision binary32 (aka float) and double-precision binary64 (aka double).

IEEE 754 is the Institute of Electrical and Electronics Engineers (IEEE) standard for floating-point arithmetic, and it is the most widely adopted standard of its kind.

In addition to formats, IEEE 754 defines the basic operations (+, -, *, / and sqrt) as producing correctly rounded results (error <= 0.5 ulp). Other functions such as pow and sin are not required to be as accurate; how accurate they are is an implementation trade-off between precision and performance.

This is why many CPU instruction sets only include the basic operations (including sqrt).
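A minimal sketch of what "correctly rounded" means for the basic operations, assuming Python floats are IEEE 754 binary64 (true on common platforms, but not guaranteed by the language). The helper name is_correctly_rounded is hypothetical; it compares each hardware result with the exact rational result rounded once to the nearest double.

```python
from fractions import Fraction


def is_correctly_rounded(a: float, b: float) -> bool:
    """Check + - * / on two finite doubles against exact rational arithmetic.

    Assumption: float is IEEE 754 binary64 and the default rounding mode
    (round-to-nearest, ties-to-even) is in effect.
    """
    ea, eb = Fraction(a), Fraction(b)  # exact values of the two doubles
    checks = [
        (a + b, ea + eb),
        (a - b, ea - eb),
        (a * b, ea * eb),
    ]
    if b != 0.0:
        checks.append((a / b, ea / eb))
    # float(Fraction) rounds the exact rational to the nearest double once,
    # which is what a correctly rounded operation must return.
    return all(got == float(exact) for got, exact in checks)


print(is_correctly_rounded(0.1, 0.2))
print(0.1 + 0.2 == 0.3)
```

Note that correct rounding applies to the operands as stored: 0.1 and 0.2 are already rounded when parsed, so 0.1 + 0.2 can be the correctly rounded sum of those two doubles and still differ from the double nearest to 0.3.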

1447 questions
10
votes
4 answers

In C, is specifying 2.0f the same as 2.000000f?

Are these lines the same? float a = 2.0f; and float a = 2.000000f;
Hải Phong
  • 5,094
  • 6
  • 31
  • 49
10
votes
3 answers

For a floating point value a: Does a*0.0 == 0.0 always evaluate true for finite values of a?

I was always assuming that the following test will always succeed for finite values (no INF, no NAN) of somefloat: assert(somefloat*0.0==0.0); In Multiply by 0 optimization it was stated that double a=0.0 and double a=-0.0 are not strictly speaking…
Martin
  • 4,738
  • 4
  • 28
  • 57
9
votes
5 answers

Why float variable saves value by cutting digits after point in a weird way?

I have this simple line of code: float val = 123456.123456; When I print this val or look at it in scope, it stores the value 123456.13. OK, that's fine: it can't store all those digits after the point in just 4 bytes, but why does it give 13 after the point?…
Kosmo零
  • 4,001
  • 9
  • 45
  • 88
9
votes
1 answer

Why does the IEEE 754 standard use a 127 bias?

When working with the excess representation of integers, I use a bias of 2^(n-1). However, the IEEE 754 standard instead uses 2^(n-1) - 1. The only benefit that I can think of is a bigger positive range. Are there any other reasons as to why that decision…
james_dean
  • 1,477
  • 6
  • 26
  • 37
9
votes
4 answers

what languages expose IEEE 754 traps to the developer?

I'd like to play with those traps for educational purposes. A common problem with the default behavior in numerical calculus is that we "miss" the NaN (or +-inf) that appeared in a wrong operation. Default behavior is propagation through the…
nraynaud
  • 4,924
  • 7
  • 39
  • 54
9
votes
1 answer

Is there any definition how floating-point values evaluated at compile-time are rounded?

Is there any definition of how floating-point values evaluated at compile time are rounded in C or C++? E.g. when I have double d = 1.0 / 3.0; ? I.e. what kind of rounding is done at compile time. And is there a definition of what's the…
Bonita Montero
  • 2,817
  • 9
  • 22
9
votes
2 answers

How many different sums can we get from very few floats?

Someone just asked why sum(myfloats) differed from sum(reversed(myfloats)). Quickly got duped to Is floating point math broken? and deleted. But it made me curious: How many different sums can we get from very few floats, just by summing them in…
no comment
  • 6,381
  • 4
  • 12
  • 30
9
votes
1 answer

Why is 5726718050568503296 truncated in JS

As per the standard ES implements numbers as IEEE754 doubles. And per https://www.binaryconvert.com/result_double.html?decimal=053055050054055049056048053048053054056053048051050057054 and other programming languages…
zerkms
  • 249,484
  • 69
  • 436
  • 539
9
votes
2 answers

Standard for the sine of very large numbers

I am writing an (almost) IEEE 854 compliant floating point implementation in TeX (which only has support for 32-bit integers). This standard only specifies the result of +, -, *, /, comparison, remainder, and sqrt: for those operations, the result…
Bruno Le Floch
  • 244
  • 4
  • 12
9
votes
2 answers

Why does isnan(x) exist if x != x gives the same result?

It is well known that for any variable of floating-point type x != x iff (if and only if) x is NaN (not-a-number). Or inverse version: x == x iff x is not NaN. Then why did WG14 decide to define isnan(x) (math.h) if the same result can be obtained…
pmor
  • 5,392
  • 4
  • 17
  • 36
9
votes
1 answer

Understanding compilation result for std::isnan

I always assumed that there is practically no difference between testing for NAN via x!=x or std::isnan(x). However, gcc provides different assembly for both versions (live on godbolt.org): ;x!=x: ucomisd %xmm0, %xmm0 movl $1, %edx …
ead
  • 32,758
  • 6
  • 90
  • 153
9
votes
1 answer

Does a floating-point reciprocal always round-trip?

For IEEE-754 arithmetic, is there a guarantee of 0 or 1 units in the last place accuracy for reciprocals? From that, is there a guaranteed error-bound on the reciprocal of a reciprocal?
Raymond Hettinger
  • 216,523
  • 63
  • 388
  • 485
9
votes
5 answers

How to alter double by its smallest increment

Is something broken, or am I failing to understand what is happening? static String getRealBinary(double val) { long tmp = Double.doubleToLongBits(val); StringBuilder sb = new StringBuilder(); for (long n = 64; --n > 0; tmp >>= 1) if…
Margus
  • 19,694
  • 14
  • 55
  • 103
9
votes
1 answer

Questions regarding operations on NaN

My SSE-FPU generates the following NaNs: When I do any basic dual operation like ADDSD, SUBSD, MULSD or DIVSD and one of the two operands is a NaN, the result has the sign of the NaN-operand and the lower 51 bits of the mantissa of the result is…
Bonita Montero
  • 2,817
  • 9
  • 22
9
votes
3 answers

Encoding and decoding IEEE 754 floats in JavaScript

I need to encode and decode IEEE 754 floats and doubles from binary in node.js to parse a network protocol. Are there any existing libraries that do this, or do I have to read the spec and implement it myself? Or should I write a C module to do it?
nornagon
  • 15,393
  • 18
  • 71
  • 85