Questions tagged [ieee-754]

IEEE 754 is the most widely used floating-point standard. Its best-known formats are single-precision binary32 (commonly called float) and double-precision binary64 (commonly called double).

IEEE 754 is the Institute of Electrical and Electronics Engineers (IEEE) standard for floating-point arithmetic, and is the most widely implemented such standard.

Besides the formats, IEEE 754 requires the basic operations (+, -, *, / and sqrt) to produce correctly rounded results (error <= 0.5 ulp in the default round-to-nearest mode). Other functions such as pow and sin are not required to be as accurate; for those, each implementation chooses its own trade-off between precision and performance.

This is one reason many CPU instruction sets provide only the basic operations (sqrt included) in hardware.
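
A minimal Python sketch of what "correctly rounded" means for the basic operations, assuming CPython floats are binary64, the default round-to-nearest mode is in effect, and Python 3.9 or later is used (for math.ulp). The helper name correctly_rounded is hypothetical: it computes the exact rational result with fractions.Fraction and rounds it once to a float, which should match what the hardware operation returns.

```python
import math
import operator
from fractions import Fraction

def correctly_rounded(op, a, b):
    """Apply op to a and b exactly, then round the exact result once to a float.

    Hypothetical helper for illustration: Fraction(a) is the exact value of
    the float a, and float(Fraction) rounds to the nearest double.
    """
    return float(op(Fraction(a), Fraction(b)))

# The hardware result of each basic operation equals the exact result
# rounded once to the nearest double.
for op in (operator.add, operator.sub, operator.mul, operator.truediv):
    for a, b in [(0.1, 0.2), (1.0, 3.0), (1e16, 1.0)]:
        assert op(a, b) == correctly_rounded(op, a, b)

# sqrt: the exact square root of 2 lies within half an ulp of math.sqrt(2.0).
# Squares are compared in exact arithmetic so the check itself does not round.
r = Fraction(math.sqrt(2.0))
half_ulp = Fraction(math.ulp(math.sqrt(2.0))) / 2
assert (r - half_ulp) ** 2 <= 2 <= (r + half_ulp) ** 2
```

No comparable check is expected to pass in general for functions like pow or sin, whose last bit may differ between math libraries.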

1447 questions
11
votes
3 answers

Negative zero literal in golang

IEEE754 supports the negative zero. But this code a := -0.0 fmt.Println(a, 1/a) outputs 0 +Inf where I would have expected -0 -Inf Other languages whose float format is based on IEEE754 let you create negative zero literals Java : float a =…
Denys Séguret
  • 372,613
  • 87
  • 782
  • 758
11
votes
3 answers

IEEE Std 754 Floating-Point: let t := a - b, does the standard guarantee that a == b + t?

Assume that t,a,b are all double (IEEE Std 754) variables, and both values of a, b are NOT NaN (but may be Inf). After t = a - b, do I necessarily have a == b + t?
updogliu
  • 6,066
  • 7
  • 37
  • 50
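
A small Python sketch related to the question above, assuming Python floats are IEEE 754 binary64. The helper name roundtrip_holds is hypothetical. It shows two ways the identity can fail: the subtraction can round away the smaller operand, and Inf - Inf produces NaN.

```python
import math

def roundtrip_holds(a, b):
    """Return True if a == b + (a - b) in double-precision arithmetic."""
    t = a - b
    return a == b + t

print(roundtrip_holds(3.0, 1.0))            # True: the subtraction is exact
print(roundtrip_holds(1e-17, 1.0))          # False: t rounds to -1.0, so b + t is 0.0
print(roundtrip_holds(math.inf, math.inf))  # False: inf - inf is NaN, and NaN != inf
```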
10
votes
2 answers

How to subtract IEEE 754 numbers?

How do I subtract IEEE 754 numbers? For example: 0.546875 - 32.875... -> 0.546875 is 0 01111110 10001100000000000000000 in IEEE-754 -> -32.875 is 1 10000111 01000101111000000000000 in IEEE-754 So how do I do the subtraction? I know I have to make…
Tiago Costa
  • 4,151
  • 12
  • 36
  • 54
10
votes
1 answer

Implementation of 32-bit floats or 64-bit longs in JavaScript?

Does anyone know of a JavaScript library that accurately implements the IEEE 754 specification for 32-bit floating-point values? I'm asking because I'm trying to write a cross-compiler in JavaScript, and since the source language has strict…
templatetypedef
  • 362,284
  • 104
  • 897
  • 1,065
10
votes
3 answers

Does floor() return something that's exactly representable?

In C89, floor() returns a double. Is the following guaranteed to work? double d = floor(3.0 + 0.5); int x = (int) d; assert(x == 3); My concern is that the result of floor might not be exactly representable in IEEE 754. So d gets something like…
Jim Hunziker
  • 14,111
  • 8
  • 58
  • 64
10
votes
1 answer

Why do higher-precision floating point formats have so many exponent bits?

I've been looking at floating point formats, both IEEE 754 and x87. Here's a summary (bits per field): Single: 32 bits total, 1 sign, 8 exponent, 23 mantissa (+1 implicit). Double …
Adam Haun
  • 359
  • 7
  • 13
10
votes
1 answer

MSVC equivalent to GCC's -fno-finite-math-only?

On GCC, we enable -ffast-math to speed up floating point calculations. But as we rely on proper behavior of NaN and Inf floating point values, we also turn on -fno-finite-math-only, so that optimization which assume values aren't NaN/Inf For MSVC,…
R.M.
  • 3,461
  • 1
  • 21
  • 41
10
votes
1 answer

What is overflow and underflow in floating point

I feel I don't really understand the concept of overflow and underflow. I'm asking this question to clarify this. I need to understand it at its most basic level with bits. Let's work with the simplified floating point representation of 1 byte - 1…
Max Koretskyi
  • 101,079
  • 60
  • 333
  • 488
10
votes
2 answers

Converting hex string representation to float in python

I have data in IEEE 754 hexadecimal format: 0x1.5c28f5c28f5c3p-1 How would I convert this to a float in Python? Is there a standard module for this?
darktachyon
  • 258
  • 1
  • 2
  • 8
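
One possible approach, sketched in Python: the built-in float type has a fromhex classmethod that parses this C99-style hexadecimal notation, and the hex method goes the other way. No extra module is needed.

```python
s = "0x1.5c28f5c28f5c3p-1"

x = float.fromhex(s)  # parse the hexadecimal significand and binary exponent
print(x)              # 0.68
print(x.hex())        # 0x1.5c28f5c28f5c3p-1
```

The round trip through hex is exact, because the hexadecimal form spells out every bit of the significand.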
10
votes
2 answers

`std::sin` is wrong in the last bit

I am porting some program from Matlab to C++ for efficiency. It is important for the output of both programs to be exactly the same (**). I am facing different results for this operation: std::sin(0.497418836818383950) = 0.477158760259608410…
José D.
  • 4,175
  • 7
  • 28
  • 47
10
votes
4 answers

Will this C++ convert PDP-11 floats to IEEE?

I am maintaining a program that takes data from a PDP-11 (emulated!) program and puts it into a modern Windows-based system. We are having problems with some of the data values being reported as "1.#QNAN" and also "1.#QNB". The customer has recently…
user41013
  • 1,251
  • 2
  • 16
  • 25
10
votes
3 answers

What are the other NaN values?

The documentation for java.lang.Double.NaN says that it is A constant holding a Not-a-Number (NaN) value of type double. It is equivalent to the value returned by Double.longBitsToDouble(0x7ff8000000000000L). This seems to imply there are others.…
Simon Nickerson
  • 42,159
  • 20
  • 102
  • 127
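
The question above is about Java, but the bit layout can be sketched in Python under the assumption that floats are binary64. Any pattern with an all-ones exponent and a nonzero fraction is a NaN, so there are many of them. The helper name double_from_bits is hypothetical, and the quiet/signaling labels follow the convention used on most current platforms (top fraction bit set means quiet).

```python
import math
import struct

def double_from_bits(bits):
    """Reinterpret a 64-bit integer as a binary64 value (hypothetical helper)."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

patterns = [
    0x7FF8000000000000,  # the quiet NaN quoted in the Java documentation
    0x7FF8000000000001,  # quiet NaN with a different payload
    0xFFF8000000000000,  # quiet NaN with the sign bit set
    0x7FF0000000000001,  # signaling NaN (top fraction bit clear, fraction nonzero)
]
for bits in patterns:
    assert math.isnan(double_from_bits(bits))

# An all-ones exponent with a zero fraction is infinity, not NaN.
assert double_from_bits(0x7FF0000000000000) == math.inf
```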
10
votes
7 answers

Does 64-bit floating point numbers behave identically on all modern PCs?

I would like to know whether I can assume that the same operations on the same 64-bit floating point numbers give exactly the same results on any modern PC and in most common programming languages (C++, Java, C#, etc.). We can assume, that we are…
peper0
  • 3,111
  • 23
  • 35
10
votes
3 answers

How can I test for negative zero in Python?

I want to test if a number is positive or negative, especially also in the case of zero. IEEE-754 allows for -0.0, and it is implemented in Python. The only workarounds I could find were: def test_sign(x): return math.copysign(1, x) > 0 And…
quazgar
  • 4,304
  • 2
  • 29
  • 41
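
A minimal Python sketch along the lines of the copysign workaround quoted above. The function name is_negative_zero is hypothetical. Comparing with == cannot distinguish the two zeros, so the sign bit is read through math.copysign.

```python
import math

def is_negative_zero(x):
    """True only for -0.0: equal to zero, but with the sign bit set."""
    return x == 0 and math.copysign(1.0, x) < 0

print(is_negative_zero(-0.0))  # True
print(is_negative_zero(0.0))   # False
print(-0.0 == 0.0)             # True: == treats both zeros as equal
```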
10
votes
4 answers

Why is a float "single precision"?

I'm curious as to why the IEEE calls a 32-bit floating-point number single precision. Was it just a means of standardization, or does 'single' actually refer to a single 'something'? Is it simply a standardized level? As in, precision level 1…
Keith Grout
  • 899
  • 11
  • 30