Questions tagged [ieee-754]

IEEE 754 is the most widely used floating-point standard; its best-known formats are single-precision binary32 (aka float) and double-precision binary64 (aka double).

IEEE 754 is the Institute of Electrical and Electronics Engineers (IEEE) standard for floating-point computation, and the most widely used standard of its kind.

Wikipedia on IEEE 754 (2008)
ieee.org documentation
https://en.wikipedia.org/wiki/Single-precision_floating-point_format : binary32, usually called float or real4. Has nice diagrams of the bit pattern, the range over which it can represent every integer exactly, and so on.
https://en.wikipedia.org/wiki/Double-precision_floating-point_format : binary64, usually called double or real8.
Algorithm to convert an IEEE 754 double to a string? including the recent Ryū: fast float-to-string conversion
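
As a rough illustration of the binary32 bit pattern and the exact-integer range mentioned for the single-precision link above, here is a minimal Python sketch. The helper names `binary32_fields` and `to_binary32` are made up for this example; it relies only on the standard `struct` module, whose `f` format stores a value as binary32.

```python
import struct

def binary32_fields(x):
    """Round x to binary32 and split it into (sign, biased exponent, fraction)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF       # 23 explicit bits (leading 1 is implicit)
    return sign, exponent, fraction

def to_binary32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

# 0.15625 = 1.25 * 2**-3, so the biased exponent is 127 - 3 = 124
# and the fraction field encodes the .25 part.
sign, exponent, fraction = binary32_fields(0.15625)
print(sign, format(exponent, "08b"), format(fraction, "023b"))

# Every integer up to 2**24 is exactly representable in binary32;
# 2**24 + 1 is not and rounds to a neighbouring value.
print(to_binary32(float(2**24)) == 2**24)
print(to_binary32(float(2**24 + 1)) == 2**24 + 1)
```

The same approach works for binary64 with the `d` and `Q` format codes and field widths of 1, 11 and 52 bits.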

As well as formats, IEEE 754 also defines the basic operations (+ - * / and sqrt) as producing correctly rounded results (error <= 0.5 ulp in the default round-to-nearest mode). Other functions like pow and sin are not required to be as accurate; their accuracy is an implementation trade-off between precision and performance.

This is why many CPU instruction sets only include the basic operations (including sqrt).
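
The 0.5 ulp bound on the basic operations can be checked exactly with rational arithmetic. The sketch below is only an illustration, not a conformance test: it assumes Python 3.9+ (for `math.ulp`) and a platform whose `double` arithmetic and `sqrt` follow IEEE 754, and the helper names `within_half_ulp` and `sqrt_correctly_rounded` are invented for this example.

```python
import math
from fractions import Fraction

def within_half_ulp(exact, computed):
    """True if |exact - computed| <= 0.5 ulp of the computed double."""
    half_ulp = Fraction(math.ulp(computed)) / 2
    return abs(exact - Fraction(computed)) <= half_ulp

a, b = 0.1, 0.2
# Fraction(x) converts a double to its exact rational value, so the
# left-hand arguments are the mathematically exact results.
checks = {
    "add": within_half_ulp(Fraction(a) + Fraction(b), a + b),
    "sub": within_half_ulp(Fraction(a) - Fraction(b), a - b),
    "mul": within_half_ulp(Fraction(a) * Fraction(b), a * b),
    "div": within_half_ulp(Fraction(a) / Fraction(b), a / b),
}

def sqrt_correctly_rounded(x):
    """Check that the true square root of x lies within 0.5 ulp of math.sqrt(x).

    Squares the two half-ulp neighbours of the result instead of
    computing an exact square root. Intended for positive, normal x.
    """
    r = math.sqrt(x)
    half_ulp = Fraction(math.ulp(r)) / 2
    lo, hi = Fraction(r) - half_ulp, Fraction(r) + half_ulp
    return lo * lo <= Fraction(x) <= hi * hi

print(checks)
print(sqrt_correctly_rounded(2.0))
```

A similar check against `math.sin` or `math.pow` would need a higher-precision reference value, and those functions are allowed to miss the 0.5 ulp bound.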

1447 questions

votes

3 answers

Negative zero literal in golang

IEEE 754 supports negative zero. But this code: a := -0.0; fmt.Println(a, 1/a) outputs 0 +Inf, where I would have expected -0 -Inf. Other languages whose float format is based on IEEE 754 let you create negative zero literals. Java : float a =…

math floating-point go ieee-754

asked Dec 10 '12 at 15:46

Denys Séguret


votes

3 answers

IEEE Std 754 Floating-Point: let t := a - b, does the standard guarantee that a == b + t?

Assume that t,a,b are all double (IEEE Std 754) variables, and both values of a, b are NOT NaN (but may be Inf). After t = a - b, do I necessarily have a == b + t?

c++ c floating-point ieee-754

asked May 29 '12 at 00:51

updogliu


votes

2 answers

How to subtract IEEE 754 numbers?

How do I subtract IEEE 754 numbers? For example: 0,546875 - 32.875... -> 0,546875 is 0 01111110 10001100000000000000000 in IEEE-754 -> -32.875 is 1 10000111 01000101111000000000000 in IEEE-754 So how do I do the subtraction? I know I have to make…

math floating-point ieee-754

asked Jan 07 '12 at 00:29

Tiago Costa


votes

1 answer

Implementation of 32-bit floats or 64-bit longs in JavaScript?

Does anyone know of a JavaScript library that accurately implements the IEEE 754 specification for 32-bit floating-point values? I'm asking because I'm trying to write a cross-compiler in JavaScript, and since the source language has strict…

javascript floating-point long-integer ieee-754

asked Jan 26 '11 at 06:16

templatetypedef


votes

3 answers

Does floor() return something that's exactly representable?

In C89, floor() returns a double. Is the following guaranteed to work? double d = floor(3.0 + 0.5); int x = (int) d; assert(x == 3); My concern is that the result of floor might not be exactly representable in IEEE 754. So d gets something like…

c floating-point ieee-754 c89 floor

asked Jan 13 '09 at 18:33

Jim Hunziker


votes

1 answer

Why do higher-precision floating point formats have so many exponent bits?

I've been looking at floating point formats, both IEEE 754 and x87. Here's a summary: Total Bits per field Precision Bits Sign Exponent Mantissa Single 32 1 8 23 (+1 implicit) Double …

floating-point ieee-754

asked Nov 23 '16 at 23:40

Adam Haun

votes

1 answer

MSVC equivalent to GCC's -fno-finite-math-only?

On GCC, we enable -ffast-math to speed up floating-point calculations. But as we rely on proper behavior of NaN and Inf floating-point values, we also turn on -fno-finite-math-only, so that optimizations which assume values aren't NaN/Inf are not applied. For MSVC,…

c++ visual-c++ ieee-754

asked Oct 21 '16 at 16:17

R.M.


votes

1 answer

What is overflow and underflow in floating point

I feel I don't really understand the concept of overflow and underflow. I'm asking this question to clarify this. I need to understand it at its most basic level with bits. Let's work with the simplified floating point representation of 1 byte - 1…

javascript floating-point ieee-754

asked Oct 17 '16 at 09:11

Max Koretskyi


votes

2 answers

Converting hex string representation to float in python

I have data in IEEE 754 hexadecimal format: 0x1.5c28f5c28f5c3p-1 How would I convert this to a float in Python? Is there a standard module for this?

python hex ieee-754

asked Nov 30 '15 at 15:51

darktachyon

votes

2 answers

`std::sin` is wrong in the last bit

I am porting some program from Matlab to C++ for efficiency. It is important for the output of both programs to be exactly the same (**). I am facing different results for this operation: std::sin(0.497418836818383950) = 0.477158760259608410…

c++ matlab floating-point ieee-754

asked May 29 '15 at 12:45

José D.


votes

4 answers

Will this C++ convert PDP-11 floats to IEEE?

I am maintaining a program that takes data from a PDP-11 (emulated!) program and puts it into a modern Windows-based system. We are having problems with some of the data values being reported as "1.#QNAN" and also "1.#QNB". The customer has recently…

floating-point ieee-754

asked Feb 12 '10 at 09:07

user41013


votes

3 answers

What are the other NaN values?

The documentation for java.lang.Double.NaN says that it is A constant holding a Not-a-Number (NaN) value of type double. It is equivalent to the value returned by Double.longBitsToDouble(0x7ff8000000000000L). This seems to imply there are others.…

java floating-point ieee-754 nan

asked Jan 28 '10 at 12:36

Simon Nickerson


votes

7 answers

Do 64-bit floating-point numbers behave identically on all modern PCs?

I would like to know whether I can assume that the same operations on the same 64-bit floating-point numbers give exactly the same results on any modern PC and in most common programming languages (C++, Java, C#, etc.). We can assume that we are…

64-bit floating-point portability ieee-754

asked Jan 27 '10 at 20:07

peper0


votes

3 answers

How can I test for negative zero in Python?

I want to test if a number is positive or negative, especially also in the case of zero. IEEE-754 allows for -0.0, and it is implemented in Python. The only workarounds I could find were: def test_sign(x): return math.copysign(1, x) > 0 And…

python ieee-754

asked Oct 11 '13 at 11:50

quazgar


votes

4 answers

Why is a float "single precision"?

I'm curious as to why the IEEE calls a 32-bit floating-point number single precision. Was it just a means of standardization, or does 'single' actually refer to a single 'something'? Is it simply a standardized level? As in, precision level 1…

floating-point double ieee-754 single-precision

asked Jul 19 '13 at 21:05

Keith Grout
