Questions tagged [ieee-754]

IEEE 754 is the most widely used floating-point standard. Its best-known formats are single-precision binary32 (aka float) and double-precision binary64 (aka double).

IEEE 754 is the Institute of Electrical and Electronics Engineers (IEEE) standard for floating-point arithmetic, and the one that most hardware and language implementations follow.
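As a rough illustration of the binary32 layout (1 sign bit, 8 exponent bits with a bias of 127, 23 fraction bits), the Python sketch below splits a value into its three fields. The helper name `decode_binary32` is made up for this example, and the sketch assumes that `struct` packs the `f` format code as IEEE 754 binary32, which is what its standard-size mode does.

```python
import struct

def decode_binary32(x: float) -> tuple[int, int, int]:
    """Return (sign, biased_exponent, fraction) of x rounded to binary32."""
    # ">f" packs x as a big-endian IEEE 754 binary32 value (4 bytes);
    # ">I" reads the same 4 bytes back as an unsigned 32-bit integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                      # top bit
    biased_exponent = (bits >> 23) & 0xFF  # next 8 bits, bias 127
    fraction = bits & 0x7FFFFF             # low 23 bits, implicit leading 1 omitted
    return sign, biased_exponent, fraction

sign, exponent, fraction = decode_binary32(45.25)
print(sign, exponent, f"{fraction:023b}")  # 0 132 01101010000000000000000
```

Here 45.25 is 1.0110101 * 2^5 in binary, so the stored exponent is 5 + 127 = 132 and the fraction field holds the digits after the leading 1.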

In addition to formats, IEEE 754 defines the basic operations (+, -, *, / and sqrt) as producing correctly rounded results (error <= 0.5 ulp). Other functions such as pow and sin are not required to be as accurate; there, implementations trade precision against performance.

This is why many CPU instruction sets only include the basic operations (including sqrt).
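The half-ulp bound on the basic operations can be checked with exact rational arithmetic. The following is only a sketch: `within_half_ulp` is a hypothetical helper, it assumes Python floats are binary64 (true on mainstream platforms), and it needs Python 3.9 or later for `math.ulp` and `math.nextafter`.

```python
import math
from fractions import Fraction

def within_half_ulp(exact: Fraction, computed: float) -> bool:
    """Check |computed - exact| <= 0.5 ulp(computed) using exact rationals."""
    error = abs(Fraction(computed) - exact)
    return error <= Fraction(math.ulp(computed)) / 2

a, b = 1.0, 3.0
exact_quotient = Fraction(a) / Fraction(b)  # exactly 1/3, no rounding

# The hardware division result is the nearest double to 1/3.
print(within_half_ulp(exact_quotient, a / b))      # True

# The adjacent double is more than half an ulp away from 1/3.
neighbour = math.nextafter(a / b, 1.0)
print(within_half_ulp(exact_quotient, neighbour))  # False
```

The same check works for +, - and *, because the exact result of those operations on two doubles is always a rational number that `Fraction` can represent.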

1447 questions
46
votes
9 answers

Portability of binary serialization of double/float type in C++

The C++ standard does not discuss the underlying layout of float and double types, only the range of values they should represent. (This is also true for signed types: is it two's complement or something else?) My question is: What the are…
Matthieu N.
44
votes
1 answer

What would cause the C/C++ <, <=, and == operators to return true if either argument is NaN?

My understanding of the rules of IEEE-754 floating-point comparisons is that all comparison operators except != will return false if either or both arguments are NaN, while the != operator will return true. I can easily reproduce this behavior with…
Sean
  • 977
  • 6
  • 13
42
votes
5 answers

Do any real-world CPUs not use IEEE 754?

I'm optimizing a sorting function for a numerics/statistics library based on the assumption that, after filtering out any NaNs and doing a little bit twiddling, floats can be compared as 32-bit ints without changing the result and doubles can be…
dsimcha
  • 67,514
  • 53
  • 213
  • 334
42
votes
2 answers

sign changes when going from int to float and back

Consider the following code, which is an SSCCE of my actual problem: #include <iostream> int roundtrip(int x) { return int(float(x)); } int main() { int a = 2147483583; int b = 2147483584; std::cout << a << " -> " << roundtrip(a)…
fredoverflow
  • 256,549
  • 94
  • 388
  • 662
42
votes
7 answers

In binary notation, what is the meaning of the digits after the radix point "."?

I have this example on how to convert from a base 10 number to IEEE 754 float representation Number: 45.25 (base 10) = 101101.01 (base 2) Sign: 0 Normalized form N = 1.0110101 * 2^5 Exponent esp = 5 E = 5 + 127 = 132 (base 10) = 10000100 (base…
Johnny Pauling
  • 12,701
  • 18
  • 65
  • 108
41
votes
4 answers

Does the C++ standard specify anything on the representation of floating point numbers?

For types T for which std::is_floating_point<T>::value is true, does the C++ standard specify anything on the way that T should be implemented? For example, does T even have to follow a sign/mantissa/exponent representation? Or can it be completely…
Vincent
  • 57,703
  • 61
  • 205
  • 388
38
votes
4 answers

Converting IEEE 754 floating point in Haskell Word32/64 to and from Haskell Float/Double

Question In Haskell, the base libraries and Hackage packages provide several means of converting binary IEEE-754 floating point data to and from the lifted Float and Double types. However, the accuracy, performance, and portability of these methods…
acfoltzer
  • 5,588
  • 31
  • 48
38
votes
6 answers

Ranges of floating point datatype in C?

I am reading a C book, talking about ranges of floating point, the author gave the table: Type Smallest Positive Value Largest value Precision ==== ======================= ============= ========= float 1.17549 x 10^-38 …
ipkiss
  • 13,311
  • 33
  • 88
  • 123
37
votes
5 answers

Half-precision floating-point in Java

Is there a Java library anywhere that can perform computations on IEEE 754 half-precision numbers or convert them to and from double-precision? Either of these approaches would be suitable: Keep the numbers in half-precision format and compute…
finnw
  • 47,861
  • 24
  • 143
  • 221
37
votes
3 answers

The Double byte size in 32 bit and 64 bit OS

Is there a difference in double size when I run my app on 32 and 64 bit environment? If I am not mistaken the double in 32 bit environment will take up 16 digits after 0, whereas the double in 64 bit will take up 32 bit, am I right?
Graviton
  • 81,782
  • 146
  • 424
  • 602
36
votes
2 answers

Why is the square root of -Infinity +Infinity in Java?

I tried two different ways to find the square root in Java: Math.sqrt(Double.NEGATIVE_INFINITY); // NaN Math.pow(Double.NEGATIVE_INFINITY, 0.5); // Infinity Why doesn't the second way return the expected answer which is NaN (same as with the first…
Pratik
  • 908
  • 2
  • 11
  • 34
36
votes
3 answers

Is a float guaranteed to be preserved when transported through a double in C/C++?

Assuming IEEE-754 conformance, is a float guaranteed to be preserved when transported through a double? In other words, will the following assert always be satisfied? int main() { float f = some_random_float(); assert(f ==…
Kristian Spangsege
  • 2,903
  • 1
  • 20
  • 43
35
votes
5 answers

How computer does floating point arithmetic?

I have seen long articles explaining how floating point numbers can be stored and how the arithmetic of those numbers is being done, but please briefly explain why when I write cout << 1.0 / 3.0 <
Narek
  • 38,779
  • 79
  • 233
  • 389
35
votes
2 answers

How to check if C++ compiler uses IEEE 754 floating point standard

I would like to ask a question that follows this one which is pretty well answered by the define check if the compiler uses the standard. However this works for C only. Is there a way to do the same in C++? I do not wish to convert floating point…
Rusty Horse
  • 2,388
  • 7
  • 26
  • 38
34
votes
3 answers

Is there any accuracy gain when casting to double and back when doing float division?

What is the difference between the following two? float f1 = some_number; float f2 = some_near_zero_number; float result; result = f1 / f2; and: float f1 = some_number; float f2 = some_near_zero_number; float result; result = (double)f1 /…
Piotr Lopusiewicz
  • 2,514
  • 2
  • 27
  • 38