Questions tagged [ieee-754]

IEEE 754 is the most common & widely used floating-point standard, notably the single-precision binary32 aka float and double-precision binary64 aka double formats.

IEEE 754 is the Institute of Electrical and Electronics Engineers (IEEE) standard for floating-point arithmetic, and the one most widely implemented in hardware and software.

As well as formats, IEEE 754 also requires the basic operations (+, -, *, / and sqrt) to produce correctly rounded results (error <= 0.5 ulp in round-to-nearest mode). Other functions like pow and sin are not required to be as accurate; their accuracy is an implementation trade-off between precision and performance.

This is why many CPU instruction sets provide only the basic operations (including sqrt) in hardware and leave functions like pow and sin to software libraries.
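The 0.5 ulp bound on the basic operations can be checked with exact rational arithmetic. The following is a minimal sketch, assuming CPython 3.9 or later (for `math.ulp`) on a platform whose `float` is IEEE 754 binary64; the helper name `rounding_error_in_ulps` is made up for this illustration.

```python
import math
import operator
from fractions import Fraction

def rounding_error_in_ulps(a, b, op):
    """Return |computed - exact| / ulp(computed) as an exact Fraction."""
    computed = op(a, b)                   # rounded binary64 result
    exact = op(Fraction(a), Fraction(b))  # exact rational result of the same operation
    return abs(Fraction(computed) - exact) / Fraction(math.ulp(computed))

# Every basic operation should stay within half an ulp of the exact result.
for name, op in [("+", operator.add), ("-", operator.sub),
                 ("*", operator.mul), ("/", operator.truediv)]:
    err = rounding_error_in_ulps(0.1, 0.2, op)
    print(f"0.1 {name} 0.2: error = {float(err):.3f} ulp")
```

Note that `Fraction(0.1)` is the exact value of the double nearest to 0.1, not the decimal 1/10, so the check measures only the rounding of the operation itself.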

1447 questions
12
votes
2 answers

Properties of 80-bit extended precision computations starting from double precision arguments

Here are two implementations of interpolation functions. Argument u1 is always between 0. and 1.. #include double interpol_64(double u1, double u2, double u3) { return u2 * (1.0 - u1) + u1 * u3; } double interpol_80(double u1,…
Pascal Cuoq
  • 79,187
  • 7
  • 161
  • 281
11
votes
10 answers

Can I guarantee the C++ compiler will not reorder my calculations?

I'm currently reading through the excellent Library for Double-Double and Quad-Double Arithmetic paper, and in the first few lines I notice they perform a sum in the following way: std::pair<double, double> TwoSum(double a, double b) { double s…
Mike Bailey
  • 12,479
  • 14
  • 66
  • 123
11
votes
3 answers

what languages get IEEE 754 right?

I just spent my week messing with the subject, and found no language that gets the IEEE 754 spec right. Even GCC doesn't respect the relevant C99 part (it ignores the FENV_ACCESS pragma, and I've been told that my working examples were sheer…
nraynaud
  • 4,924
  • 7
  • 39
  • 54
11
votes
4 answers

Difference in casting float to int, 32-bit C

I am currently working with old code that needs to run on a 32-bit system. During this work I stumbled across an issue that (out of academic interest) I would like to understand the cause of. It seems that casting from float to int in 32-bit C behaves…
cpaitor
  • 423
  • 1
  • 3
  • 16
11
votes
1 answer

Why does NaN exist?

I am not asking "why does this calculation result in NaN", I am asking "Why does NaN exist at all, rather than resulting in an exception or error?" I've been wondering this for a while, and discussed it with people occasionally. The only answers…
Harald Kanin
  • 133
  • 5
11
votes
1 answer

Does unary minus just change sign?

Consider for example the following double-precision numbers: x = 1232.2454545e-89; y = -1232.2454545e-89; Can I be sure that y is always exactly equal to -x (or Matlab's uminus(x))? Or should I expect small numerical differences of the order of eps…
Luis Mendo
  • 110,752
  • 13
  • 76
  • 147
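IEEE 754 specifies negation as a sign-bit operation, so no rounding is involved. As a hedged illustration of that property (not taken from the question or its answers), the sketch below compares raw bit patterns, assuming Python floats are binary64; the helper name `binary64_bits` is made up for this example.

```python
import struct

def binary64_bits(x):
    """Return the 64-bit pattern of a float as an unsigned integer."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return bits

SIGN_BIT = 1 << 63

x = 1232.2454545e-89
y = -1232.2454545e-89

# Negation leaves the exponent and fraction fields untouched.
print(binary64_bits(-x) == binary64_bits(x) ^ SIGN_BIT)
print(y == -x)
```

The same holds for zero: `-0.0` differs from `0.0` only in the sign bit, even though the two compare equal.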
11
votes
4 answers

how IEEE-754 floating point numbers work

Let's say I have this: float i = 1.5 in binary, this float is represented as: 0 01111111 10000000000000000000000 I broke up the binary to represent the 'signed', 'exponent' and 'fraction' chunks. What I don't understand is how this represents…
Tony Stark
  • 24,588
  • 41
  • 96
  • 113
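The sign, exponent and fraction split quoted in the question can be reproduced with a short sketch. It assumes the standard binary32 layout (1 sign bit, 8 exponent bits with bias 127, 23 fraction bits); the function names are made up for this illustration, and `decode_normal` handles only normal numbers, not zeros, subnormals, infinities or NaNs.

```python
import struct

def binary32_fields(x):
    """Split x, rounded to binary32, into (sign, biased exponent, fraction)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, exponent, fraction

def decode_normal(sign, exponent, fraction):
    """Rebuild the value of a normal binary32 number from its fields."""
    # The leading 1 of the significand is implicit, hence the "1 +".
    return (-1) ** sign * (1 + fraction / 2**23) * 2.0 ** (exponent - 127)

sign, exponent, fraction = binary32_fields(1.5)
print(f"{sign:01b} {exponent:08b} {fraction:023b}")
print(decode_normal(sign, exponent, fraction))
```

For 1.5 the biased exponent is 127, so the scale factor is 2^0, and the single set fraction bit contributes 0.5 on top of the implicit leading 1.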
11
votes
1 answer

Why does frexp() not yield scientific notation?

Scientific notation is the common way to express a number with an explicit order of magnitude. First a nonzero digit, then a radix point, then a fractional part, and the exponent. In binary, there is only one possible nonzero digit. Floating-point…
Potatoswatter
  • 134,909
  • 25
  • 265
  • 421
11
votes
1 answer

Fastest algorithm to identify the smallest and largest x that make the double-precision equation x + a == b true

In the context of static analysis, I am interested in determining the values of x in the then-branch of the conditional below: double x; x = …; if (x + a == b) { … a and b can be assumed to be double-precision constants (generalizing to arbitrary…
Pascal Cuoq
  • 79,187
  • 7
  • 161
  • 281
11
votes
1 answer

Floating point arithmetic and reproducibility

Is IEEE-754 arithmetic reproducible on different platforms? I was testing some code written in R, that uses random numbers. I thought that setting the seed of the random number generator on all tested platforms would make the tests reproducible,…
Gabor Csardi
  • 10,705
  • 1
  • 36
  • 53
11
votes
3 answers

Converting Int to Float or Float to Int using Bitwise operations (software floating point)

I was wondering if you could help explain the process of converting an integer to float, or a float to an integer. For my class, we are to do this using only bitwise operators, but I think a firm understanding of the casting from type to type will…
Andrew T
  • 783
  • 4
  • 11
  • 20
11
votes
1 answer

Rounding Floating Point Numbers after addition (guard, sticky, and round bits)

I haven't been able to find a good explanation of this anywhere on the web yet, so I'm hoping somebody here can explain it for me. I want to add two binary numbers by hand: 1.001₂ × 2² and 1.010,0000,0000,0000,0000,0011₂ × 2¹. I can add them no…
audiFanatic
  • 2,296
  • 8
  • 40
  • 56
11
votes
4 answers

Computing a correctly rounded / an almost correctly rounded floating-point cubic root

Suppose that correctly rounded standard library functions such as found in CRlibm are available. Then how would one compute the correctly rounded cubic root of a double-precision input? This question is not an “actual problem that [I] face”, to…
Pascal Cuoq
  • 79,187
  • 7
  • 161
  • 281
11
votes
4 answers

Python float - str - float weirdness

>>> float(str(0.65000000000000002)) 0.65000000000000002 >>> float(str(0.47000000000000003)) 0.46999999999999997 ??? What is going on here? How do I convert 0.47000000000000003 to string and the resultant value back to float? I am using…
Sharun
  • 2,030
  • 4
  • 22
  • 36
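The round-trip behaviour in this question depends on how many significant digits the string conversion keeps. The sketch below is an illustration under assumptions, not the asker's environment: it uses Python 3, where `str` and `repr` of a float both produce the shortest string that round-trips, and it emulates a shorter conversion with a 12-digit format.

```python
x = 0.47000000000000003

short = format(x, ".12g")  # 12 significant digits: too few to identify a double
full = format(x, ".17g")   # 17 significant digits always identify a binary64 value

print(short, float(short) == x)
print(full, float(full) == x)
print(repr(x), float(repr(x)) == x)
```

When an exact round trip matters, `repr(x)` or a 17-significant-digit format is the safer choice over any shorter fixed-precision format.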
11
votes
2 answers

Lua - packing IEEE754 single-precision floating-point numbers

I want to make a function in pure Lua that generates a fraction (23 bits), an exponent (8 bits), and a sign (1 bit) from a number, so that the number is approximately equal to math.ldexp(fraction, exponent - 127) * (sign == 1 and -1 or 1), and then…
RPFeltz
  • 1,049
  • 2
  • 12
  • 21