Questions tagged [ieee-754]

IEEE 754 is the most widely used floating-point standard. It notably defines the single-precision binary32 format (commonly float) and the double-precision binary64 format (commonly double).

IEEE 754 is the Institute of Electrical and Electronics Engineers (IEEE) standard for floating-point arithmetic; most hardware and software floating-point implementations follow it.

Wikipedia on IEEE 754 (2008)
ieee.org documentation
https://en.wikipedia.org/wiki/Single-precision_floating-point_format : binary32, usually called float or real4. Includes diagrams of the bit pattern and the range over which every integer is exactly representable.
https://en.wikipedia.org/wiki/Double-precision_floating-point_format : binary64, usually called double or real8.
Algorithm to convert an IEEE 754 double to a string? (includes the more recent "Ryū: fast float-to-string conversion")
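A minimal Python sketch of the binary32 layout covered by the links above, assuming the platform's C float is IEEE 754 binary32 (true on mainstream hardware). The helper names float32_fields and to_float32 are made up for this illustration. It splits a value into its sign, biased exponent and fraction fields, and shows where the run of exactly representable consecutive integers ends.

```python
import struct


def float32_fields(x):
    """Return (sign, biased_exponent, fraction) of x after rounding to binary32."""
    # Pack as a big-endian binary32, then reinterpret the 4 bytes as a 32-bit integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF        # 23 bits, implicit leading 1 for normal numbers
    return sign, exponent, fraction


def to_float32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


# 1.5 = 1.1 (binary) * 2**0: sign 0, exponent 0 + 127, fraction 100...0
print(float32_fields(1.5))        # (0, 127, 4194304)

# binary32 has a 24-bit significand, so 2**24 is exact but 2**24 + 1 is not.
print(to_float32(16777216.0))     # 16777216.0
print(to_float32(16777217.0))     # 16777216.0 (tie, rounded to even)
```

The same approach works for binary64 by switching the struct codes to ">d" and ">Q" and the field widths to 1, 11 and 52 bits.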

In addition to formats, IEEE 754 requires the basic operations (+, -, *, / and sqrt) to produce correctly rounded results (error <= 0.5 ulp). Other functions such as pow and sin are not required to be as accurate; their accuracy is an implementation trade-off between precision and performance.

This is why many CPU instruction sets provide only the basic operations (including sqrt).
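The correct-rounding requirement can be checked empirically. The sketch below is one way to do it in Python, not a definitive test. It assumes CPython 3.9 or later (for math.nextafter), binary64 floats, and that float(Fraction) rounds to nearest, which is CPython behaviour. The function names are hypothetical. Exact rational arithmetic with fractions.Fraction serves as the reference result.

```python
from fractions import Fraction
import math
import operator


def matches_exact_rounding(op, a, b):
    """True if op(a, b) on floats equals the exact result rounded once."""
    exact = op(Fraction(a), Fraction(b))  # exact rational arithmetic, no rounding
    return op(a, b) == float(exact)       # float() rounds the exact value to nearest


def sqrt_is_correctly_rounded(x):
    """True if math.sqrt(x) is a nearest double to the real square root (x > 0)."""
    r = math.sqrt(x)
    # Midpoints between r and its two neighbouring doubles, as exact rationals.
    lo = (Fraction(math.nextafter(r, 0.0)) + Fraction(r)) / 2
    hi = (Fraction(r) + Fraction(math.nextafter(r, math.inf))) / 2
    # r is a nearest double iff the true root lies between the midpoints,
    # i.e. lo**2 <= x <= hi**2 (all quantities are non-negative here).
    return lo * lo <= Fraction(x) <= hi * hi


for op in (operator.add, operator.sub, operator.mul, operator.truediv):
    print(op.__name__, matches_exact_rounding(op, 0.1, 0.3))

print("sqrt", sqrt_is_correctly_rounded(2.0))
```

On IEEE 754 hardware every line is expected to report True. Running the same kind of check against math.sin or math.pow may find inputs that are off by more than 0.5 ulp, which is consistent with the weaker requirement on those functions.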

1447 questions


2 answers

Properties of 80-bit extended precision computations starting from double precision arguments

Here are two implementations of interpolation functions. Argument u1 is always between 0. and 1.. #include double interpol_64(double u1, double u2, double u3) { return u2 * (1.0 - u1) + u1 * u3; } double interpol_80(double u1,…

c floating-point ieee-754 extended-precision

asked Dec 05 '12 at 14:48

Pascal Cuoq


10 answers

Can I guarantee the C++ compiler will not reorder my calculations?

I'm currently reading through the excellent Library for Double-Double and Quad-Double Arithmetic paper, and in the first few lines I notice they perform a sum in the following way: std::pair TwoSum(double a, double b) { double s…

c++ optimization floating-point ieee-754

asked Oct 05 '11 at 02:29

Mike Bailey


3 answers

what languages get IEEE 754 right?

I just spent my week messing with the subject, and found no language that gets the IEEE 754 spec right. Even GCC doesn't respect the relevant C99 part (it ignores the FENV_ACCESS pragma, and I've been told that my working examples were sheer…

ieee-754

asked Apr 04 '09 at 01:47

nraynaud


4 answers

Difference in casting float to int, 32-bit C

I am currently working with old code that needs to run on a 32-bit system. During this work I stumbled across an issue that (out of academic interest) I would like to understand the cause of. It seems that casting from float to int in 32-bit C behaves…

c casting floating-point ieee-754 32-bit

asked Feb 26 '19 at 08:50

cpaitor


1 answer

Why does NaN exist?

I am not asking "why does this calculation result in NaN", I am asking "Why does NaN exist at all, rather than resulting in an exception or error?" I've been wondering this for a while, and discussed it with people occasionally. The only answers…

floating-point nan ieee-754

asked Dec 11 '18 at 08:45

Harald Kanin


1 answer

Does unary minus just change sign?

Consider for example the following double-precision numbers: x = 1232.2454545e-89; y = -1232.2454545e-89; Can I be sure that y is always exactly equal to -x (or Matlab's uminus(x))? Or should I expect small numerical differences of the order of eps…

matlab language-agnostic ieee-754 numerical

asked Dec 02 '15 at 19:19

Luis Mendo


4 answers

how IEEE-754 floating point numbers work

Let's say I have this: float i = 1.5 in binary, this float is represented as: 0 01111111 10000000000000000000000 I broke up the binary to represent the 'signed', 'exponent' and 'fraction' chunks. What I don't understand is how this represents…

types floating-point ieee-754

asked Apr 25 '10 at 01:49

Tony Stark


1 answer

Why does frexp() not yield scientific notation?

Scientific notation is the common way to express a number with an explicit order of magnitude. First a nonzero digit, then a radix point, then a fractional part, and the exponent. In binary, there is only one possible nonzero digit. Floating-point…

c floating-point posix ieee-754

asked Jul 24 '14 at 08:34

Potatoswatter


1 answer

Fastest algorithm to identify the smallest and largest x that make the double-precision equation x + a == b true

In the context of static analysis, I am interested in determining the values of x in the then-branch of the conditional below: double x; x = …; if (x + a == b) { … a and b can be assumed to be double-precision constants (generalizing to arbitrary…

c floating-point ieee-754

asked Jun 14 '14 at 18:12

Pascal Cuoq


1 answer

Floating point arithmetic and reproducibility

Is IEEE-754 arithmetic reproducible on different platforms? I was testing some code written in R, that uses random numbers. I thought that setting the seed of the random number generator on all tested platforms would make the tests reproducible,…

r floating-point ieee-754

asked Jan 19 '14 at 01:56

Gabor Csardi


3 answers

Converting Int to Float or Float to Int using Bitwise operations (software floating point)

I was wondering if you could help explain the process of converting an integer to float, or a float to an integer. For my class, we are to do this using only bitwise operators, but I think a firm understanding of the casting from type to type will…

assembly floating-point arm bit-manipulation ieee-754

asked Nov 30 '13 at 16:50

Andrew T


1 answer

Rounding Floating Point Numbers after addition (guard, sticky, and round bits)

I haven't been able to find a good explanation of this anywhere on the web yet, so I'm hoping somebody here can explain it for me. I want to add two binary numbers by hand: 1.001_2 * 2^2 and 1.010,0000,0000,0000,0000,0011_2 * 2^1. I can add them no…

floating-point ieee-754

asked Oct 02 '13 at 20:21

audiFanatic


4 answers

Computing a correctly rounded / an almost correctly rounded floating-point cubic root

Suppose that correctly rounded standard library functions such as found in CRlibm are available. Then how would one compute the correctly rounded cubic root of a double-precision input? This question is not an “actual problem that [I] face”, to…

algorithm floating-point ieee-754

asked Aug 05 '13 at 17:08

Pascal Cuoq


4 answers

Python float - str - float weirdness

>>> float(str(0.65000000000000002)) 0.65000000000000002 >>> float(str(0.47000000000000003)) 0.46999999999999997 ??? What is going on here? How do I convert 0.47000000000000003 to string and the resultant value back to float? I am using…

python string floating-point floating-accuracy ieee-754

asked Nov 22 '09 at 10:28

Sharun


2 answers

Lua - packing IEEE754 single-precision floating-point numbers

I want to make a function in pure Lua that generates a fraction (23 bits), an exponent (8 bits), and a sign (1 bit) from a number, so that the number is approximately equal to math.ldexp(fraction, exponent - 127) * (sign == 1 and -1 or 1), and then…

floating-point lua ieee-754 pack

asked Jan 19 '13 at 17:11

RPFeltz

