Questions tagged [ieee-754]

IEEE 754 is the most widely used floating-point standard. Its best-known formats are single-precision binary32 (usually called float) and double-precision binary64 (usually called double).

IEEE 754 is the Institute of Electrical and Electronics Engineers (IEEE) standard for floating-point arithmetic, and the one that most hardware and languages implement.

Wikipedia on IEEE 754 (2008)
ieee.org documentation
https://en.wikipedia.org/wiki/Single-precision_floating-point_format aka binary32, usually called float or real4. Nice diagrams of the bit-pattern, and range over which it can represent every integer exactly, and so on.
https://en.wikipedia.org/wiki/Double-precision_floating-point_format usually called double or real8
Algorithm to convert an IEEE 754 double to a string? (including the recent Ryū: fast float-to-string conversion)
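
The "range over which it can represent every integer exactly" mentioned for binary32 (and the analogous range for binary64) can be checked with a short sketch. This is only an illustration: the helper name to_binary32 is made up for this example, and it assumes a platform where Python's float is binary64 and struct's "f" format is binary32, which is the common case.

```python
import struct

def to_binary32(x: float) -> float:
    """Round a binary64 value to the nearest binary32 value (hypothetical helper)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# binary32 has a 24-bit significand, so integers are exact up to 2**24.
f32_limit = 2**24
f32_exact = to_binary32(float(f32_limit))          # representable
f32_rounded = to_binary32(float(f32_limit + 1))    # 16777217 is not representable

# binary64 has a 53-bit significand, so integers are exact up to 2**53.
f64_limit = 2**53
f64_exact = float(f64_limit)                       # representable
f64_rounded = float(f64_limit + 1)                 # 9007199254740993 is not representable

print(f32_exact == f32_limit, f32_rounded == f32_limit + 1)
print(f64_exact == f64_limit, f64_rounded == f64_limit + 1)
```

Above those limits only some integers are representable (every second one at first, then every fourth, and so on), so the first odd integer past each limit is rounded to a neighbour.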

Besides formats, IEEE 754 also defines the basic operations (+, -, *, / and sqrt) as producing correctly rounded results (error <= 0.5 ulp). Other functions such as pow and sin are not required to be that accurate; their accuracy is an implementation trade-off between precision and performance.

This is why many CPU instruction sets only include the basic operations (including sqrt).
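
The 0.5 ulp guarantee for the basic operations can be spot-checked with exact rational arithmetic. The following is a minimal sketch, not a proof: the helper names ulp and within_half_ulp are invented for this example, it assumes Python floats are binary64 with round-to-nearest, and it only handles normal, nonzero results.

```python
import math
from fractions import Fraction

def ulp(x: float) -> Fraction:
    """Spacing of binary64 values around a normal, nonzero x (hypothetical helper)."""
    _, exponent = math.frexp(x)          # x = m * 2**exponent with 0.5 <= |m| < 1
    return Fraction(2) ** (exponent - 53)

def within_half_ulp(computed: float, exact: Fraction) -> bool:
    """True if the computed float is within 0.5 ulp of the exact rational result."""
    return abs(Fraction(computed) - exact) <= ulp(computed) / 2

a, b = 0.1, 0.2
fa, fb = Fraction(a), Fraction(b)        # exact values of the two doubles

checks = {
    "+": within_half_ulp(a + b, fa + fb),
    "-": within_half_ulp(a - b, fa - fb),
    "*": within_half_ulp(a * b, fa * fb),
    "/": within_half_ulp(a / b, fa / fb),
}

# sqrt(2) is irrational, so compare squares instead of the root itself.
r = math.sqrt(2.0)
half = ulp(r) / 2
sqrt_ok = (Fraction(r) - half) ** 2 <= 2 <= (Fraction(r) + half) ** 2

print(checks, sqrt_ok)
```

Note that the exact results are computed from the doubles nearest to 0.1 and 0.2, not from the decimal values themselves; the standard only promises correct rounding of the operation on its actual inputs.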

1447 questions

0 answers

On 32-bit machines, atan2 is nondeterministic when I don't store its result in a variable. Why?

Consider this piece of C code: #include #include #include bool foo(int a, int b, int c, int d) { double P = atan2(a, b); double Q = atan2(c, d); return P < Q; } bool bar(int a, int b, int c, int d) { …

asked Jul 13 '20 at 16:35

Maya

5 answers

What are the applications/benefits of an 80-bit extended precision data type?

Yeah, I meant to say 80-bit. That's not a typo... My experience with floating point variables has always involved 4-byte multiples, like singles (32 bit), doubles (64 bit), and long doubles (which I've seen referred to as either 96-bit or 128-bit).…

floating-point ieee-754 x87 long-double extended-precision

asked Mar 04 '09 at 21:25

gnovice

2 answers

Fused multiply add and default rounding modes

With GCC 5.3 the following code compiled with -O3 -mfma float mul_add(float a, float b, float c) { return a*b + c; } produces the following assembly vfmadd132ss %xmm1, %xmm2, %xmm0 ret I noticed GCC doing this with -O3 already in GCC…

c gcc clang ieee-754 fma

asked Dec 23 '15 at 12:57

Z boson

1 answer

Why does GCC yield -nan and clang and intel yield +nan for 0.0/0.0?

When I was debugging code, I found that GCC and Clang both yield nan for 0.0/0.0, which is what I was expecting, but GCC yields a NaN with the sign bit set to 1, while Clang sets it to 0 (in agreement with ICC, if I remember correctly). Now…

c++ gcc nan ieee-754

asked Aug 26 '15 at 08:04

Johannes Schaub - litb

8 answers

Fast nearest power of 2 in JavaScript?

Is there any faster alternative to the following expression: Math.pow(2,Math.floor(Math.log(x)/Math.log(2))) That is, taking the closest (smaller) integer power of 2 of a double? I have such an expression in an inner loop. I suspect it could be much…

javascript double bit-manipulation ieee-754

asked Nov 17 '14 at 03:43

MaiaVictor

2 answers

Extreme numerical values in floating-point precision in R

Can somebody please explain the following output to me? I know that it has something to do with floating point precision, but the order of magnitude (difference 1e308) surprises me. 0: high precision > 1e-324==0 [1] TRUE > 1e-323==0 [1] FALSE 1: very…

r floating-point rounding precision ieee-754

asked Jul 20 '14 at 06:12

user3370602
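
The two comparisons quoted in this excerpt (1e-324==0 and 1e-323==0) are consistent with how binary64 subnormals behave. As a rough sketch in Python rather than R, assuming both use binary64 doubles with round-to-nearest when parsing decimal literals:

```python
import math
import sys

smallest_normal = sys.float_info.min            # 2**-1022, about 2.2e-308
smallest_subnormal = math.ldexp(1.0, -1074)     # 2**-1074, about 4.9e-324

# 1e-324 is less than half of the smallest subnormal, so it rounds to zero.
rounds_to_zero = (1e-324 == 0.0)

# 1e-323 is roughly 2.02 times the smallest subnormal, so it rounds to 2 * 2**-1074.
stays_nonzero = (1e-323 != 0.0)
nearest_to_1e_323 = 2 * smallest_subnormal

print(rounds_to_zero, stays_nonzero, 1e-323 == nearest_to_1e_323)
```

Below smallest_normal the spacing between representable values is fixed at 2**-1074, so relative precision degrades gradually until values round to zero.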

1 answer

Are there any whole numbers which the double cannot represent within the MIN/MAX range of a double?

I realize that whenever one is dealing with IEEE 754 doubles and floats, some numbers can't be represented especially when one tries to represent numbers with lots of digits after the decimal point. This is well understood but I was curious if…

floating-point precision ieee-754

asked Oct 20 '13 at 02:43

Brett

3 answers

What is long double on x86-64?

Someone told me that: Under x86-64, FP arithmetic is done with SSE, and therefore long double is 64 bits. But in the x86-64 ABI it says that: C type sizeof alignment AMD64 Architecture long double 16 16 80-bit extended (IEEE-754) See:…

c linux floating-point x86-64 ieee-754

asked Mar 02 '13 at 15:45

Andrew Tomazos

4 answers

How to simulate Single precision rounding with Doubles?

I had a problem where I was trying to reconstruct the formula used in an existing system, a fairly simple formula of one input and one output: y = f(x) After a lot of puzzling, we managed to figure out the formula that fit our observed data…

floating-point double floating-accuracy ieee-754

asked Sep 23 '12 at 14:28

Ian Boyd

2 answers

On the float_precision argument to pandas.read_csv

The documentation for the argument in this post's title says: float_precision : string, default None Specifies which converter the C engine should use for floating-point values. The options are None for the ordinary converter, high for the…

python algorithm pandas floating-point ieee-754

asked Jun 22 '17 at 11:12

kjo

3 answers

Are there any modern platforms with non-IEEE C/C++ float formats?

I am writing a video game, Humm and Strumm, which requires a network component in its game engine. I can deal with differences in endianness easily, but I have hit a wall in attempting to deal with possible float memory formats. I know that modern…

c++ memory network-programming types ieee-754

asked Apr 27 '10 at 19:27

Patrick Niedzielski

3 answers

Why does table-based sin approximation literature always use this formula when another formula seems to make more sense?

The literature on computing the elementary function sin with tables refers to the formula: sin(x) = sin(Cn) * cos(h) + cos(Cn) * sin(h) where x = Cn + h, Cn is a constant for which sin(Cn) and cos(Cn) have been pre-computed and are available in a…

floating-point ieee-754 elementary-functions

asked May 16 '14 at 19:44

Pascal Cuoq

2 answers

Handling money value, is it safe to divide a number by 100?

In the repository code, in a module developed by another team, I discovered that there is a conversion of a price from cents to euro, just dividing the number by 100. The code is in JavaScript, so it uses the IEEE 754 standard. I know that is not…

javascript ieee-754 ieee

asked Mar 12 '19 at 17:34

Christian Vincenzo Traina

3 answers

Go float comparison

In order to compare two floats (float64) for equality in Go, my superficial understanding of IEEE 754 and binary representation of floats makes me think that this is a good solution: func Equal(a, b float64) bool { ba := math.Float64bits(a) …

go floating-point ieee-754

asked Dec 25 '17 at 14:11

augustzf

1 answer

Is there any IEEE 754 standard implementations for Java floating point primitives?

I'm interested if Java is using IEEE 754 standard for implementing its floating point arithmetic. Here I saw this kind of thing in documentation: operation defined in IEEE 754-2008 As I understand positive side of IEEE 754 is to increase…

java floating-point double bigdecimal ieee-754

asked Oct 28 '16 at 07:21

GROX13
