Questions tagged [ieee-754]

IEEE 754 is the most widely used floating-point standard. Its best-known formats are single-precision binary32 (commonly float) and double-precision binary64 (commonly double).

IEEE 754 is the Institute of Electrical and Electronics Engineers (IEEE) standard for floating-point arithmetic, and most hardware and software floating-point implementations follow it.

As well as formats, IEEE 754 defines the basic operations (+, -, *, / and sqrt) as producing correctly rounded results (error <= 0.5 ulp). Other functions such as pow and sin are not required to be as accurate; their accuracy is an implementation trade-off between precision and performance.

This is one reason many CPU instruction sets include only the basic operations (including sqrt).
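
As a rough illustration of the 0.5 ulp bound, the sketch below checks it with exact rational arithmetic. It assumes Python 3.9 or later (for math.ulp) on a platform whose float is binary64 with round-to-nearest; the helper names are made up for this example, and inputs must be finite (and x > 0 for the square root).

```python
import math
from fractions import Fraction

def add_is_correctly_rounded(a: float, b: float) -> bool:
    """True if the float sum a + b is within 0.5 ulp of the exact sum."""
    computed = a + b
    exact = Fraction(a) + Fraction(b)  # Fraction(float) converts exactly
    half_ulp = Fraction(math.ulp(computed)) / 2
    return abs(Fraction(computed) - exact) <= half_ulp

def sqrt_is_correctly_rounded(x: float) -> bool:
    """True if math.sqrt(x) is within 0.5 ulp of the true square root (x > 0)."""
    r = math.sqrt(x)
    half_ulp = Fraction(math.ulp(r)) / 2
    lo, hi = Fraction(r) - half_ulp, Fraction(r) + half_ulp
    # The true root lies in [lo, hi] exactly when lo^2 <= x <= hi^2 (lo > 0 here).
    return lo * lo <= Fraction(x) <= hi * hi

print(add_is_correctly_rounded(0.1, 0.2))  # True
print(sqrt_is_correctly_rounded(2.0))      # True
```

The same kind of check would not be expected to pass for every input of pow or sin, since those functions are allowed a larger error.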

1447 questions
9
votes
1 answer

Interpreting a 32bit unsigned long as Single Precision IEEE-754 Float in C

I am using the XC32 compiler from Microchip, which is based on the standard C compiler. I am reading a 32-bit value from a device on an RS485 network and storing this in an unsigned long that I have typedef'ed as DWORD. i.e. typedef DWORD unsigned…
Dino Alves
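
A minimal sketch of the reinterpretation this question describes, written in Python rather than the asker's C. The function name is hypothetical; the struct module packs the integer into 4 bytes and reads the same bytes back as a binary32 value.

```python
import struct

def bits_to_float32(bits: int) -> float:
    """Reinterpret a 32-bit unsigned integer as an IEEE 754 binary32 value."""
    # Pack as a little-endian uint32, then unpack the same 4 bytes as a float32.
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(bits_to_float32(0x3F800000))  # 1.0
print(bits_to_float32(0xC0000000))  # -2.0
```

In C, the usual counterpart is to copy the bytes with memcpy into a float of the same size, rather than converting the integer value with a cast.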
9
votes
2 answers

Does any floating point-intensive code produce bit-exact results in any x86-based architecture?

I would like to know if any code in C or C++ using floating point arithmetic would produce bit exact results in any x86 based architecture, regardless of the complexity of the code. To my knowledge, any x86 architecture since the Intel 8087 uses a…
Samuel Navarro Lou
9
votes
1 answer

How are IEEE-754 single and double precision formats determined?

I'm interested in how these are determined. Single precision has 8 bits for the exponent and the rest (23 bits) are mantissa; double precision has 11 bits for the exponent and the rest (52 bits) are mantissa. Of course, there is also 1 bit for the sign. So how is it determined what number of bits…
guber90
9
votes
3 answers

Why is Number.MAX_VALUE 1.7976931348623157e+308 instead of 9007199254740992e+1024?

JavaScript uses IEEE 754 for storing numbers, for both integer and floating point values, where 53 bits are used for representing the mantissa and 11 bits are used for representing the exponent. The maximum value representable with a signed 53 bit…
Alex Mathew
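
The arithmetic behind the value in this question's title can be sketched as follows. This is an illustration in Python, assuming its float is binary64 (as JavaScript numbers are); the variable names are made up for the example.

```python
import sys

# Largest finite binary64: all 53 significand bits set, at the top finite exponent (1023).
max_from_formula = (2.0 - 2.0**-52) * 2.0**1023

# The same value as an exact integer: (2**53 - 1) scaled by 2**(1023 - 52).
max_as_int = (2**53 - 1) * 2**971

print(max_from_formula == sys.float_info.max)  # True
print(int(sys.float_info.max) == max_as_int)   # True
print(sys.float_info.max)                      # 1.7976931348623157e+308
```

The exponent 1024 never appears here: in binary64 the largest finite exponent is 1023, and the exponent scales the significand by a power of two, not a power of ten.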
9
votes
3 answers

Convert a string with a hex representation of an IEEE-754 double into JavaScript numeric variable

Suppose I have a hex number "4072508200000000" and I want the floating point number that it represents (293.03173828125000) in IEEE-754 double format to be put into a JavaScript variable. I can think of a way that uses some masking and a call to…
Nosredna
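
For comparison, the conversion in this question can be sketched in Python (not the JavaScript the asker wants). The function name is hypothetical, and the sketch assumes the hex string lists the most significant byte first, as in the question's example.

```python
import struct

def hex_to_double(hex_string: str) -> float:
    """Decode 16 hex digits as a big-endian IEEE 754 binary64 value."""
    raw = bytes.fromhex(hex_string)     # 16 hex digits -> 8 bytes
    return struct.unpack(">d", raw)[0]  # most significant byte first

print(hex_to_double("4072508200000000"))  # 293.03173828125
```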
9
votes
4 answers

VC++ optimisations break comparisons with NaN?

IEEE754 requires NaNs to be unordered; less than, greater than, equal etc. should all return false when one or both operands are NaN. The sample below yields the correct F F F F F T as expected when compiled using g++ at all optimisation levels, and…
moonshadow
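
The unordered behaviour described above can be observed in any environment that follows IEEE 754 comparison semantics. The sketch below uses Python; the particular set of comparisons is an assumption chosen to mirror the F F F F F T pattern, not the asker's original sample.

```python
import math

nan = math.nan
one = 1.0

# Every ordered comparison and == is False when an operand is NaN; only != is True.
comparisons = [nan < one, nan > one, nan <= one, nan >= one, nan == nan, nan != nan]
print(comparisons)  # [False, False, False, False, False, True]
```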
9
votes
3 answers

C++ Portable Floating-Point Bit Representation?

Is there a C++ Standards compliant way to determine the structure of a 'float', 'double', and 'long double' at compile-time ( or run-time, as an alternative )? If I assume std::numeric_limits< T >::is_iec559 == true and std::numeric_limits< T…
9
votes
3 answers

Floating point addition: loss-of-precision issues

In short: how can I execute a+b such that any loss-of-precision due to truncation is away from zero rather than toward zero? The long story: I'm computing the sum of a long series of floating point values for the purpose of computing the sample mean…
Eamon Nerbonne
9
votes
6 answers

How do I save a floating-point number in 2 bytes?

Yes I'm aware of the IEEE-754 half-precision standard, and yes I'm aware of the work done in the field. Put very simply, I'm trying to save a simple floating point number (like 52.1, or 1.25) in just 2 bytes. I've tried some implementations in Java…
Robin Rodricks
8
votes
2 answers

addition instead of subtraction in Kahan algorithm

This is the Kahan summation algorithm from Wikipedia: function KahanSum(input) var sum = 0.0 var c = 0.0 for i = 1 to input.length do y = input[i] - c // why subtraction? t = sum + y c = (t - sum) - y …
fredoverflow
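
A runnable sketch of the Kahan summation pseudocode quoted in this question, in Python. The test data and the naive loop are assumptions added for illustration; math.fsum is used only as an accurate reference.

```python
import math

def kahan_sum(values):
    """Compensated (Kahan) summation of an iterable of floats."""
    total = 0.0
    c = 0.0  # running compensation
    for x in values:
        y = x - c            # remove the error carried over from the previous step
        t = total + y        # low-order bits of y may be lost in this addition
        c = (t - total) - y  # what was actually added minus what should have been added
        total = t
    return total

data = [1.0] + [1e-16] * 10

naive = 0.0
for x in data:
    naive += x

print(naive)            # 1.0, because each 1e-16 is rounded away
print(kahan_sum(data))  # slightly above 1.0
print(math.fsum(data))  # accurate reference sum
```

Because c holds the amount by which the last addition overshot (or, when negative, undershot) the intended increment, it is subtracted from the next input rather than added.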
8
votes
2 answers

Are all single-precision numbers representable in the double-precision format?

Given an arbitrary number represented in the IEEE-754 single-precision format (commonly known as float in some languages/platforms) can I be certain that number can be represented exactly in the double-precision format as well? If so, is that…
R. Martinho Fernandes
8
votes
1 answer

Should two programs compiled with -O0 and -O2 each produce identical floating point results?

Short example: #include #include #include #define PRINTVAR(x) printVar(#x, (x) ) void printVar( const std::string_view name, const float value ) { std::cout << std::setw( 16 ) << name …
gbg
8
votes
2 answers

Is 128-bit "long-float" useful?

I realized the other day that most common lisp had 128-bit "long-floats". As a result, the most positive long float is: 8.8080652584198167656 * 10^646456992 while the most positive double float is 1.7976931348623157 * 10^308, which is pretty big…
Thaddee Tyl
8
votes
2 answers

Some C Floating Point Constants Don't Make Sense

The constants in <float.h> for Apple clang version 12.0.0 (clang-1200.0.32.2) don't seem to make sense. DBL_MIN_EXP is -1021 and DBL_MAX_EXP is 1024. However, that doesn't match what wikipedia says, "exponents range from −1022 to +1023, ..." Also…
Raymond Hettinger
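
One way to see where the offset of one comes from: the C constants describe a significand in [0.5, 1), whereas the usual IEEE 754 description uses a significand in [1, 2). The sketch below uses Python, whose sys.float_info mirrors the C values on binary64 platforms, and math.frexp, which uses the same [0.5, 1) convention.

```python
import math
import sys

print(sys.float_info.min_exp, sys.float_info.max_exp)  # -1021 1024

# frexp returns (m, e) with x == m * 2**e and 0.5 <= |m| < 1.
print(math.frexp(sys.float_info.min))  # (0.5, -1021)

m, e = math.frexp(sys.float_info.max)
print(e)  # 1024
```

Writing the smallest normal double as 0.5 * 2**-1021 or as 1.0 * 2**-1022 names the same number, so both exponent ranges are consistent.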
8
votes
1 answer

Why can a naive abs implementation not be optimized well in C++?

I was looking at how a naive implementation of abs(float) would compile and was quite surprised by the result: float abs(float x) { return x < 0 ? -x : x; } With clang 10.1 at -O3, this results in: .LCPI0_0: .long 2147483648 …