Questions tagged [ieee-754]

IEEE 754 is the most widely used floating-point standard, notably its single-precision binary32 (aka float) and double-precision binary64 (aka double) formats.

IEEE 754 is the Institute of Electrical and Electronics Engineers standard for floating-point computation, and the most widely used floating-point standard.

Wikipedia on IEEE 754 (2008)
ieee.org documentation
https://en.wikipedia.org/wiki/Single-precision_floating-point_format aka binary32, usually called float or real4. Nice diagrams of the bit-pattern, and range over which it can represent every integer exactly, and so on.
https://en.wikipedia.org/wiki/Double-precision_floating-point_format usually called double or real8
Algorithm to convert an IEEE 754 double to a string? including the recent Ryū: fast float-to-string conversion
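
As a rough illustration of the binary32 bit pattern and the exact-integer range mentioned in the links above, here is a minimal Python sketch. The helper names `decode_binary32` and `round_to_binary32` are made up for this example; both rely only on the standard `struct` module.

```python
import struct

def decode_binary32(x):
    """Round x to binary32 and split it into (sign, biased exponent, fraction)."""
    # Pack as a big-endian binary32, then reinterpret the same 4 bytes as a uint32.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF       # 23 explicitly stored significand bits
    return sign, exponent, fraction

def round_to_binary32(x):
    """Return the binary32 value nearest to x, widened back to a Python float."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

print(decode_binary32(-1.75))
print(round_to_binary32(2.0**24 + 1))
```

For -1.75 this gives sign 1, biased exponent 127 and fraction 0x600000 (bit pattern 0xBFE00000). The second call shows that 2^24 + 1 is the first positive integer binary32 cannot hold: it rounds to 16777216.0.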

In addition to formats, IEEE 754 also defines the basic operations (+, -, *, / and sqrt) as producing correctly rounded results (error <= 0.5 ulp). Other functions such as pow and sin are not required to be as accurate; how they trade precision against performance is an implementation choice.

This is why many CPU instruction sets only include the basic operations (including sqrt).
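
The 0.5 ulp bound on the basic operations can be checked with exact rational arithmetic. The following is a sketch only. It assumes Python 3.9+ (for `math.ulp`), a platform whose floats are binary64 with round-to-nearest, and a `math.sqrt` backed by a correctly rounded square root. The helper name `rounding_error_in_ulps` is invented for this example.

```python
import math
from fractions import Fraction

def rounding_error_in_ulps(computed, exact):
    """Return |computed - exact| in units of ulp(computed), using exact rationals."""
    return abs(Fraction(computed) - exact) / Fraction(math.ulp(computed))

# Addition: compare the floating-point sum with the exact sum of the two doubles.
a, b = 0.1, 0.2
exact_sum = Fraction(a) + Fraction(b)
err_add = rounding_error_in_ulps(a + b, exact_sum)
print("a + b error in ulps:", err_add)

# Square root: sqrt(2) is irrational, so test the bracket instead:
# root is within half an ulp of sqrt(2) iff lo^2 <= 2 <= hi^2.
root = math.sqrt(2.0)
half_ulp = Fraction(math.ulp(root)) / 2
lo = Fraction(root) - half_ulp
hi = Fraction(root) + half_ulp
print("sqrt(2) within 0.5 ulp:", lo * lo <= 2 <= hi * hi)
```

Converting a float to `Fraction` is exact, so the comparisons introduce no rounding of their own.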

1447 questions

-1

votes

1 answer

Looking for faster method of doing this double function

I am looking to speed up a function that maps one double number to another double number. However, the function must remain the same: the same input must produce exactly the same output as before. The reason for this is we don't want to introduce…

asked Dec 04 '14 at 20:49

steviekm3

-1

votes

4 answers

C++ pow() function produces arbitrarily precise result

If double precision does not guarantee more than 16 significant decimal digits, how is such an output generated by this standard C++ program? Also small value change operations done on "ans" such as ++ans don't alter the screen output. Is the answer…

c++ ieee-754 pow arbitrary-precision

asked Aug 22 '14 at 13:23

user2126377

-1

votes

1 answer

how to convert floating-point number to IEEE 754 using assembly

Can you please help me convert a floating-point number to IEEE 754 using assembly? I have the number -1.75 and I know it is equal to -1.11000000000000000000000 E+0 in IEEE 754, but I don't know how to do the conversion in assembly.

assembly floating-point masm ieee-754

asked May 08 '14 at 11:12

Sideeq Youssef

-1

votes

1 answer

Convert binary or hexadecimal string using php into 32-bit float value. Big endian \ Little endian

How can I convert 32-bit binary string like 00111001101010000101110000100010 or hexadecimal string like 39a85c22 into float value? For use in zend framework.

php zend-framework binary hex ieee-754

asked Mar 10 '14 at 15:34

shukshin.ivan

11,075
4
53
69

-1

votes

2 answers

How tricky is floating point in storing a value in memory

Say I have to store 2147483648 as a float (not as a fixed-point number like an integer) on a 32-bit system. What will the mantissa (significand) and exponent be? And how is this number represented in memory?

floating-point ieee-754

asked Nov 18 '13 at 14:36

Parveez Ahmed

1,325
4
17
28

-1

votes

1 answer

Printing error while accessing consecutive memory locations

Following is a code to see how different data types are stored in memory. #include void newline(void) { putchar('\n'); } void showbyte(char *string, int len) { int i; for (i = 0; i < len; i++) printf("%p\t0x%.2x\n",…

c pointers memory floating-point ieee-754

asked Oct 09 '13 at 04:22

noufal

-1

votes

2 answers

Same floating point operation, different results

I really can't wrap my head around the fact that this code gives 2 results for the same formula: #include #include int main() { // std::cout.setf(std::ios::fixed, std::ios::floatfield); std::cout.precision(20); float a =…

c++ floating-point ieee-754

asked Aug 06 '13 at 08:20

user2485710

9,451
13
58
102

-1

votes

2 answers

Analyzing IEEE 754 bit patterns

I'm working on an assignment but I'm stuck. For some reason I can't get this outcome: byte order: little-endian > FFFFFFFF 0xFFFFFFFF signBit 1, expBits 255, fractBits 0x007FFFFF QNaN > 3 0x00000003 signBit 0, expBits 0, fractBits…

c floating-point ieee-754

asked Feb 10 '13 at 11:49

numbplum

-2

votes

2 answers

1.0 / 0.0 - valid statement?

I just wanted to apply an infinity load-factor to a std::set<> because I wanted to have a fixed number of buckets. So I used a load-factor of 1.0f / 0.0f because it's shorter to write than numeric_limits::infinity(). MSVC give an error…

c++ ieee-754

asked Dec 29 '21 at 20:25

Bonita Montero

2,817
9
22

-2

votes

1 answer

C# NumericUpDown with IEEE754 Single Precision Resolution

I want to be able to construct a numericUpDown field in C# where on each arrow click the value in the field increases by one resolution. For example, let's say my value is 20.0, and its IEEE 754 hexadecimal representation is 0x41a00000. If I click…

c# floating-point hex ieee-754

asked Oct 07 '20 at 06:08

efedoganay

-2

votes

1 answer

Why does casting float64 to int32 give a negative number in Go?

Occasionally, we cast float64 to int32 directly by mistake in Golang raw = 529538871408 fv = float64(raw) fmt.Println(raw) fmt.Println(fv) fmt.Println(int32(fv)) Output 529538871408 5.29538871408e+11 -2147483648 Why…

c++ go floating-point ieee-754

asked Jun 12 '20 at 05:09

zangw

43,869
19
177
214

-2

votes

1 answer

ATmega64a float to IEEE-754 unexpected result

I am trying to convert a float to an IEEE-754 Hex representation. The following code works on my Mac. #include #include union Data { int i; float f; }; int main() { float var = 502.7; union Data value; …

c floating-point ieee-754 atmega atmelstudio

asked Mar 16 '20 at 21:13

user1757006

-2

votes

1 answer

Java: IEEE Doubles to IBM Float

I am working on a side project at work where I would like to read/write SAS Transport files. The challenge is that numbers are encoded in 64-bit IBM floating point numbers. While I have been able to find plenty of great resources for reading a byte…

java ieee-754

asked Mar 13 '19 at 22:51

Travis Parks

8,435
12
52
85

-2

votes

2 answers

can somebody explain why does my first print 0, but after *p = 6.35, it can print 6.35?

#include void print_binary(int n); void test(); int main(){ test(); return 0; } void print_binary (int n){ unsigned int mask = 0; mask = ~mask^(~mask >> 1); for (; mask != 0; mask >>= 1){ putchar ((n & mask) ?…

c type-conversion printf implicit-conversion ieee-754

asked Aug 27 '18 at 09:32

M.aster

-2

votes

1 answer

How do I truncate the significand of a floating point number to an arbitrary precision in Java?

I would like to introduce some artificial precision loss into two numbers being compared to smooth out minor rounding errors so that I don't have to use the Math.abs(x - y) < eps idiom in every comparison involving x and y. Essentially, I want…

java floating-point precision ieee-754

asked Feb 11 '18 at 01:41

Kevin Jin

1,536
4
18
20
