Questions tagged [ieee-754]

IEEE 754 is the most widely used floating-point standard, notably its single-precision binary32 (aka float) and double-precision binary64 (aka double) formats.

IEEE 754 is the Institute of Electrical and Electronics Engineers (IEEE) standard for floating-point arithmetic, and is the most widely used standard of its kind.
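As a rough illustration of the binary32 format, the sketch below splits a value into its sign bit, 8-bit biased exponent, and 23-bit fraction field. The helper name `binary32_fields` is hypothetical, and the example values -1.75 and 20.0 are chosen only for illustration.

```python
import struct

def binary32_fields(x):
    """Round x to binary32 and return (bits, sign, biased exponent, fraction).

    Hypothetical helper for illustration; not part of any library.
    """
    # Pack as a big-endian binary32, then reread the same 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF       # 23 bits, implicit leading 1 for normal numbers
    return bits, sign, exponent, fraction

bits, sign, exponent, fraction = binary32_fields(-1.75)
print(f"0x{bits:08X} sign={sign} exponent={exponent} fraction=0x{fraction:06X}")
# prints: 0xBFE00000 sign=1 exponent=127 fraction=0x600000
```

Here -1.75 is -1.11 (binary) times 2^0, so the biased exponent is 0 + 127 and the fraction field holds the bits after the implicit leading 1.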

As well as formats, IEEE 754 also defines the basic operations (+, -, *, / and sqrt) as producing correctly rounded results (error <= 0.5 ulp). Other functions, such as pow and sin, are not required to be as accurate; that is an implementation choice between precision and performance.

This is why many CPU instruction sets only include the basic operations (including sqrt).
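The 0.5 ulp bound can be checked for individual cases with exact rational arithmetic. The sketch below assumes Python 3.9 or later (for `math.ulp`) and a platform where Python floats are binary64 with correctly rounded hardware operations; `within_half_ulp` is a hypothetical helper name. It measures the ulp at the computed result, which is adequate for these examples but is a simplification right at power-of-two boundaries.

```python
from fractions import Fraction
import math

def within_half_ulp(computed, exact):
    """Return True if the float `computed` is within 0.5 ulp of the rational `exact`.

    Hypothetical helper; the ulp is taken at `computed`.
    """
    error = abs(Fraction(computed) - exact)   # Fraction(float) is exact
    return error <= Fraction(math.ulp(computed)) / 2

# Division: compare the float quotient with the exact rational 1/3.
print(within_half_ulp(1.0 / 3.0, Fraction(1, 3)))

# Addition: the operands 0.1 and 0.2 are already rounded, so the reference
# is the exact sum of the two stored binary64 values.
print(within_half_ulp(0.1 + 0.2, Fraction(0.1) + Fraction(0.2)))
```

Both checks are expected to print True on such a platform. A library function such as `math.sin` could not be checked this simply, because its exact result is generally not a rational number.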

1447 questions
-1
votes
1 answer

Looking for faster method of doing this double function

I am looking to speed up a function that maps one double number to another double number. However, the function must remain the same: the same input must produce exactly the same output as before. The reason for this is we don't want to introduce…
steviekm3
-1
votes
4 answers

C++ pow() function produces arbitrarily precise result

If double precision does not guarantee more than 16 significant decimal digits, how is such an output generated by this standard C++ program? Also small value change operations done on "ans" such as ++ans don't alter the screen output. Is the answer…
-1
votes
1 answer

how to convert floating-point number to IEEE 754 using assembly

Can you please help me convert a floating-point number to IEEE 754 using assembly? I have the number -1.75, and I know it is equal to -1.11000000000000000000000 E+0 in IEEE 754, but I don't know how to do the conversion in assembly.
Sideeq Youssef
-1
votes
1 answer

Convert binary or hexadecimal string using php into 32-bit float value. Big endian \ Little endian

How can I convert a 32-bit binary string like 00111001101010000101110000100010 or a hexadecimal string like 39a85c22 into a float value? For use in Zend Framework.
shukshin.ivan
-1
votes
2 answers

How tricky is floating point in storing a value in memory

Say I have to store 2147483648 as a float (not as a fixed-point number like an integer) on a 32-bit system. What will the mantissa (significand) and exponent be? And how is this number represented in memory?
Parveez Ahmed
-1
votes
1 answer

Printing error while accessing consecutive memory locations

Following is a code to see how different data types are stored in memory. #include void newline(void) { putchar('\n'); } void showbyte(char *string, int len) { int i; for (i = 0; i < len; i++) printf("%p\t0x%.2x\n",…
noufal
-1
votes
2 answers

Same floating point operation, different results

I really can't wrap my head around the fact that this code gives 2 results for the same formula: #include #include int main() { // std::cout.setf(std::ios::fixed, std::ios::floatfield); std::cout.precision(20); float a =…
user2485710
-1
votes
2 answers

Analyzing IEEE 754 bit patterns

I'm working on an assignment but I'm stuck. For some reason I can't get this outcome: byte order: little-endian > FFFFFFFF 0xFFFFFFFF signBit 1, expBits 255, fractBits 0x007FFFFF QNaN > 3 0x00000003 signBit 0, expBits 0, fractBits…
numbplum
-2
votes
2 answers

1.0 / 0.0 - valid statement?

I just wanted to apply an infinity load-factor to a std::set<> because I wanted to have a fixed number of buckets. So I used a load-factor of 1.0f / 0.0f because it's shorter to write than numeric_limits::infinity(). MSVC gives an error…
Bonita Montero
-2
votes
1 answer

C# NumericUpDown with IEEE754 Single Precision Resolution

I want to be able to construct a numericUpDown field in C# where on each arrow click the value in the field increases by one resolution. For example, let's say my value is 20.0, and its IEEE 754 hexadecimal representation is 0x41a00000. If I click…
efedoganay
-2
votes
1 answer

Why does casting float64 to int32 give a negative number in Go?

Occasionally, we cast float64 to int32 directly by mistake in Golang raw = 529538871408 fv = float64(raw) fmt.Println(raw) fmt.Println(fv) fmt.Println(int32(fv)) Output 529538871408 5.29538871408e+11 -2147483648 Why…
zangw
-2
votes
1 answer

ATmega64a float to IEEE-754 unexpected result

I am trying to convert a float to an IEEE-754 Hex representation. The following code works on my Mac. #include #include union Data { int i; float f; }; int main() { float var = 502.7; union Data value; …
user1757006
-2
votes
1 answer

Java: IEEE Doubles to IBM Float

I am working on a side project at work where I would like to read/write SAS Transport files. The challenge is that numbers are encoded in 64-bit IBM floating point numbers. While I have been able to find plenty of great resources for reading a byte…
Travis Parks
-2
votes
2 answers

can somebody explain why does my first print 0, but after *p = 6.35, it can print 6.35?

#include void print_binary(int n); void test(); int main(){ test(); return 0; } void print_binary (int n){ unsigned int mask = 0; mask = ~mask^(~mask >> 1); for (; mask != 0; mask >>= 1){ putchar ((n & mask) ?…
-2
votes
1 answer

How do I truncate the significand of a floating point number to an arbitrary precision in Java?

I would like to introduce some artificial precision loss into two numbers being compared to smooth out minor rounding errors so that I don't have to use the Math.abs(x - y) < eps idiom in every comparison involving x and y. Essentially, I want…
Kevin Jin