Questions tagged [single-precision]

57 questions
0
votes
1 answer

Writing floating point numbers in single precision to file in R?

I understand that R doesn't have a floating point format in single precision. However, I'm writing a very large number of data points from R to file, and I'd like to store them as single precision floats, rather than double precision. I don't need…
0
votes
0 answers

MIPS single to double floating point precision

I have this program in MIPS and I want to change it to double precision. It looks like the single and double precision floating-point instructions are the same, except that they end in .d instead of .s. If anyone has comments or help it would help me out a…
0
votes
1 answer

Binary operations on NumPy scalars automatically up-cast to float64

I want to do binary operations (like add and multiply) between np.float32 and builtin Python int and float and get a np.float32 as the return type. However, it gets automatically up-casted to a np.float64. Example code: >>> a = np.float32(5) >>>…
jmd_dk
0
votes
1 answer

Integer to single precision conversion in Python

I can't figure out how to convert an integer to single precision using Python. I already tried NumPy, but the float32() function doesn't help. Example: 984761996 -> 1.360135E-3
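
The example value in the excerpt suggests the goal is to reinterpret the integer's 32 bits as an IEEE 754 single, rather than to convert its numeric value. A minimal sketch of that reading, using only the standard `struct` module (the helper name `int_bits_to_float32` is made up for illustration and does not come from the question):

```python
import struct

def int_bits_to_float32(n: int) -> float:
    """Reinterpret the 32-bit pattern of an unsigned integer as a float32."""
    # Pack as a 4-byte unsigned int, then unpack the same bytes as a float.
    return struct.unpack("<f", struct.pack("<I", n))[0]

value = int_bits_to_float32(984761996)
print(f"{value:.6E}")
```

If that reading is right, the printed value should agree with the 1.360135E-3 quoted in the question. `np.float32(984761996)`, by contrast, converts the numeric value, which would explain why it "doesn't help".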
0
votes
1 answer

MIPS - How to Convert a set of Integers into Single-Precision Floats

I'm really having a difficult time figuring out how to approach this problem. I get that I want to take the binary representation of both the integer and fraction, combine them for the mantissa, and assign the sign bit to the beginning, but I don't…
0
votes
2 answers

Accuracy of c_k = a + ( N + k ) * b

a, b are 32 bit floating point values, N is a 32 bit integer and k can take on values 0, 1, 2, ... M. Need to calculate c_k = a + ( N + k ) * b; The operations need to be 32 bit operations (not double precision). The concern is accuracy -- which…
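
One way to probe the accuracy concern is to emulate float32 rounding after every operation and compare the result with exact rational arithmetic. The sketch below does this with the standard library only. The helper names and the sample values of a, b, N and k are hypothetical and are not taken from the question.

```python
import struct
from fractions import Fraction

def f32(x: float) -> float:
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def c_k_float32(a: float, b: float, n: int, k: int) -> float:
    """Evaluate a + (n + k) * b, rounding to float32 after each step.

    Assumes a and b are already float32 values and n + k fits in 32 bits.
    """
    t = f32(float(n + k))  # integer sum is exact; the conversion may round
    p = f32(t * b)         # product of two float32 values is exact in a double
    return f32(a + p)

def c_k_exact(a: float, b: float, n: int, k: int) -> Fraction:
    """Reference value computed without any rounding."""
    return Fraction(a) + (n + k) * Fraction(b)

# Hypothetical sample inputs, rounded to float32 up front.
a, b = f32(0.1), f32(0.01)
n = 2**24 + 1  # odd and above 2**24, so it is not representable in float32

for k in range(3):
    approx = c_k_float32(a, b, n, k)
    error = Fraction(approx) - c_k_exact(a, b, n, k)
    print(k, approx, float(error))
```

The printed error column shows how far each float32 result sits from the exact value for these particular inputs. Other evaluation orders can be compared by adding further functions in the same style.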
-1
votes
3 answers

Byte[] to float conversion

Float b = 0.995; Byte[] a = BitConverter.GetBytes(b); Now my byte[] values are 82 184 126 63, i.e., a[0] = 82, a[1] = 184, a[2] = 126, and a[3] = 63. I want to convert the above bytes back to a float. So I used BitConverter.ToSingle Float b =…
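
As a cross-check of the byte values quoted above, here is a short Python sketch that decodes them as a little-endian float32. The byte order is an assumption about the asker's platform, and the variable names are illustrative.

```python
import struct

raw = bytes([82, 184, 126, 63])  # byte values quoted in the question

# "<f" means little-endian float32 (assumed to match the asker's machine).
(value,) = struct.unpack("<f", raw)
print(value)

# Round trip: packing 0.995 as float32 reproduces the same four bytes.
assert struct.pack("<f", 0.995) == raw
```

The decoded value is the float32 nearest to 0.995, so the bytes in the question are consistent with the original number.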
-1
votes
3 answers

How many different values can be encoded in IEEE 754 32-bit base-2 floating-point system?

The Wikipedia page states that an IEEE 754 32-bit base-2 floating-point variable has a maximum value of (2 − 2^{−23}) × 2^{127} ≈ 3.4028235 × 10^{38}. In that number, are +∞, −∞ and NaN included? What is that 2 in "(2 − 2^{−23})"? Why 127 in 2^{127}?
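
The formula in the excerpt can be checked numerically, and the 32-bit patterns can be counted by class. The following is a sketch under the standard binary32 layout (1 sign bit, 8 exponent bits, 23 fraction bits). The variable names are illustrative.

```python
import struct

# Largest finite float32: all-ones significand (2 - 2**-23) scaled by the
# largest unbiased exponent, 127. Exponent field 255 is reserved for
# infinities and NaNs, so it never appears in a finite value.
max_from_formula = (2 - 2**-23) * 2.0**127

# The same value decoded from its bit pattern 0x7F7FFFFF.
max_from_bits = struct.unpack(">f", bytes.fromhex("7f7fffff"))[0]
assert max_from_formula == max_from_bits

# Counting 32-bit patterns by class.
total_patterns = 2**32
infinities = 2                    # +inf and -inf (exponent 255, fraction 0)
nan_patterns = 2 * (2**23 - 1)    # exponent 255, nonzero fraction
finite_patterns = total_patterns - infinities - nan_patterns  # includes +0 and -0

print(max_from_formula, finite_patterns)
```

Under this layout the maximum is a finite value, so ±∞ and NaN are separate encodings rather than part of that number.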
Joshua Leung
-1
votes
2 answers

Decimal to IEEE 754 single-precision code using C

We have an assignment in class to convert from decimal to single precision using C and I'm completely lost. This is the assignment: The last part of this lab involves coding a short C algorithm. Every student must create a program that gets a…
-1
votes
1 answer

Reinterpret bytes as float in C (IEEE 754 single-precision binary)

I want to reinterpret 4 bytes as IEEE 754 single-precision binary in C. To obtain the bytes that represent float, I used: num = *(uint32_t*)&MyFloatNumber; aux[0] = num & 0xFF; aux[1] = (num >> 8) & 0xFF; aux[2] = (num >> 16) & 0xFF; aux[3] = (num…
Guilherme
-1
votes
1 answer

How is single precision floating point number subtraction done?

Here is the example (I have converted them to decimal in advance). A is 01000001000010000000000000000000_2 (in decimal 8.5) B is 01000000000100000000000000000000_2 (in decimal 2.25) The ((+A)-(+B)) should be 6.25 in decimal. Normalizing A and B and…
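
The align, subtract and normalize steps can be sketched on the two bit patterns from the excerpt. This is a simplified illustration and not a full IEEE 754 implementation: it assumes both inputs are positive normal numbers with A > B, and it ignores rounding, which is not needed for this example. All function names are hypothetical.

```python
import struct

def float_to_bits(x: float) -> str:
    """32-character binary string of x encoded as a float32."""
    return format(int.from_bytes(struct.pack(">f", x), "big"), "032b")

def bits_to_float(bits: str) -> float:
    """Decode a 32-character binary string as a float32."""
    return struct.unpack(">f", int(bits, 2).to_bytes(4, "big"))[0]

def split(bits: str):
    """Return (biased exponent, 24-bit significand with the hidden 1 restored)."""
    return int(bits[1:9], 2), (1 << 23) | int(bits[9:], 2)

def subtract_positive(a_bits: str, b_bits: str) -> str:
    """a - b for positive normal inputs with a > b (no rounding handled)."""
    exp_a, sig_a = split(a_bits)
    exp_b, sig_b = split(b_bits)
    sig_b >>= exp_a - exp_b      # align B to A's exponent
    diff = sig_a - sig_b         # subtract the aligned significands
    exp = exp_a
    while diff < (1 << 23):      # normalize until the hidden bit is set again
        diff <<= 1
        exp -= 1
    fraction = diff & ((1 << 23) - 1)  # drop the hidden bit
    return "0" + format(exp, "08b") + format(fraction, "023b")

A = "01000001000010000000000000000000"  # 8.5
B = "01000000000100000000000000000000"  # 2.25
result = subtract_positive(A, B)
print(result, bits_to_float(result))
```

For these inputs the alignment shift discards only zero bits, so the result is exact. A real implementation would also keep guard bits and round.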
Sanone
-1
votes
3 answers

Add Two 32 bit Floating Point Numbers with AVR-Assembler

I'm trying to use AVR Studio to add two 32-bit floating point numbers together. I know that I will need to store the 32-bit number in 4 separate 8-bit registers. I'll then need to add the registers together using the carry flag. This is what I have so…
Supercreature