Is this correct first
The Java float
format is IEEE-754 binary32. In this format, every finite number is represented as a sign, a 24-bit integer, and a scaling by a power of two from 2−149 to 2104. The integer part is called the significand. (The format is often described as a sign, a 24-bit number with a binary point after the first bit, so it has a value in [0, 2), and a scaling from 2−126 to 2127. These are mathematically equivalent, and the format used here is noted in the IEEE-754 standard as an option.) In normal form, the 24-bit integer is 223 or greater. (Representable numbers less than 2−126 cannot be represented in normal form and are necessarily subnormal.)
In this format, 590 can be represented as +590•20 or +8,339,456•2−14. 490 is +490•20 or +16,056,320•2−15.
Their product is +289,100•20 or +9,251,200•2−5.
391 is +391•20 or +12,812,288−15.
The ordinary arithmetic product of +289,100•20 and +391•20 is +113,038,100•20. However, 113,038,100 is not a 24-bit number; it is a 27-bit number. To get it under 224, we can adjust the scaling, multiplying the significand by ⅛ and multiplying the scaling by 8 = 23.
That gives us +14,129,762.5•23. However, now the significand is not an integer. This result is not representable in the float
format. To produce a result, the operation of adding in the float
format is defined to round the ordinary arithmetic to the nearest representable value. In this case, there is a tie, we could round the .5 up or down. Ties are resolved by rounding to make the low digit even, so we round to +14,129,762•23.
+14,129,762•23 is 113,038,096. That is the result you got, so it is correct.
Does always float gives wrong numbers ??
This is not wrong; the computer behaved according to its specification.
Observe float
is a 32-bit format, but there are infinitely many real numbers. There are even infinitely many rational numbers. It is impossible for a 32-bit format to produce the same results as theoretical real-number arithmetic or rational-number arithmetic. There are simply more possible results than there are representable values.
This is true of the 64-bit double
format as well. It is also true of integer formats, fixed-precision formats, and all numerical formats with a fixed number of bits. A fixed number of bits cannot represent infinitely many values.
Your comments suggest you thought floating-point would produce approximate results for fractional values, numbers less than one. But the limitation on how many values can be represented applies at all scales. At each scale (each power of two), only 224 values are representable (223 in normal form). For scale 20, all the non-negative integers below 224 are representable. But, above that, only some of the integers are representable. At first, we have to skip every second integer, then every fourth, then every eighth, and so on.
Floating-point arithmetic is designed to approximate real-number arithmetic. It should be used when you want to approximate real-number arithmetic. It should not be used, with rare exceptions, when you want exact arithmetic.