3

I need to divide a set of 5U11 numbers by 6.02 and would prefer to do this without casting to float and back.

5U11 means a 16-bit unsigned number, with the 11 least significant bits representing fractional part.

How should I represent 6.02 and what would be the upper bond of the error of a single calculation?

Vorac
  • 8,726
  • 11
  • 58
  • 101

2 Answers2

2

A simple scaling by 100 will suffice.

uns16_t x_5U11;
uns32_t acc;
acc = x_5U11;
acc *= 100;
acc += 301; // for round to nearest rather than truncation.
acc /= 602;

Error bound: 1/2 LSbit in x_5U11.

--

If speed is most important, than performing a multiple and divide (by shifting) as suggested by @alastai is the way to go. With proper rounding, the answer should be within +/- 1 LSBit.

If accuracy is most important than this method provides +/- 1/2 LSbit (best possible answer).

[Edit] Thanks to @Ingo Leonhardt for pointing out I had an inverted solution.

chux - Reinstate Monica
  • 143,097
  • 13
  • 135
  • 256
  • It’s also a very slow way to multiply, which is why you would never do the above in practice. Most likely you’d write 6.02 in 5U11 (or maybe 3U13), do the multiplication into a 32 bit word and use a shift to rectify the answer. – al45tair Feb 25 '14 at 15:58
  • 1
    @alastair Agree the tis slower to divide by 602 than shift. But speed is not OP's stated goal. The above creates a well defined most accurate answer. Its the classic question speed vs. accuracy. – chux - Reinstate Monica Feb 25 '14 at 16:01
  • Accuracy hardly seems relevant given that this answer is just plain wrong (it’s a multiply rather than a divide). Additionally, fixed point multiplication is quite easy to characterise (and if you wanted to do round to nearest, you could just add `0x8000` after the multiplication). – al45tair Feb 25 '14 at 16:15
  • @alastai This answer is correct. Provide example that failed to back-up your assertion. Op asked "what would be the upper bond of the error of a single calculation"? Thus accuracy is relevant to the OP. By dividing, the error is held to with in 1/2 LSBit - the best possible answer. Multiplying by `(2^18 / 6.02)` and then shifting (with or without a rounding factor) does not always provide the best answer. Try 3.0 or 0x1800 in OP's 5U11. My answer provides 0x03FD (0.49853...). Your method provides 0x3FC (0.49804...) and 3/6.02 is 0.49833... – chux - Reinstate Monica Feb 25 '14 at 18:18
  • 1
    My apologies — I hadn’t spotted that you’d updated the text in the answer to divide instead of multiply. Also, your example of *my* method is incorrect; if you’re shifting by 18 bits rather than 16 bits, you should add 0x20000, not 0x8000, and in that case “my” method provides 0x03fd also. – al45tair Feb 26 '14 at 08:42
2

The most straightforward approach to this problem is to calculate the inverse of 6.02 as a 16 bit quantity; i.e. compute round(2^16 / 6.02) = 0x2a86. Note that the top bit is not set, so we can choose a higher dividend and recompute for better accuracy; in this case, round(2^18 / 6.02) = 0xaa1a.

Now, take your 5U11 number and do a 16x16 to 32-bit widening multiply, then shift right by (in this case) 18 bits to get your result, as a 5U11 value.

For example:

14.3562 * (2^18 / 6.02) = 625148.122 / 2^18 = 2.384
0x72d9  * 0xaa1a        = 0x4c4fc40a >> 18  = 0x1313

You do lose a little accuracy doing things this way, and it’s possible to improve slightly on this naïve method (see the book Hacker’s Delight by Henry S. Warren for quite a bit more on this topic and other useful things).

Plainly if you have a machine that is capable of doing wider multiplications, you can increase the size of the dividend further than 2^18, which will increase your accuracy.


Updates

Rounding

If you want to round to nearest, you should add d / 2 where d is your dividend (so in the example above, the dividend is 2^18, hence the rounding value is 2^17 or 0x20000.

Error analysis

Given the small domain, it’s simplest to do an exhaustive search to establish the maximum error. With the example above and using round-to-nearest by adding 0x20000, the maximum error turns out to be at x = 0xfa19:

0xfa19 * 0xaa1a + 0x20000 = 0xa62e008a >> 18 = 0x298c

The actual answer should be

31.2622 / 6.02 = 5.193058

while the answer we have is

0x298c * 2^-11 = 5.193359

The error in this instance is 0.000302, or 0.62 of an LSB.

Improving on these results

It’s possible to choose a more specific rounding constant to minimise the error bound; essentially this lets us compensate for the fact that our multiplicative inverse (here 0xaa1a) is not precise. In this particular instance, the best value seems to be around 0x1c200, which yields an error bound of 0.56 of an LSB.

Community
  • 1
  • 1
al45tair
  • 4,405
  • 23
  • 30
  • With a round of 0x8000 and 32-bit multiply, the largest error occurs at x = 31.158203125 (0xF944) with a result of 5.17529296875 (0x2967) which mathematically should be 5.17578125 (0x2968) or 1.0 bits off. – chux - Reinstate Monica Feb 25 '14 at 18:42
  • @chux That’s because you should have added 0x20000, not 0x8000 (because the shift chosen here is 18 bits, not 16 bits). So, actually, you get (0xf944 * 0xaa19 + 0x20000) >> 18 = 0x2968. – al45tair Feb 26 '14 at 08:45
  • I also tried `0x20000` (which does make more sense) and had an answer that was even more than 1.0 bits off, (about 1.2). will post later. – chux - Reinstate Monica Feb 26 '14 at 08:48
  • @chux Unless I'm mistaken, you're out by a factor of 2. Maximum error using the rounding value 0x20000 is 0.63 of an LSB (at 0xff6b = 31.9272). Still not perfect (I did note that you lose a little accuracy doing things this way), but less than 1 LSB. – al45tair Feb 26 '14 at 08:56
  • Using `f(unsigned x) { x *= 0xaa19; x += 0x2000; x >>= 18; return x;}` I too came up with 0xff6b = 31.9272 and a error of e:0.63. But at ffa4 = 31.955078, I came up with `e:1.096`. Perhaps I am not implementing you function correctly. Curiously, when I used 0x8000, the worst case was e 1.0 as previously noted. – chux - Reinstate Monica Feb 26 '14 at 09:08
  • @chux Did you miss a zero there? Should be `x += 0x20000;`, not `x += 0x2000`. – al45tair Feb 26 '14 at 09:17
  • I think I see why its > 1.0. Your scaling factor 43545.514... is truncated to 0xaa19 rather than rounded to 0xaa1a. By using this and a round factor of 0x2000, the worst e:0.97 bits at x=0x03c3 0.470215 – chux - Reinstate Monica Feb 26 '14 at 09:17
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/48449/discussion-between-chux-and-alastair) – chux - Reinstate Monica Feb 26 '14 at 09:18
  • 1
    Added thought on offset calc: The difference between the multiplier used 0.166114807 (0xAA1A/2^18) and the correct multiplier 0.166112957 (1/6.02) imparts a linear bias at x=0 of 0 and at 65535 a bias of (mu - cm)*range or 0.1212 or (< 1/2 * 1/4). By adjusting the 0x20000 offset, this bias is split in 2. So now near 0, the bias is -0.0606 and near 65535, its +0.0606. Thus we should expect a worst case error at _about_ 0.5606 bits. The _approximate offset_ derives directly from 0.1212 = (0x20000 * (1 - 0.1212)). Have yet to derive _best offset_ except by experimentation. – chux - Reinstate Monica Feb 26 '14 at 16:49