Single precision in Fixed Point Arithmetic

Question

I need up to 6 decimal places precision for a Taylor series calculation using fixed point arithmetic. I have tried different fixed point format for achieving 6 decimal places precision.

For example, Using s16.15 (Left shift by 15) format I have got up to 2 decimal places precision.1 sign bit,16 integer bits and 15 fraction bits.

For s8.23 (Left shift by 23) format up to 4 decimal places and with s4.27 (Left shift by 27) format the precision is still the same. I was expecting the situation will improve.

The following is a Taylor Series expansion to calculate natural logarithm around a certain point a.

So q=x-a, x is the user input between 1 and 2.

  // These are converted constants into s4.27 fixed point format
  const int32_t con=0x0B8AA3B3; //1.44269504088895
  const int32_t c0=0x033E647E; //0.40546510810816
  const int32_t c1=0x05555555; //0.66666666666666
  const int32_t c2=0x01C71C72; //0.222222222222
  const int32_t c3=0x00CA4588; //0.0987654321
  const int32_t c4=0x006522C4; //0.04938271605
  const int32_t c5=0x0035F069; //0.02633744856
  const int32_t c6=0x001DF757; //0.01463191587

//Expanded taylor series    
taylor=c0+mul(q,(c1-mul(q,(c2+mul(q,(c3-mul(q,(c4-mul(q,(c5+mul(q,c6)))))))))));
// Multiplication function
int32_t mul(int32_t x, int32_t y)
{
int32_t mul;
mul=((((x)>>13)*((y)>>13))>>1); // for s4.27 format, the best possible right shift
return mul;
}

Above mentioned code snippets were used in C.

Result I need: 0.584963 but the result I got is: 0.584949

How can I achieve more precision?

Are you sure the error you're seeing is due to limited precision, as opposed to the fact that you're using an approximation formula? — Oliver Charlesworth, Nov 19 '17 at 00:11
However, it appears that your `mul` function is immediately dropping 13 bits of each argument, which is probably pretty bad for precision. — Oliver Charlesworth, Nov 19 '17 at 00:23
(a) State what the input is. (b) Provide a [Minimal, Complete, and Verifiable Example](https://stackoverflow.com/help/mcve). (c) Do the calculation with `double` arithmetic or 64-bit integer arithmetic to test whether the problem is due to the precision of the arithmetic or the quality of the polynomial you are using. It could also help to state the original function you are approximating with a Taylor series, so people could check your coefficients. — Eric Postpischil, Nov 19 '17 at 00:27
This seems to be some sort of computation related to logarithms (`c0 == log(1.5)` but I can't tell what is being computed. To maximize precision in such fixed-point computation one usually (1) varies the fixed-point format from term to term; this requires knowing the input domain to avoid overflows (2) uses a `mulhi` instruction or wide multiply (with full double-width product) so all bits of the original argument enter into each product. You'd be better off using a minimax approximation instead of a Taylor series expansion. — njuffa, Nov 19 '17 at 02:32
The comments do not reflect code. `c2=0x01C71C72;` --> `0.222222223878...`, not `0.222222222222` — chux - Reinstate Monica, Nov 19 '17 at 03:33
Good catch. The comment implies log(1.5), but the fixed-point value looks like 0.6165... — njuffa, Nov 19 '17 at 03:46
@OliverCharlesworth My task is to find out which approximation formula works better. There are better approximation formulas but I have to check every possibilities. Thanks for your tips. EricPostpischil: Thanks for the suggestion. I would try to arrange and post accordingly. njuffa: After taylor I will have to go for minimax approximation as well. Thanks. chux: I have updated the value of q and log base 2 of 1.5 is 0.584963. — Shaown, Nov 20 '17 at 00:08

chux - Reinstate Monica · Accepted Answer · 2017-11-19T16:13:48.763

OP's mul() throws away too much precision.

(x)>>13)*((y)>>13) immediately discards the least significant 13 bits of x and y.

Instead, perform a 64-bit multiply

int32_t mul_better(int32_t x, int32_t y) {
  int64_t mul = x;
  mul *= y;
  mul >>= 27;

  // Code may want to detect overflow here (not shown)

  return (int32_t) mul;
}

Even better, round the product to nearest (ties to even) before discarding the least significant bits. Simplifications are possible. Verbose code below as it is illustrative.

int32_t mul_better(int32_t x, int32_t y) {
  int64_t mul = x;
  mul *= y;
  int32_t least = mul % ((int32_t)1 << 27);
  mul /= (int32_t)1 << 27;
  int carry = 0;
  if (least >= 0) {
    if (least >  ((int32_t)1 << 26) carry = 1;
    else if ((least ==  ((int32_t)1 << 26)) && (mul % 2)) carry = 1;
  } else {
    if (-least > ((int32_t)1 << 26) carry = -1;
    else if ((-least ==  ((int32_t)1 << 26)) && (mul % 2)) carry = -1;
  }
  return (int32_t) (mul + carry);
}

int32_t mul(int32_t x, int32_t y) {
  int64_t mul = x;
  mul *= y;
  return mul >> 27;
}

void foo(double x) {
  int32_t q = (int32_t) (x * (1 << 27));  // **
  int32_t taylor =
      c0 + mul(q, (c1 - mul(q, (c2  + mul(q,
      (c3 - mul(q, (c4 - mul(q, (c5 + mul(q, c6)))))))))));
  printf("%f %f\n", x,  taylor * 1.0 / (1 << 27));
}

int main(void) {
  foo(0.303609);
}

Output

0.303609 0.584963

** Could round here too rather than simply truncate the FP to an integer.

Single precision in Fixed Point Arithmetic

1 Answers1