0

I have a problem with understanding fixed-point arithmetic and its implementation in C++. I was trying to understand this code:

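#define scale 16  // number of fractional bits

// Convert a double to fixed point by multiplying by 2^16 (65,536).
int DoubleToFixed(double num){
    return num * ((double)(1 << scale));
}

// Convert fixed point back to a double by dividing by 2^16.
double FixedToDouble(int num){
    return (double) num / (double)(1 << scale);
}

// Convert an int to fixed point; the shift multiplies by 2^16.
int IntToFixed(int num){
    return num << scale;
}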

I am trying to understand exactly why we shift. I know that shifting to the right is basically multiplying that number by 2^x, where x is the number of positions we want to shift or scale by, and shifting to the left is basically division by 2^x.

But why do we need to shift when we convert from int to fixed point?

Remy Lebeau
th3plus
  • 3
    c/c++ isn't a thing. Pick a language (these are different), and provide a [mcve] regarding your concern as absolutely required here. – πάντα ῥεῖ Nov 11 '22 at 16:59
  • 2
    `1 << scale` for that value of `scale` does nasty things on a system with a 16 bit `int`. The author needs a good talking to. Just write 65536, or 0x10000, pretty please with sugar on top. – Bathsheba Nov 11 '22 at 17:01
  • 1
    Fixed-point arithmetic typically involves multiplying and dividing by a *scale factor*. Theoretically the scale factor can be anything, although typically it is either a power of 10 or a power of two. `1< – Steve Summit Nov 11 '22 at 17:07
  • @πάνταῥεῖ I just mentioned that I am implementing it in C/C++. I just want to understand the idea of shifting bits in fixed-point arithmetic. Thanks – th3plus Nov 11 '22 at 17:09
  • @πάνταῥεῖ I am saying that the implementation is the same in C or C++. Thanks – th3plus Nov 11 '22 at 17:11
  • 1
    Careful with that assumption. I've seen some very poorly chosen operator overloads in C++ code. – user4581301 Nov 11 '22 at 17:15
  • @user4581301 and as a shifting operator in this meaning – th3plus Nov 11 '22 at 17:16
  • Just to hammer home what the answers and some of the other comments are suggesting, the shift itself is not important to fixed point. It is just a means to an end, producing a multiplier. The number produced by the shift is what is important. – user4581301 Nov 11 '22 at 17:21
  • If it is any help, I'm making a programming language called **C/C++**, which is based on **OCaml**. I haven't released it yet, but I can assure you the above code is not **C/C++** compliant. That code looks more like **C** code. – Eljay Nov 11 '22 at 17:21
  • @Eljay you can try it, anyway thank you, – th3plus Nov 11 '22 at 17:22
  • Pedantic: Shifting right is division, shifting left is multiplication. Your understanding is not correct. Shifting left: `1 << 5`, shifting right: `5 >> 2`. – Thomas Matthews Nov 11 '22 at 17:30
  • 1
    @πάνταῥεῖ: A minimal reproducible example is needed for debugging questions, not for questions like this one about what a particular piece of code does. It is not “absolutely required.” – Eric Postpischil Nov 11 '22 at 17:31
  • Imagine you have the fixed point number `456` and it represents `4.56`. You'll need to pack `4.56` into `456` (that's like what `DoubleToFixed` does). And to use it, you'll need to unpack the `456` into `4.56` (that's like what `FixedToDouble` does). – Eljay Nov 11 '22 at 17:36
  • Please take a look at [ldexp](https://man7.org/linux/man-pages/man3/ldexp.3.html) for a cleaner solution. – rici Nov 11 '22 at 18:05

2 Answers

2

A fixed-point format represents a number as an integer multiplied by a fixed scale. Commonly the scale is some base b raised to some power e, so the integer f would represent the number f•b^e.

In the code shown, the scale is 2^−16 or 1/65,536. (Calling the shift amount scale is a misnomer; 16, or rather −16, is the exponent.) So if the integer representing the number is 81,920, the value represented is 81,920•2^−16 = 1.25.

The routine DoubleToFixed converts a floating-point number to this fixed-point format by multiplying by the reciprocal of the scale; it multiplies by 65,536.

The routine FixedToDouble converts a number from this fixed-point format to floating-point by multiplying by the scale or, equivalently, by dividing by its reciprocal; it divides by 65,536.

IntToFixed does the same thing as DoubleToFixed except for an int input.
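
As a rough illustration of the three conversions described above, here is a minimal sketch assuming 16 fractional bits as in the question. The names `kFracBits`, `kOne`, and `CheckConversions` are made up for this example, and `IntToFixed` multiplies by 65,536 instead of shifting so that negative inputs stay well defined on pre-C++20 compilers; for non-negative values that fit, the result is the same as `num << 16`.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 16 fractional bits, so the stored integer 65,536 represents 1.0.
constexpr int kFracBits = 16;
constexpr std::int32_t kOne = std::int32_t{1} << kFracBits;

// Multiply by the reciprocal of the scale (65,536); the cast truncates toward zero.
constexpr std::int32_t DoubleToFixed(double num) {
    return static_cast<std::int32_t>(num * kOne);
}

// Multiply by the scale 2^-16, i.e. divide by 65,536.
constexpr double FixedToDouble(std::int32_t num) {
    return static_cast<double>(num) / kOne;
}

// Same scaling as DoubleToFixed, done in integer arithmetic.
constexpr std::int32_t IntToFixed(std::int32_t num) {
    return num * kOne;
}

// Hypothetical self-check using the numbers from the text.
inline void CheckConversions() {
    assert(DoubleToFixed(1.25) == 81920);
    assert(FixedToDouble(81920) == 1.25);
    assert(IntToFixed(3) == 196608);
    assert(FixedToDouble(IntToFixed(3)) == 3.0);
    // An int stored without scaling is read back as a tiny fraction:
    // 32,768 represents 32,768 / 65,536 = 0.5, not 32,768.
    assert(FixedToDouble(32768) == 0.5);
}
```

The last assertion is the reason the shift is needed even for an `int` input: the stored integer always represents a value scaled by 2^−16, so a plain 3 would mean 3/65,536 rather than 3.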

Eric Postpischil
  • A thorough chap such as your good self might want to point out the pitfalls of writing `1 << 16`. To me it fits into the `unsigned n = -1e1;` category of poor things to do. – Bathsheba Nov 11 '22 at 17:50
  • @Eric Postpischil what I don't understand exactly is why we even need to scale in the first place even to convert a normal integer value to a fixed-point – th3plus Nov 11 '22 at 19:59
  • @th3plus: In the fixed-point format, the number is being used to represent a scaled value. If you had some integer, say 32,768, and stored it in the format without adjusting it, then 32,768 would represent 32,768/65,536 = ½. Remember, even if the data for the format is stored as an integer type, it is not being used as an integer. It **represents** a different number, one that is scaled from the stored integer. When putting a value into the format, you must adjust it for this scaling. – Eric Postpischil Nov 12 '22 at 13:54
-1

Fixed-point arithmetic works on the concept of representing numbers as an integer multiple of a very small "base". Your case uses a base of 1/(1<<scale), aka 1/65536, which is approximately 0.00001525878.

So the number 3.141592653589793 could be represented as 205887.416146 units of 1/65536, and so would be stored in memory as the integer value 205887 (which is really 3.14158630371, because the fractional part of a unit is dropped during conversion).

The way to convert a fractional value to fixed point is simply to divide the value by the base: 3.141592653589793 / (1/65536) = 205887.416146. (Notably, this reduces to 3.141592653589793 * 65536 = 205887.416146.) Because the base involves a power of two, and multiplying an integer by a power of two is the same as left shifting by the exponent, multiplication by 2^16, aka 65536, can be calculated faster by simply shifting left 16 bits. This is really fast, which is why most fixed-point calculations use an inverse power of two as their base.
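
The arithmetic above can be checked with a small sketch. It assumes a base of 1/65536; the helper names `ToFixed`, `FromFixed`, `ShiftMatchesMultiply`, and `RoundTripError` are invented for illustration and are not part of the question's code.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Assumption: 16 fractional bits, so one whole unit is 65,536 stored units.
constexpr double kUnitsPerOne = 65536.0;

// Dividing by the base 1/65,536 is the same as multiplying by 65,536.
// The cast drops the fractional part of a unit.
inline std::int32_t ToFixed(double value) {
    return static_cast<std::int32_t>(value * kUnitsPerOne);
}

inline double FromFixed(std::int32_t units) {
    return units / kUnitsPerOne;
}

// For unsigned integers, shifting left by 16 and multiplying by 65,536 agree.
inline bool ShiftMatchesMultiply(std::uint32_t n) {
    return (n << 16) == (n * 65536u);
}

// Hypothetical helper: distance between a value and what is actually stored.
inline double RoundTripError(double value) {
    return std::fabs(value - FromFixed(ToFixed(value)));
}
```

With these definitions, `ToFixed(3.141592653589793)` yields 205887, and the round-trip error stays below one unit of the base, 1/65536.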

Because float values cannot be shifted, your methods convert the multiplier to a floating-point value and do a floating-point multiplication, but other operations, such as fixed-point multiplication and division themselves, would be able to take advantage of this shortcut.
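
To make the point about fixed-point multiplication and division concrete, here is one possible sketch, assuming a 32-bit value with 16 fractional bits. The alias `Fixed` and the functions `FixedMul` and `FixedDiv` are hypothetical names, not something from the question. Right-shifting a negative value is only guaranteed to be an arithmetic shift since C++20, although mainstream compilers have long behaved that way.

```cpp
#include <cassert>
#include <cstdint>

using Fixed = std::int32_t;      // hypothetical alias: 16 integer bits, 16 fractional bits
constexpr int kFracBits = 16;

// The raw product of two fixed-point values carries 32 fractional bits,
// so widen to 64 bits and shift right by 16 to return to 16 fractional bits.
inline Fixed FixedMul(Fixed a, Fixed b) {
    const std::int64_t wide = static_cast<std::int64_t>(a) * b;
    return static_cast<Fixed>(wide >> kFracBits);
}

// Pre-scale the dividend by 2^16 so the quotient keeps 16 fractional bits.
// Multiplication is used here instead of << to avoid shifting a negative value left.
// The caller must ensure b is not zero.
inline Fixed FixedDiv(Fixed a, Fixed b) {
    const std::int64_t wide = static_cast<std::int64_t>(a) * (std::int64_t{1} << kFracBits);
    return static_cast<Fixed>(wide / b);
}
```

For example, 1.5 is stored as 98304 and 2.5 as 163840; `FixedMul` of the two gives 245760, which represents 3.75, and no floating-point operation is involved.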

Theoretically, one can use shifting bits with floats to do the conversion functions faster than simply floating point multiplication, but most likely, the compiler is actually already doing that under the covers.

It is also common for code to use an inverse power of ten as the base, primarily for money, which usually uses a base of 0.01, but these cannot use a single shift as a shortcut and have to do slower math. One shortcut for multiplying by 100 is (value<<6) + (value<<5) + (value<<2) (this is effectively value*64+value*32+value*4, which is value*(64+32+4), which is value*100), and three shifts and two adds are sometimes faster than one multiplication. Compilers already apply this shortcut under the covers if 100 is a compile-time constant, so in general, nobody writes code like this anymore.
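
The shift-and-add decomposition of 100 can be sketched as follows. The function name `Times100ByShifts` is made up for this example. Note that each shift needs its own parentheses: in C and C++ `+` binds more tightly than `<<`, so an unparenthesized version would compute something else entirely.

```cpp
#include <cassert>
#include <cstdint>

// 100 = 64 + 32 + 4, so value*100 = (value<<6) + (value<<5) + (value<<2).
// Unsigned arithmetic is used so the shifts are well defined for every input.
inline std::uint32_t Times100ByShifts(std::uint32_t value) {
    return (value << 6) + (value << 5) + (value << 2);
}

// The plain form, which a compiler is free to lower to shifts and adds itself.
inline std::uint32_t Times100(std::uint32_t value) {
    return value * 100u;
}
```

Both functions return the same result for every input, which is why writing `value * 100` and leaving the choice to the compiler is usually the clearer option.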

Mooing Duck
  • Re “Multiplication by a power-of-two is the same as simply left shifting by that many bits”: `DoubleToFixed` and `FixedToDouble` perform floating-point multiplications or divisions. The multiplication or division is not the same as shifting. – Eric Postpischil Nov 11 '22 at 17:22
  • Ah, yes, my explanation is math based, which, if translated directly to processors, would require an `operator*(float, int)`. The lack of such an operation doesn't invalidate my math, it merely means the conversion operations can't take direct advantage of the shifting. I'll note that explicitly. – Mooing Duck Nov 11 '22 at 17:25