Simulating Floating Point Multiplication in C using Bitwise Operators

Question

I have to write a program that will simulate floating point multiplication. For this program, we assume that a single precision floating point number is stored in unsigned long a. I have to multiply the number stored in a by 2 using only the following operators: << >> | & ~ ^

I understand the functions of these operators, but I'm confused on the logic of how to go about implementing this. Any help would be greatly appreciated.

Post your first attempt at an implementation here! What are your doubts and/or issues? — Sir Jo Black, Feb 09 '19 at 21:21
Multiplying by 2 means that you need to add 1 to the exponent. To do that, you'll need to understand the [format of a floating point number](https://en.wikipedia.org/wiki/Single-precision_floating-point_format#IEEE_754_single-precision_binary_floating-point_format:_binary32). — user3386109, Feb 09 '19 at 21:21
Are you sure you cannot use addition? (Please don't assume I mean number+number; I understand the problem and its solution.) — bruno, Feb 09 '19 at 21:30
"program that will simulate floating point multiplication." Does code need to handle all `float` values? If not, what subset of `float` does code need to simulate? — chux - Reinstate Monica, Feb 10 '19 at 00:46
@chux I believe that since we are given an `unsigned long` to simulate a float value with a single point of precision, we're supposed to handle all that could be simulated — , Feb 10 '19 at 00:49
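
For reference while reading the answers, here is a minimal sketch of the binary32 layout that user3386109's comment points to: 1 sign bit (bit 31), 8 biased exponent bits (bits 30-23) and 23 fraction bits (bits 22-0). It assumes `float` is IEEE-754 binary32; the names `f32_fields` and `split_f32` are made up for illustration and do not come from the question.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the three fields of an IEEE-754 binary32 value. */
struct f32_fields {
    uint32_t sign;     /* bit 31 */
    uint32_t exponent; /* bits 30..23, biased by 127 */
    uint32_t fraction; /* bits 22..0 */
};

/* Split a float into its fields using only shifts and masks.
   Assumes float is 32 bits wide and encoded as binary32. */
struct f32_fields split_f32(float f)
{
    uint32_t bits;
    struct f32_fields out;
    memcpy(&bits, &f, sizeof bits);      /* read the bit pattern without aliasing issues */
    out.sign     = bits >> 31;
    out.exponent = (bits >> 23) & 0xFFu;
    out.fraction = bits & 0x007FFFFFu;
    return out;
}
```

Under these assumptions, `split_f32(1.5f)` and `split_f32(3.0f)` have the same sign and fraction, with exponent fields 127 and 128. That is the "add 1 to the exponent" observation from the comments.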

score 3 · Answer 1 · edited Jun 20 '20 at 09:12

have to multiply the number stored in a by 2 using only the following operators: << >> | & ~ ^

since we are given an unsigned long to simulate a float value with a single point of precision, we're supposed to handle all that could be simulated. ref

First, let us assume that float is encoded as binary32 and that unsigned is 32-bit. C does not require either of these.

Next, isolate the exponent to deal with the float sub-groups: sub-normal, normal, infinity and NaN.

Below is some lightly tested code. I'll review it later; for now, consider it a pseudo-code template.

#define FLT_SIGN_MASK  0x80000000u
#define FLT_MANT_MASK  0x007FFFFFu
#define FLT_EXPO_MASK  0x7F800000u
#define FLT_EXPO_LESSTHAN_MAXLVAUE(e)   ((~(e)) & FLT_EXPO_MASK)
#define FLT_EXPO_MAX   FLT_EXPO_MASK
#define FLT_EXPO_LSBit 0x00800000u

unsigned increment_expo(unsigned a) {
  unsigned carry = FLT_EXPO_LSBit;
  do {
    unsigned sum = a ^ carry;
    carry = (a & carry) << 1;
    a = sum;
  } while (carry);
  return a;
}

unsigned float_x2_simulated(unsigned x) {
  unsigned expo = x & FLT_EXPO_MASK;
  if (expo) { // x is a normal, infinity or NaN
    if (FLT_EXPO_LESSTHAN_MAXLVAUE(expo)) { // x is a normal
      expo = increment_expo(expo);  // Double the number
      if (FLT_EXPO_LESSTHAN_MAXLVAUE(expo)) { // no overflow
        return (x & (FLT_SIGN_MASK | FLT_MANT_MASK)) | expo;
      }
      return (x & FLT_SIGN_MASK) | FLT_EXPO_MAX;
    }
    // x is an infinity or NaN
    return x;
  }
  // x is a sub-normal
  unsigned m = (x & FLT_MANT_MASK) << 1;  // Double the value
  if (m & FLT_EXPO_LSBit) {
    // Doubling caused sub-normal to become normal
    // Special code not needed here: the "carry" out of the mantissa becomes an exponent of 1.
  }
  return (x & FLT_SIGN_MASK) | m;
}
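
The sub-normal branch relies on the fact that shifting the fraction left lets its top bit spill into bit 23, which is exactly the encoding of the smallest normal exponent. A minimal sketch of that one step, assuming binary32 bit patterns held in a `uint32_t` (the name `double_subnormal_bits` is illustrative and not part of the answer):

```c
#include <stdint.h>

/* Double a zero or sub-normal binary32 bit pattern (exponent field == 0).
   The caller is assumed to have checked the exponent field already. */
uint32_t double_subnormal_bits(uint32_t x)
{
    uint32_t sign = x & 0x80000000u;        /* keep the sign bit as it is */
    uint32_t frac = (x & 0x007FFFFFu) << 1; /* doubling may carry into bit 23 */
    return sign | frac;
}
```

With this sketch, `0x00400000` (half of the smallest normal) maps to `0x00800000` (the smallest normal, `FLT_MIN`), `0x00000001` maps to `0x00000002`, and a negative zero `0x80000000` stays unchanged.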

UV of course, having a slow Sunday? — chqrlie, Feb 10 '19 at 12:33
@chqrlie Just livin' the dream. — chux - Reinstate Monica, Feb 10 '19 at 17:39

Sir Jo Black · Answer 2 · 2019-02-10T00:53:02.777

1

This is simple code using the + operator. It doesn't attempt to cover every aspect of floating-point handling. It shows that incrementing the exponent of a single-precision float by 1 multiplies the value by 2. In the binary32 layout the biased exponent field spans bits 23-30; this code increments it starting at bit 23 and preserves bit 30, which it treats as the exponent sign, together with bit 31.

This code uses bitwise operators only to preserve the sign bits and to guard against a possible exponent overflow.

#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

#include <inttypes.h>

int main()
    {
        float f = 23.45F;

        uint32_t *i=(uint32_t *)(&f);
        uint32_t app;

        printf("%08X %f\n",*i,f);

        app = *i & (0xC0000000); // copies bits 31 and 30
        *i += (1U<<23);
        *i &= ~(0xC0000000);     // leave bits 31 and 30
        *i |= app;               // set original bits 31 and 30


        printf("%08X %f\n",*i,f);

        return 0;
    }

See also: Wikipedia Single-precision floating-point

edited Feb 10 '19 at 00:53

answered Feb 09 '19 at 21:44


the OP did not yet confirm that + is allowed; currently he says only "<< >> | & ~ ^" are – bruno Feb 09 '19 at 21:50
I know, but ... I don't see a simple way to do that with bitwise operators ... I've to think! :) – Sir Jo Black Feb 09 '19 at 21:50
you do not want to use a `int exp = ... ; switch (exp) { ... case 23: exp = 24; break; case 24: exp = 25; break; ... } ...` or equivalent, so strange ^^ – bruno Feb 09 '19 at 21:53
I also think he needs bitwise operators to manage the signs, because adding 1 can cause an overflow that modifies a sign (the exponent sign and the number sign). ;) – Sir Jo Black Feb 09 '19 at 21:55
the _&_ is allowed to do the mask – bruno Feb 09 '19 at 21:56
@bruno thanks for the answer and help guys, but Jo's right, addition is not allowed. This is one reason why it's been puzzling me so much lol – Feb 09 '19 at 22:02
@vastImmortalSuns if really the + is not allowed and you do not want to use the switch case of my joke you can have `int incr[] = { 1, 2, 3, 4 ..., 254, 255, 256};` and use it to increment the exp. It is also possible to implement the incr by hand with only the allowed operators – bruno Feb 09 '19 at 22:18
There's a solution using XOR and Carry! ;) – Sir Jo Black Feb 09 '19 at 22:32
Fails for values near `FLT_MAX`, NAN, sub-normals, perhaps 0. Anti-aliasing issues apply as well as poor portability. – chux - Reinstate Monica Feb 10 '19 at 00:28
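
To make chux's comment concrete, the masking steps of this answer can be wrapped in a function and fed raw bit patterns. This is only a probe written for illustration (the name `answer2_double_bits` is made up). It uses `+` just like the answer does, and it assumes binary32 patterns held in a `uint32_t`:

```c
#include <stdint.h>

/* Same steps as the answer above, applied to a raw binary32 bit pattern. */
uint32_t answer2_double_bits(uint32_t bits)
{
    uint32_t app = bits & 0xC0000000u; /* remember bits 31 and 30 */
    bits += (1u << 23);                /* add 1 to the exponent field */
    bits &= ~0xC0000000u;              /* drop whatever ended up in bits 31 and 30 */
    bits |= app;                       /* restore the original bits 31 and 30 */
    return bits;
}
```

Under these assumptions, `0x40000000` (2.0f) becomes `0x40800000` (4.0f) as intended. However, `0x00000000` (0.0f) becomes `0x00800000` (`FLT_MIN`) rather than zero, and `0x7F7FFFFF` (`FLT_MAX`) becomes `0x7FFFFFFF`, a NaN pattern rather than infinity. Because bit 30 is restored, `0x3FC00000` (1.5f) becomes `0x00400000`, a sub-normal, instead of 3.0f.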

Sir Jo Black · Answer 3 · 2019-02-10T00:51:50.243

1

Here is my code that uses bitwise operators.

This code multiplies a single-precision float by 2 by incrementing its exponent by 1, using only bitwise operators; it also keeps track of the exponent and number signs (bits 30 and 31).

It doesn't attempt to cover every aspect of floating-point handling.

Note that if the code changes bit 30 and/or bit 31, an overflow has occurred.

#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

#include <inttypes.h>

int main()
{
    float f = -23.45F;

    uint32_t *i=(uint32_t *)(&f);
    uint32_t sgn;
    uint32_t c,sc;

    printf("%08X %f\n",*i,f);

    sgn = *i & (0xC0000000); // copies bits 31 and 30

    c = *i & (1U<<23);
    *i ^= (1U<<23);

    while(c)
    {
        sc = c << 1;
        c = *i & sc;
        *i ^= sc;
    }

    if (sgn != (*i & 0xC0000000)) { // parentheses needed: != binds tighter than &
       puts("Exponent overflow");
    }

    printf("%08X %f\n",*i,f);

    return 0;
}

See also: Wikipedia Single-precision floating point

edited Feb 10 '19 at 00:51

answered Feb 09 '19 at 22:24


Many of the [same short-comings](https://stackoverflow.com/questions/54610832/simulating-floating-point-multiplication-in-c-using-bitwise-operators#comment96018621_54611083) - likely of scant concern for OP, but not a general solution. – chux - Reinstate Monica Feb 10 '19 at 00:29
@chux The code is an example and is to demonstrate how to increment the exponent of the float, not to cover all the aspect of the floating point management. – Sir Jo Black Feb 10 '19 at 00:40
Agree code does not cover all aspects, yet since the answer did not express any limitations, I commented to indicate various short-comings. – chux - Reinstate Monica Feb 10 '19 at 00:44

njuffa · Answer 4 · 2019-02-15T01:59:48.263

Function fpmul_by_2() below implements the desired functionality, under the assumptions that 'unsigned long' is a 32-bit integer type and 'float' is a 32-bit floating-point type mapped to IEEE-754 'binary32'. It is further assumed that we are to mimic IEEE-754 multiplication with exceptions disabled, producing the masked response prescribed by the standard.

Two helper functions are used that implement 32-bit integer addition and comparison for equality, respectively. The addition is based on the definition of sum and carry bits in binary addition (see this previous question for a detailed explanation), while equality comparison makes use of the fact that (a^b) == 0 iff a == b.
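
To illustrate the sum-and-carry definition, here is a small sketch that performs the same XOR/AND ripple addition while counting loop iterations. It assumes 32-bit operands in `uint32_t`. The names `add_trace` and `add_counting_steps` and the step counter are illustrative additions; the counter uses `++`, which the exercise itself would not allow.

```c
#include <stdint.h>

/* Hypothetical result type: the sum plus the number of loop iterations used. */
struct add_trace {
    uint32_t sum;
    unsigned steps;
};

/* Add a and b using XOR for the partial sum and AND plus shift for the carry. */
struct add_trace add_counting_steps(uint32_t a, uint32_t b)
{
    struct add_trace t = { 0u, 0u };
    uint32_t carry = b;
    do {
        t.sum = a ^ carry;        /* add without propagating carries */
        carry = (a & carry) << 1; /* carries move one position to the left */
        a = t.sum;
        t.steps++;                /* bookkeeping only, not part of the addition */
    } while (carry);
    return t;
}
```

Adding the exponent LSB `0x00800000` to `0x3F800000` (1.0f) takes 8 iterations, because the carry ripples through the seven set exponent bits 23-29 before settling in bit 30, and the result is `0x40000000` (2.0f).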

The processing of the floating-point argument needs to broadly distinguish three classes of operands: Denormals and zeros, normals, infinity and NaNs. Doubling of normals is accomplished by bumping the exponent, since we operate on a binary floating-point format. Overflow can occur, in which case infinity must be returned. Infinity and NaNs are returned unchanged, except that SNaNs are converted to QNaNs, which is the IEEE-754 prescribed masked response. Denormals and zeros are handled by literally doubling the significand. The handling of zeros, subnormals, and infinities may destroy the sign bit, so the sign bit of the argument is forced on the result.
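
The operand classes described above can be told apart from the exponent and fraction fields alone. The following is a minimal sketch under the same binary32 assumption; the enum and the function names (`classify_bits`, `quieten_if_nan`) are hypothetical, and it freely uses `==`, `!=` and `?:` for readability, where the restricted exercise would need a helper such as `eq()`.

```c
#include <stdint.h>

/* Hypothetical operand classes for a binary32 bit pattern. */
enum f32_class { CLS_ZERO_OR_SUBNORMAL, CLS_NORMAL, CLS_INF, CLS_NAN };

enum f32_class classify_bits(uint32_t a)
{
    uint32_t expo = a & 0x7F800000u; /* exponent field, still in place */
    uint32_t frac = a & 0x007FFFFFu; /* fraction field */
    if (expo == 0u)
        return CLS_ZERO_OR_SUBNORMAL;
    if (expo != 0x7F800000u)
        return CLS_NORMAL;
    return frac == 0u ? CLS_INF : CLS_NAN;
}

/* Set the quiet bit (bit 22) of a NaN; leave every other pattern untouched. */
uint32_t quieten_if_nan(uint32_t a)
{
    return classify_bits(a) == CLS_NAN ? (a | 0x00400000u) : a;
}
```

For instance, `0x7F800001` is a signalling NaN pattern and `quieten_if_nan` turns it into `0x7FC00001`, while the infinity pattern `0x7F800000` is returned unchanged.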

The test framework included below tests fpmul_by_2() exhaustively, which takes only a couple of minutes on a modern PC. I used the Intel compiler on an x64 platform running Windows.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// assumptions:
// 'unsigned long' is a 32-bit type 
// 'float' maps to IEEE-754 'binary32'. Exceptions are disabled

// add using definition of sum and carry bits in binary addition
unsigned long add (unsigned long a, unsigned  long b)
{
    unsigned long sum, carry;
    carry = b;
    do {
        sum = a ^ carry;
        carry = (a & carry) << 1;
        a = sum;
    } while (carry);
    return sum;
}

// return 1 if a == b, else 0
int eq (unsigned long a, unsigned  long b)
{
    unsigned long t = a ^ b;
    // OR all bits into lsb
    t = t | (t >> 16);
    t = t | (t >>  8);
    t = t | (t >>  4);
    t = t | (t >>  2);
    t = t | (t >>  1);
    return ~t & 1;
}

// compute 2.0f * a
unsigned long fpmul_by_2 (unsigned long a)
{
    unsigned long expo_mask = 0x7f800000UL;
    unsigned long expo_lsb  = 0x00800000UL;
    unsigned long qnan_mark = 0x00400000UL;
    unsigned long sign_mask = 0x80000000UL;
    unsigned long zero      = 0x00000000UL;
    unsigned long r;

    if (eq (a & expo_mask, zero)) {             // subnormal or zero
        r = a << 1;                             // double significand
    } else if (eq (a & expo_mask, expo_mask)) { // INF, NaNs
        if (eq (a & ~sign_mask, expo_mask)) {   // INF
            r = a;
        } else {                                // NaN
            r = a | qnan_mark;                  // quieten SNaNs
        }
    } else {                                    // normal
        r = add (a, expo_lsb);                  // double by bumping exponent
        if (eq (r & expo_mask, expo_mask)) {    // overflow
            r = expo_mask;
        }
    }
    return r | (a & sign_mask);                 // result has sign of argument
}

float uint_as_float (unsigned long a)
{
    float r;
    memcpy (&r, &a, sizeof r);
    return r;
}

unsigned long float_as_uint (float a)
{
    unsigned long r;
    memcpy (&r, &a, sizeof r);
    return r;
}

int main (void)
{
    unsigned long res, ref, a = 0;
    do {
        res = fpmul_by_2 (a);
        ref = float_as_uint (2.0f * uint_as_float (a));
        if (res != ref) {
            printf ("error: a=%08lx  res=%08lx  ref=%08lx\n", a, res, ref);
            return EXIT_FAILURE;
        }
        a++;
    } while (a);
    printf ("test passed\n");
    return EXIT_SUCCESS;
}

@chux Right you are. Brain fart. Will fix now. — njuffa, Feb 15 '19 at 01:59
