bitwise stochastic rounding

Question

I have this piece of C code that does stochasting rounding of a binary64 value to a binary32. Problem is I don't quite fully understand the code. I know it operates directly on the bits of a floating-point number, but I can't grasp what's happening. Could you please share some insight with me?

float function(double x){
  uint64_t temp = *(uint64_t*)&x;
  uint32_t r = (rand() * (0xFFFFFFFF/RAND_MAX)) % 0x1FFFFFFF;
  temp += r;
  temp = temp & 0xFFFFFFFFE0000000;

  return (float)*(double *)&temp;
}

What do the bitmasks represent? (my intuition tells me it has to do with how the exponent and mantissa are represented in binary format but i can't visualise it)
Why is the random variable r calculated like that?
How would an interation through the code look like?

Seems platform-specific, but please expand to an example that compiles so that we can look at the machine code. — 500 - Internal Server Error, Feb 23 '21 at 10:23
`0xFFFFFFFFE0000000` covers 1 sign bit + 11 exponent bits + 23 mantissa bits of the `binary64`, where the last is the number of mantissa bits in a `binary32`. So except for overflow/underflow, the final value of `temp` should convert losslessly into a `binary32`. `0x1FFFFFFF` covers the 29 low-order mantissa bits of the `binary64` number, they are used for rounding. — njuffa, Feb 23 '21 at 10:26
The `r` thing just generates a random number between 0 and 0x1FFFFFFF -- the parens generate some mess, the modulo limits it to the range. Most implementations of `rand` are pretty crap, those are not going to be very good random numbers. — , Feb 23 '21 at 10:26
This kind of type punning idiom: `uint64_t temp = *(uint64_t*)&x;` invokes undefined behavior. I would suggest using `memcpy()` instead; modern compilers know how to optimize this away. — njuffa, Feb 23 '21 at 10:30
@njuffa you're explaination is very insightful, thanks a lot. Similarly, thank you to all that commented — notgrumpy, Feb 23 '21 at 12:42

Eric Postpischil · Accepted Answer · 2021-02-23T18:26:04.777

uint64_t temp = (uint64_t)&x;

This is a bad attempt to get the bits that represent the double x. It is bad because it violates C’s aliasing rule (C 2018 6.5 7). Correct code would be uint64_t temp; memcpy(&temp, &x, sizeof temp); or uint64_t temp = (union { double d; uint64_t u; }) { x } .d;. The former copies the bytes of x into temp, and the latter uses a compound literal to create a temporary object that is a union, which it uses to reinterpret the bits. Both of these are supported by the C standard.

uint32_t r = (rand() * (0xFFFFFFFF/RAND_MAX)) % 0x1FFFFFFF;

* (0xFFFFFFFF/RAND_MAX)) attempts to scale the result of rand to the interval [0, FFFFFFFF₁₆]. It may do so imperfectly. Then % 0x1FFFFFFF reduces this to the interval [0, 1FFFFFFF₁₆). Note the closing ) versus ]—this is a half-open interval that does not include 1FFFFFFF₁₆. There are some issues here:

This may be a typo; & 0x1FFFFFFF would have cleanly extracted the low 29 bits, producing a result in the fully closed interval [0, 1FFFFFFF₁₆]. Using % has a different result without apparent mathematical purpose and forces a time-consuming division.
With either % or &, there is no apparent reason for scaling to FFFFFFFF₁₆ first; one might have gone directly for the desired final interval.
This produces only positive results; the number will only ever be increased or unchanged in magnitude, never decreased. That might be desired, but it is unclear why. The lack of documentation on this and other points suggests a lack of code quality.

temp += r;

This adds the random number to the low bits of the double. Some of the time, it will cause a carry into the upper bits. (If the upper bits are all 1s, this can also carry into the exponent field.)

temp = temp & 0xFFFFFFFFE0000000;

This clears the low 29 bits. In the IEEE-754 binary32 and binary64 formats commonly used for float and double, the float significand has 24 bits (with 23 encoded in the main significand field), and the double significand has 53 bits (52 encoded in the main significand field), so the difference is 29. Thus, clearing the low 29 bits in the encoding of a double will produce a number exactly representable as a float, if the exponent is within float range.

The goal of clearing these bits is likely to prevent a second rounding upward during the conversion to float, coming below. The add in the previous line, temp += r;, may have caused a carry into the upper bits of the significand, so the intent is likely to ensure the number is only ever increased by one unit, not two.

return (float)*(double *)&temp;

As with the initial line above, this is a bad attempt to reinterpret the bits as a double. (After which it is cast to float, which is unnecessary from standard C alone since the operand of a return statement is automatically converted to the return type of the function, but, if strict code checking is used, it could silence a warning about a narrowing conversion.) Correct code would be memcpy(&x, &temp, sizeof x); return x; or return (union { uint64_t u; double d }) { temp } .u;.

was a pleasure to read your comment. I now fully understand what's happening, and am able to better fiddle with the code. Thank you very much for taking the time to give such a comprehensive explanation. Many kudos!!!! — notgrumpy, Feb 23 '21 at 12:50

bitwise stochastic rounding

1 Answers1