I would like to emulate various n-bit binary floating-point formats, each with a specified e_max and e_min, with p bits of precision. I would like these formats to emulate subnormal numbers, faithful to the IEEE-754 standard.
Naturally, my search has lead me to the MPFR library, being IEEE-754 compliant and able to support subnormals with the mpfr_subnormalize()
function. However, I've ran into some confusion using mpfr_set_emin()
and mpfr_set_emax()
to correctly set up a subnormal-enabled environment. I will use IEEE double-precision as an example format, since this is the example used in the MPFR manual:
http://mpfr.loria.fr/mpfr-current/mpfr.html#index-mpfr_005fsubnormalize
mpfr_set_default_prec (53);
mpfr_set_emin (-1073); mpfr_set_emax (1024);
The above code is from the MPFR manual in the above link - note that neither e_max nor e_min are equal to the expected values for double
. Here, p is set to 53, as expected of the double
type, but e_max is set to 1024, rather than the correct value of 1023, and e_min is set to -1073; well below the correct value of -1022. I understand that setting the exponent bounds too tightly results in overflow/underflow in intermediate computations in MPFR, but I have found that setting e_min exactly is critical for ensuring correct subnormal numbers; too high or too low causes a subnormal MPFR result (updated with mprf_subnormalize()
) to differ from the corresponding double
result.
My question is how should one decide which values to pass to mpfr_set_emax()
and (especially) mpfr_set_emin()
, in order to guarantee correct subnormal behaviour for a floating-point format with exponent bounds e_max and e_min? There doesn't seem to be any detailed documentation or discussion on the matter.
With all my thanks,
James.
EDIT 30/07/16: Here is a small program which demonstrates the choice of e_max and e_min for single-precision numbers.
#include <iostream>
#include <cmath>
#include <float.h>
#include <mpfr.h>
using namespace std;
int main (int argc, char *argv[]) {
cout.precision(120);
// IEEE-754 float emin and emax values don't work at all
//mpfr_set_emin (-126);
//mpfr_set_emax (127);
// Not quite
//mpfr_set_emin (-147);
//mpfr_set_emax (128);
// Not quite
//mpfr_set_emin (-149);
//mpfr_set_emax (128);
// These float emin and emax values work in subnormal range
mpfr_set_emin (-148);
mpfr_set_emax (128);
cout << "emin: " << mpfr_get_emin() << " emax: " << mpfr_get_emax() << endl;
float f = FLT_MIN;
for (int i = 0; i < 3; i++) f = nextafterf(f, INFINITY);
mpfr_t m;
mpfr_init2 (m, 24);
mpfr_set_flt (m, f, MPFR_RNDN);
for (int i = 0; i < 6; i++) {
f = nextafterf(f, 0);
mpfr_nextbelow(m);
cout << i << ": float: " << f << endl;
//cout << i << ": mpfr: " << mpfr_get_flt (m, MPFR_RNDN) << endl;
mpfr_subnormalize (m, 1, MPFR_RNDN);
cout << i << ": mpfr: " << mpfr_get_flt (m, MPFR_RNDN) << endl;
}
mpfr_clear (m);
return 0;
}