Subnormal numbers in different precisions with MPFR

Question

I would like to emulate various n-bit binary floating-point formats, each with a specified e_max and e_min, with p bits of precision. I would like these formats to emulate subnormal numbers, faithful to the IEEE-754 standard.

Naturally, my search has lead me to the MPFR library, being IEEE-754 compliant and able to support subnormals with the mpfr_subnormalize() function. However, I've ran into some confusion using mpfr_set_emin() and mpfr_set_emax() to correctly set up a subnormal-enabled environment. I will use IEEE double-precision as an example format, since this is the example used in the MPFR manual:

http://mpfr.loria.fr/mpfr-current/mpfr.html#index-mpfr_005fsubnormalize

mpfr_set_default_prec (53);
mpfr_set_emin (-1073); mpfr_set_emax (1024);

The above code is from the MPFR manual in the above link - note that neither e_max nor e_min are equal to the expected values for double. Here, p is set to 53, as expected of the double type, but e_max is set to 1024, rather than the correct value of 1023, and e_min is set to -1073; well below the correct value of -1022. I understand that setting the exponent bounds too tightly results in overflow/underflow in intermediate computations in MPFR, but I have found that setting e_min exactly is critical for ensuring correct subnormal numbers; too high or too low causes a subnormal MPFR result (updated with mprf_subnormalize()) to differ from the corresponding double result.

My question is how should one decide which values to pass to mpfr_set_emax() and (especially) mpfr_set_emin(), in order to guarantee correct subnormal behaviour for a floating-point format with exponent bounds e_max and e_min? There doesn't seem to be any detailed documentation or discussion on the matter.

With all my thanks,

James.

EDIT 30/07/16: Here is a small program which demonstrates the choice of e_max and e_min for single-precision numbers.

#include <iostream>
#include <cmath>
#include <float.h>
#include <mpfr.h>

using namespace std;

int main (int argc, char *argv[]) {
    cout.precision(120);

    // IEEE-754 float emin and emax values don't work at all
    //mpfr_set_emin (-126);
    //mpfr_set_emax (127);

    // Not quite
    //mpfr_set_emin (-147);
    //mpfr_set_emax (128);

    // Not quite
    //mpfr_set_emin (-149);
    //mpfr_set_emax (128);

    // These float emin and emax values work in subnormal range
    mpfr_set_emin (-148);
    mpfr_set_emax (128);

    cout << "emin: " << mpfr_get_emin() << "    emax: " << mpfr_get_emax() << endl;

    float f = FLT_MIN;
    for (int i = 0; i < 3; i++) f = nextafterf(f, INFINITY);

    mpfr_t m;
    mpfr_init2 (m, 24);
    mpfr_set_flt (m, f, MPFR_RNDN);

    for (int i = 0; i < 6; i++) {
        f = nextafterf(f, 0);
        mpfr_nextbelow(m);
        cout << i << ": float: " << f << endl;
        //cout << i << ":  mpfr: " << mpfr_get_flt (m, MPFR_RNDN) << endl;
        mpfr_subnormalize (m, 1, MPFR_RNDN);
        cout << i << ":  mpfr: " << mpfr_get_flt (m, MPFR_RNDN) << endl;
    }

    mpfr_clear (m);
    return 0;
}

Hmmm, Does _e_min_ reflect `DBL_MIN` (min pos. normal value) or `DBL_TRUE_MIN` (min pos. subnormal value)? — chux - Reinstate Monica, Jul 29 '16 at 18:36
Off-by-1 issues are common with exponent range specs concerning FP formats. Depends on the model and signicand format `x.xxx...xxx` or `.xxxx...xxx`. — chux - Reinstate Monica, Jul 29 '16 at 18:38
I don't follow. It is unsurprising that the configured *e_min* affects the behavior of `mpfr_subnormalize()`, as for any given small input, that is what determines how many significant bits of precision there will be in the result, and indeed, whether the result is subnormal (from MPFR's perspective) at all. Why do you suppose that you would set MPFR's exponent bounds to anything different from the bounds applicable to the format you want to use? — John Bollinger, Jul 29 '16 at 19:05
@chux I let *e_min* be the minimum value that the exponent portion of the floating-point format is able to take - e.g. -126, for single-precision floats. @John This is what I'm wondering. However, say I emulate single-precision IEEE floats, if I set the MPFR *e_min* to the minimum exponent value for `float` (-126), the MPFR subnormals do not match `float` subnormals. They only match when I set MPFR *e_min* to -148, bizarrely. I assume that the extra lower exponent values are for internal MPFR computation purposes, but, in general, for an *n*-bit format, I don't know what *e_min* should be. — James Paul Turner, Jul 29 '16 at 20:49
An answer to this question, courtesy of Vincent Lefèvre, has been given at: https://www.researchgate.net/post/MPFR_In_general_what_should_the_values_of_e_max_and_e_min_be_to_get_correct_subnormal_numbers_in_different_precisions — James Paul Turner, Jul 31 '16 at 00:37

vinc17 · Accepted Answer · 2019-07-30T15:25:42.570

I'm copying my answer I gave on ResearchGate (with a link to the mpfr_subnormalize documentation):

There are different conventions to express significands and the associated exponents. IEEE 754 chooses to consider significands between 1 and 2, while MPFR (like the C language, see DBL_MAX_EXP for instance) chooses to consider significands between 1/2 and 1 (for practical reasons related to multiple precision). For instance, the number 17 is represented as 1.0001·2⁴ in IEEE 754 and as 0.10001·2⁵ in MPFR. As you can see, this means that exponents are increased by 1 in MPFR compared to IEEE 754, hence e_max = 1024 instead of 1023 for double precision.

Concerning the choice of e_min for double precision, one needs to be able to represent 2⁻¹⁰⁷⁴ = 0.1·2⁻¹⁰⁷³, so that e_min needs to be at most −1073 (as in MPFR, all numbers are normalized).

As documented, the mpfr_subnormalize function considers that the subnormal exponent range is from e_min to e_min + PREC(x) − 1, so that for instance, you need to set e_min = −1073 to emulate IEEE double precision.

Thanks Vincent; that helps a lot. Just to be double sure that I understand, IEEE-754 exponents are offset by +1 in MPFR, and if I were to want a *p*-bit precision floating-point format, with a normalised exponent range lower bound of *e*, and correct subnormals, I would need to `mpfr_set_emin` with *e - (p - 1)*? Cheers. — James Paul Turner, Aug 01 '16 at 15:26
@JamesTurner Basically you need to consider the least positive subnormal of the system. If it is 2^e, then you need to call `mpfr_set_emin` with e+1. For instance, for double precision, the least positive subnormal is 2^(−1022−53+1) = 2^(−1074), so that you need to call `mpfr_set_emin` with −1074+1 = −1073. That's emin−p+1 with emin expressed in the MPFR/C convention, or emin−p+2 with emin expressed in the IEEE-754 binary convention. — vinc17, Aug 02 '16 at 20:15

Subnormal numbers in different precisions with MPFR

1 Answers1