-1

I'm working on a Digital Design project (Verilog) involving IEEE double precision floating point standard.

I have a query regarding IEEE floating-point number representation. In IEEE floating-point representation, numbers are stored in normalized format, which means the leading significand bit is assumed to be 1 and is not stored (it is also known as the hidden bit).

When a floating-point number is de-normalized, the leading significand bit is taken to be 0, and the exponent is brought to 0 by shifting the binary point to the left.

My query is about the de-normalization procedure. For example, the exponent can be as high as 120; in that case, how do we treat the fraction bits (43 bits for IEEE double precision)?

Do we do one of the following?

1) Increase the width of the fraction? i.e. 43 fraction bits plus the de-normalization shift, e.g. 43 + 120 = 163 bits?

2) Simply shift the bits and keep the width of the fraction as it is? i.e. discard the excess bits?

Displayname

3 Answers

1

The only unnormalized numbers in IEEE binary floating point are those with zero in the exponent field, corresponding to the smallest possible exponent. They keep the normal fraction width, so as the number of leading zeros increases, the precision decreases. That is a good trade-off for tiny numbers, making underflow smoother.
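
As a worked illustration of the point above, the two cases can be written out for double precision. The symbols are labels introduced here for illustration: s is the sign bit, e is the 11-bit exponent field, and f is the 52-bit fraction field read as an unsigned integer.

```latex
% Normal numbers: exponent field 1 <= e <= 2046, hidden bit is 1
x = (-1)^{s} \times 2^{\,e - 1023} \times \left(1 + \frac{f}{2^{52}}\right)

% Denormalized (subnormal) numbers: exponent field e = 0, hidden bit is 0
x = (-1)^{s} \times 2^{-1022} \times \frac{f}{2^{52}}
```

In the second case the exponent stays fixed at -1022 and the fraction width stays at 52 bits, so every extra leading zero in f costs one bit of precision.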

Patricia Shanahan
0

Two comments.

First, a double-precision (64-bit) floating-point number has 52 explicit bits for the mantissa, plus one implied bit (not 43 bits as you stated).

Second, only values with an all-bits-zero exponent are interpreted as denormalized. This allows precision to gracefully degrade as values approach zero.
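
The two comments above could be sketched in Verilog roughly as follows. This is a minimal sketch, not a complete decoder: the module name `fp64_unpack` and all port names are hypothetical, and infinities and NaNs (exponent field all ones) are not given any special treatment.

```verilog
// Hypothetical helper (names are illustrative, not from the question):
// splits an IEEE 754 double into its fields and makes the hidden bit explicit.
module fp64_unpack (
    input  wire [63:0] d,
    output wire        sign,
    output wire [10:0] exp_field,      // raw 11-bit exponent field
    output wire        is_denormal,    // exponent field zero, fraction nonzero
    output wire [52:0] significand,    // hidden bit + 52 explicit fraction bits
    output wire [10:0] exp_effective   // biased exponent used for scaling
);
    wire exp_zero = (d[62:52] == 11'd0);

    assign sign          = d[63];
    assign exp_field     = d[62:52];
    assign is_denormal   = exp_zero && (d[51:0] != 52'd0);
    // Hidden bit is 1 for normal numbers, 0 when the exponent field is all zeros.
    assign significand   = {~exp_zero, d[51:0]};
    // Denormals scale like exponent field 1 (2^-1022), not 0.
    assign exp_effective = exp_zero ? 11'd1 : d[62:52];
endmodule
```

The significand is always 53 bits wide here; a denormal simply has a 0 in the top position, so the width never needs to grow.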

Steve Hollasch
0
Here is the code I implemented for denormalization of an IEEE 754 double-precision number:

module denorm_orig(D_in, Dnorm);

input      [63:0] D_in;    // IEEE 754 double-precision input
output reg [63:0] Dnorm;   // top 9 bits of the shifted significand, zero-extended

reg [63:0] fract_U1;
reg [10:0] exponent_U1;

always @(*) begin

    // Significand with the hidden bit made explicit: 1 + 52 + 11 padding bits = 64 bits
    fract_U1    = {1'b1, D_in[51:0], 11'b0};
    // Right-shift amount; assumes an exponent field of at most 1022 (magnitude below 1)
    exponent_U1 = 11'd1022 - D_in[62:52];

    // Logarithmic right shift: stage k shifts by 2^k when exponent_U1[k] is set,
    // otherwise the value is passed through unchanged
    fract_U1 = exponent_U1[5] ? {32'b0, fract_U1[63:32]} : fract_U1;
    fract_U1 = exponent_U1[4] ? {16'b0, fract_U1[63:16]} : fract_U1;
    fract_U1 = exponent_U1[3] ? { 8'b0, fract_U1[63:8 ]} : fract_U1;
    fract_U1 = exponent_U1[2] ? { 4'b0, fract_U1[63:4 ]} : fract_U1;
    fract_U1 = exponent_U1[1] ? { 2'b0, fract_U1[63:2 ]} : fract_U1;
    fract_U1 = exponent_U1[0] ? { 1'b0, fract_U1[63:1 ]} : fract_U1;

    // A shift of 64 or more moves every bit out of the 64-bit register
    if (|exponent_U1[10:6])
        fract_U1 = 64'b0;

    Dnorm = fract_U1[63:55];

end

endmodule
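
A small testbench along these lines could be used to sanity-check the module in a simulator. It is only a sketch: the testbench name `denorm_orig_tb` is hypothetical, and the three inputs are the double-precision encodings of 0.5, 0.75 and 0.25. Since the output is the top 9 bits of a 0.64 fixed-point fraction, each input x should produce x * 512.

```verilog
// Hypothetical testbench for denorm_orig (not part of the original post).
module denorm_orig_tb;
    reg  [63:0] D_in;
    wire [63:0] Dnorm;

    denorm_orig dut (.D_in(D_in), .Dnorm(Dnorm));

    initial begin
        D_in = 64'h3FE0000000000000; #1;   // 0.5  -> shift by 0
        $display("D_in=%h Dnorm=%0d", D_in, Dnorm);   // Dnorm=256
        D_in = 64'h3FE8000000000000; #1;   // 0.75 -> shift by 0
        $display("D_in=%h Dnorm=%0d", D_in, Dnorm);   // Dnorm=384
        D_in = 64'h3FD0000000000000; #1;   // 0.25 -> shift by 1
        $display("D_in=%h Dnorm=%0d", D_in, Dnorm);   // Dnorm=128
        $finish;
    end
endmodule
```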
Displayname