Verilog synthesis implementing Cholesky decomposition

Question

I am implementing Cholesky decomposition in Verilog, following the Python code below:

from math import sqrt

def cholesky(A):
    n = len(A)

    L = [[0.0] * n for i in xrange(n)]

    for i in xrange(n):
        for j in xrange(i+1):
            tmp_sum = sum(L[i][k] * L[j][k] for k in xrange(j))

            if (i == j): # Diagonal element
                L[i][j] = sqrt(A[i][i] - tmp_sum)
            else:
                L[i][j] = (1.0/L[j][j] * (A[i][j] - tmp_sum))
    return L
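
As a sanity check of the reference algorithm, the sketch below is a Python 3 port of the same loop structure (`range` instead of `xrange`), run on a small symmetric positive-definite matrix and multiplied back together. The names `cholesky_ref` and `times_own_transpose` and the example matrix are illustrative assumptions, not part of the original post.

```python
from math import sqrt

def cholesky_ref(A):
    """Python 3 port of the reference loops (illustrative name)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            # The sum is rebuilt from scratch for every (i, j) pair.
            tmp_sum = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:  # diagonal element
                L[i][j] = sqrt(A[i][i] - tmp_sum)
            else:
                L[i][j] = 1.0 / L[j][j] * (A[i][j] - tmp_sum)
    return L

def times_own_transpose(L):
    """Return L * L^T so the factorization can be checked against A."""
    n = len(L)
    return [[sum(L[i][k] * L[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Example input chosen so that every intermediate value is exact.
A = [[4.0, 12.0, -16.0],
     [12.0, 37.0, -43.0],
     [-16.0, -43.0, 98.0]]
L = cholesky_ref(A)
print(L)  # [[2.0, 0.0, 0.0], [6.0, 1.0, 0.0], [-8.0, 5.0, 3.0]]
print(times_own_transpose(L) == A)  # True
```

A test vector like this one, with integer-valued factors, may also be convenient as a first stimulus for a fixed-point hardware version.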

I tried a simple version with a 3x3 input. Since it requires division and square root, I also wrote a divider using the standard method (copied from the internet with some modification) and a sqrt using the Babylonian method (a variant of Newton's method). Here they are:

Division

module Div(in1, in2, out);
input [23:0] in1, in2;
output reg [23:0] out;
// reg [23:0] remainder;

reg [47:0] scaled_divider, temp_remainder, temp_result;
integer i;

always @ (in1 or in2) begin
    scaled_divider = {1'b0, in2, 23'h0};
    temp_remainder = {24'h0, in1};

    for (i=0; i<24; i=i+1) begin
        temp_result = temp_remainder - scaled_divider;

        if (temp_result[47-i]) begin    // Negative result, quotient set to '0'
            out[23-i] = 1'b0;
        end else begin
            out[23-i] = 1'b1;
            temp_remainder = temp_result;
        end 

        scaled_divider = scaled_divider >> 1;
    end 

    // remainder =  temp_remainder[23:0];
end 

endmodule
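
A software model can help check a divider like this before synthesis. The sketch below is a minimal Python model of unsigned shift-and-subtract (restoring) division at the same 24-bit width; the function name `restoring_div` and the `width` parameter are assumptions made for illustration. Python integers do not wrap, so the model tests the sign of the trial subtraction directly instead of inspecting a bit of the result.

```python
def restoring_div(in1, in2, width=24):
    """Bit-level model of an unsigned shift-and-subtract divider.

    Returns (quotient, remainder). Illustrative sketch only.
    """
    mask = (1 << width) - 1
    remainder = in1 & mask
    divider = (in2 & mask) << (width - 1)  # divisor aligned with the top quotient bit
    quotient = 0
    for i in range(width):
        trial = remainder - divider
        if trial >= 0:
            # Subtraction fits: this quotient bit is 1 and the remainder shrinks.
            quotient |= 1 << (width - 1 - i)
            remainder = trial
        # Otherwise the bit stays 0 and the remainder is kept ("restored").
        divider >>= 1
    return quotient, remainder

print(restoring_div(100, 7))  # (14, 2)
```

For a nonzero divisor the result matches Python's `divmod`. With a divisor of zero every trial subtraction succeeds, so the model returns an all-ones quotient.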

Sqrt

module Sqrt_newton(in, out);

// 3 iterations
input [23:0] in; 
output reg [23:0] out;

Div div1(in, out, tmp_inout1);
Div div2(in, tmp_inout2, tmp_inout3);
Div div3(in, tmp_inout4, tmp_inout5);


always @ (in)
begin
    out[0] = 1'b1;
    out[1] = 1'b1;
    out[2] = 1'b1;
    out[3] = 1'b1;
    out[4] = 1'b1;
    out[5] = 1'b1;
    out[6] = 1'b1;
    out[7] = 1'b1;
    tmp_inout2 = (out + tmp_inout1) >> 1;
    tmp_inout4 = (tmp_inout2 + tmp_inout3) >> 1;
    out = (tmp_inout4 + tmp_inout5) >> 1;
end 
endmodule
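
To see what a fixed number of Babylonian steps gives, here is a minimal Python sketch of the integer iteration x <- (x + n / x) >> 1. The function name `babylonian_fixed` and its parameters are illustrative assumptions. The default starting guess 0xFF corresponds to setting the low eight bits of the estimate, and the default of three iterations corresponds to three divide stages in a row.

```python
from math import isqrt  # Python 3.8+

def babylonian_fixed(n, iterations=3, guess=0xFF):
    """Run the integer Babylonian step a fixed number of times (sketch)."""
    x = guess
    for _ in range(iterations):
        if x == 0:  # stop rather than divide by zero if the estimate collapses
            break
        x = (x + n // x) >> 1
    return x

for n in (16, 10000, 65025):
    print(n, babylonian_fixed(n), isqrt(n))
# 16 31 4
# 10000 100 100
# 65025 255 255
```

With this starting guess, three steps are enough for inputs whose root is near 255 but not for small inputs such as 16, which needs `iterations=7` to reach 4. A hardware version with a fixed stage count would probably need either more stages or a better initial estimate.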

And here's my 3x3 Cholesky decomposition code:

module cholesky_template(clk, rst, g_input, e_input, o);
    input clk, rst;
    input [143:0] g_input;
    input e_input;
    output [215:0] o;
    reg [23:0] L [0:2][0:2];
    reg [23:0] A [0:2][0:2] ;

    assign o = {
        L[0][0], L[0][1], L[0][2],
        L[1][0], L[1][1], L[1][2],
        L[2][0], L[2][1], L[2][2]
        };

    reg [23:0] tmp_A00_minus_sum;
    reg [23:0] tmp_A11_minus_sum;
    reg [23:0] tmp_A22_minus_sum;

    reg [23:0] tmp_A10_minus_sum;
    reg [23:0] tmp_A20_minus_sum;
    reg [23:0] tmp_A21_minus_sum;

    reg [23:0] div_1_L00;
    reg [23:0] div_1_L11;

    Sqrt sqrt0(tmp_A00_minus_sum, L[0][0]);
    Div div0(1'b1, L[0][0], div_1_L00);
    Sqrt sqrt1(tmp_A11_minus_sum, L[1][1]);
    Div div1(1'b1, L[1][1], div_1_L11);
    Sqrt sqrt2(tmp_A22_minus_sum, L[2][2]);

    always @ (posedge clk or posedge rst) begin
        if (rst) begin
            L[0][0] = 1'b0;
            L[0][1] = 1'b0;
            L[0][2] = 1'b0;
            L[1][0] = 1'b0;
            L[1][1] = 1'b0;
            L[1][2] = 1'b0;
            L[2][0] = 1'b0;
            L[2][1] = 1'b0;
            L[2][2] = 1'b0;
            tmp_sum = 1'b0;
            A[0][0] ={8'b00000000, g_input[15:0]};
            A[0][1] =24'b0; // will not be used
            A[0][2] =24'b0; // will not be used
            A[1][0] ={8'b00000000, g_input[63:48]};
            A[1][1] ={8'b00000000, g_input[79:64]};
            A[1][2] =24'b0; // will not be used
            A[2][0] ={8'b00000000, g_input[111:96]};
            A[2][1] ={8'b00000000, g_input[127:112]};
            A[2][2] ={8'b00000000, g_input[143:128]};
        end else begin
            tmp_A00_minus_sum = A[0][0] - tmp_sum;

            tmp_A10_minus_sum = A[1][0] - tmp_sum;
            L[1][0] = div_1_L00 * tmp_A10_minus_sum;

            tmp_sum = tmp_sum + L[1][0] * L[1][0];

            tmp_A11_minus_sum = A[1][1] - tmp_sum;

            tmp_A20_minus_sum = A[2][0] - tmp_sum;
            L[2][0] = div_1_L00 * tmp_A20_minus_sum;            

            tmp_sum = tmp_sum + L[2][0] * L[1][0];

            tmp_A21_minus_sum = A[2][1] - tmp_sum;
            L[2][1] = div_1_L11 * tmp_A21_minus_sum;

            tmp_sum = tmp_sum + L[2][0] * L[2][0];
            tmp_sum = tmp_sum + L[2][1] * L[2][1];

            tmp_A22_minus_sum = A[2][2] - tmp_sum;
        end
    end
endmodule

Some explanation of the code: I could not get for-loops to work, so I unrolled them into statements like tmp_A10_minus_sum = A[1][0] - tmp_sum;. It should be fairly easy to map back to the Python code. The 8 zeros prepended to each entry of A are there because I plan to "upgrade" the code to use 24 bits, so that it becomes more accurate. This is not the problem.
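
For comparison, the Python reference can be unrolled by hand for the 3x3 case. The sketch below is one way to write it; the function name `cholesky_3x3_unrolled` and the example matrix are illustrative assumptions. In the reference code `tmp_sum` is recomputed from scratch for every (i, j) pair, so each entry here subtracts only its own products.

```python
from math import sqrt

def cholesky_3x3_unrolled(A):
    """Lower-triangular factor of a 3x3 matrix, both loops unrolled (sketch)."""
    L = [[0.0] * 3 for _ in range(3)]
    L[0][0] = sqrt(A[0][0])                                # (0,0): empty sum
    L[1][0] = A[1][0] / L[0][0]                            # (1,0): empty sum
    L[1][1] = sqrt(A[1][1] - L[1][0] * L[1][0])            # (1,1): k = 0
    L[2][0] = A[2][0] / L[0][0]                            # (2,0): empty sum
    L[2][1] = (A[2][1] - L[2][0] * L[1][0]) / L[1][1]      # (2,1): k = 0
    L[2][2] = sqrt(A[2][2] - L[2][0] * L[2][0]
                   - L[2][1] * L[2][1])                    # (2,2): k = 0, 1
    return L

A = [[4.0, 12.0, -16.0],
     [12.0, 37.0, -43.0],
     [-16.0, -43.0, 98.0]]
print(cholesky_3x3_unrolled(A))
# [[2.0, 0.0, 0.0], [6.0, 1.0, 0.0], [-8.0, 5.0, 3.0]]
```

The six assignments also make the data dependencies visible: each line only uses entries of L computed on earlier lines, which is the order a hardware schedule would have to respect.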

Three-state bus warnings

The problem is that when I compile it with Synopsys DC, it outputs warnings like this:

"Warning: In design 'cholesky_template', three-state bus 'tmp_A00_minus_sum[23]' has non three-state driver 'tmp_A00_minus_sum_reg[23]/Q'. (LINT-34)"

This is DC's description of LINT-34:

NAME LINT-34 (warning) In design '%s', three-state bus '%s' has non three-state driver '%s'.

DESCRIPTION Synopsys libraries contain descriptions of three-state driving pins on components. Synopsys tools classify a net as a three-state net if it is driven by at least one pin that has this three-state attribute. Normally, if there are multiple drivers on such nets, it is assumed that all driving pins should be three-state drivers, for correct operation of the three-state bus. This warning message indicates a situation where at least one non-three-state driver appears on a three-state net.

WHAT_NEXT Verify that this is what you have intended for the given net. If the non-three-state driver pin specified in the message is really on a three-state driver in your ASIC technology, verify that the technology library description is correct.

Why are there three-state attributes in the design? How do I correct them?

Target library contains no replacement for register

This is another warning I get, for example:

Warning: Target library contains no replacement for register 'A_reg[1][0][7]' (FFGEN). (TRANS-4)

Here's my library code. I wonder whether it has anything to do with the three-state bus warning. If so, is there a reference for designing the appropriate cells?

library(HML){
cell(AND)  {
  area: 6;
  pin(A) {
      direction: input;
      capacitance: 1;
  }    
  pin(B) {
      direction: input;
      capacitance: 1;  
    }
  pin(Z) {
    direction: output;
    function: "A B";
    timing() {
        intrinsic_rise: 0.48;
        intrinsic_fall: 0.77;
        rise_resistance: 0.1443;
        fall_resistance: 0.0523;
        slope_rise: 0.0;
        slope_fall: 0.0;
        related_pin: "A";   
        }
    timing() {
        intrinsic_rise: 0.48;
        intrinsic_fall: 0.77;
        rise_resistance: 0.1443;
        fall_resistance: 0.0523;
        slope_rise: 0.0;
        slope_fall: 0.0;
        related_pin: "B";   
        }
    }
  }
cell(OR) {
  area:  6;
  pin(A) {
    direction: input;
    capacitance: 1;
  }
  pin(B) {
    direction: input;
    capacitance: 1;
  }
  pin(Z) {
    direction: output;
    function: "A+B";
    timing() {
        intrinsic_rise: 0.28;
        intrinsic_fall: 0.85;
        rise_resistance: 0.1443;
        fall_resistance: 0.0589;
        slope_rise: 0.0;
        slope_fall: 0.0;
        related_pin: "A";   
    }
    timing() {
        intrinsic_rise: 0.28;
        intrinsic_fall: 0.85;
        rise_resistance: 0.1443;
        fall_resistance: 0.0589;
        slope_rise: 0.0;
        slope_fall: 0.0;
        related_pin: "B";   
    }
  }
}
cell(XOR) {
  area: 0;
  pin(A) {
    direction: input;
    capacitance: 1;
  }
  pin(B) {
    direction: input;
    capacitance: 1;
  }
  pin(Z) {
    direction: output;
    function: "A^B";
    timing() {
        intrinsic_rise: 0.28;
        intrinsic_fall: 0.85;
        rise_resistance: 0.1443;
        fall_resistance: 0.0589;
        slope_rise: 0.0;
        slope_fall: 0.0;
        related_pin: "A";   
    }
    timing() {
        intrinsic_rise: 0.28;
        intrinsic_fall: 0.85;
        rise_resistance: 0.1443;
        fall_resistance: 0.0589;
        slope_rise: 0.0;
        slope_fall: 0.0;
        related_pin: "B";   
    }
  }
}
cell(NAND) {
  area: 6;
  pin(A) {
    direction: input;
    capacitance: 1;
  }
  pin(B) {
    direction: input;
    capacitance: 1;
  }
  pin(Z) {
    direction: output;
    function: "(A B)'";
    timing() {
        intrinsic_rise: 0.28;
        intrinsic_fall: 0.85;
        rise_resistance: 0.1443;
        fall_resistance: 0.0589;
        slope_rise: 0.0;
        slope_fall: 0.0;
        related_pin: "A";   
    }
    timing() {
        intrinsic_rise: 0.28;
        intrinsic_fall: 0.85;
        rise_resistance: 0.1443;
        fall_resistance: 0.0589;
        slope_rise: 0.0;
        slope_fall: 0.0;
        related_pin: "B";   
    }
  }
}
cell(NOR) {
  area: 6;
  pin(A) {
    direction: input;
    capacitance: 1;
  }
  pin(B) {
    direction: input;
    capacitance: 1;
  }
  pin(Z) {
    direction: output;
    function: "(A+B)'";
    timing() {
        intrinsic_rise: 0.28;
        intrinsic_fall: 0.85;
        rise_resistance: 0.1443;
        fall_resistance: 0.0589;
        slope_rise: 0.0;
        slope_fall: 0.0;
        related_pin: "A";   
    }
    timing() {
        intrinsic_rise: 0.28;
        intrinsic_fall: 0.85;
        rise_resistance: 0.1443;
        fall_resistance: 0.0589;
        slope_rise: 0.0;
        slope_fall: 0.0;
        related_pin: "B";   
    }
  }
}

cell(XNOR) {
  area: 6;
  pin(A) {
    direction: input;
    capacitance: 1;
  }
  pin(B) {
    direction: input;
    capacitance: 1;
  }
  pin(Z) {
    direction: output;
    function: "(A^B)'";
    timing() {
        intrinsic_rise: 0.28;
        intrinsic_fall: 0.85;
        rise_resistance: 0.1443;
        fall_resistance: 0.0589;
        slope_rise: 0.0;
        slope_fall: 0.0;
        related_pin: "A";   
    }
    timing() {
        intrinsic_rise: 0.28;
        intrinsic_fall: 0.85;
        rise_resistance: 0.1443;
        fall_resistance: 0.0589;
        slope_rise: 0.0;
        slope_fall: 0.0;
        related_pin: "B";   
    }
  }
}

cell(DFF) {
  area : 9;
  pin(D) {
    direction : input;
    capacitance : 1;
    timing() {
      timing_type : setup_rising;
      intrinsic_rise : 0.85;
      intrinsic_fall : 0.85;
      related_pin : "CLK";
    }
    timing() {
      timing_type : hold_rising;
      intrinsic_rise : 0.4;
      intrinsic_fall : 0.4;
      related_pin : "CLK";
    }
  }
    pin(I) {
    direction : input;
    capacitance : 1;
    timing() {
      timing_type : setup_rising;
      intrinsic_rise : 0.85;
      intrinsic_fall : 0.85;
      related_pin : "CLK";
    }
    timing() {
      timing_type : hold_rising;
      intrinsic_rise : 0.4;
      intrinsic_fall : 0.4;
      related_pin : "CLK";
    }
  }
  pin(CLK) {
    direction : input;
    capacitance : 1;
  }
  pin(RST) {
    direction : input;
    capacitance : 2;
  }

  ff("IQ", "IQN") {
    next_state : "D";
    clocked_on : "CLK";
    clear : "RST (I')";
    preset: "RST I";
    clear_preset_var1: L;
    clear_preset_var2: H;
  }

  pin(Q) {
    direction : output;
    function : "IQ";
    internal_node : "Q";
    timing() {
      timing_type : rising_edge;
      intrinsic_rise : 1.19;
      intrinsic_fall : 1.37;
      rise_resistance : 0.1458;
      fall_resistance : 0.0523;
      related_pin : "CLK";
    }
    timing() {
      timing_type : clear;
      timing_sense : positive_unate;
      intrinsic_fall : 1.29;
      fall_resistance : 0.0516;
      related_pin : "RST";
    }
    timing() {
      timing_type : preset;
      timing_sense : positive_unate;
      intrinsic_fall : 1.29;
      fall_resistance : 0.0516;
      related_pin : "I";
    }
  }
}
cell(IV){
  area:0;
  cell_footprint : "iv";
  pin(A) {
    direction: input;
    capacitance: 1;
  }
  pin(Z) {
    direction: output;
    function : "A'";
    timing() {
      intrinsic_rise : 0.38;
      intrinsic_fall : 0.15;
      rise_resistance : 0.1443;
      fall_resistance : 0.0589;
      slope_rise : 0.0;
      slope_fall     : 0.0;
      related_pin : "A";
    }
  }
}
}

Sorry for the long post. I hope I asked my questions clearly.

I'm not sure if this is related to your issue, but your `Sqrt_newton` looks strange with its asynchronous feedback. Plus, I'd be surprised if it compiles with `tmp_inout2` and `tmp_inout4` not being declared. Its always block sensitivity list is incomplete; that wouldn't be an issue if you skip simulation and go straight to synthesis, but logic bugs are easier to catch in simulation. A poorly written sensitivity list leads to behavior mismatches between simulation and synthesis. - Greg, Mar 19 '17 at 22:01
@Greg Thank you. Is asynchronous feedback not allowed? I've changed it to synchronous feedback and it seems to work. - xtt, Mar 20 '17 at 01:11
Asynchronous feedback is tricky. To work, such loops need to be self-stabilizing (i.e. settle into a non-oscillating state). Improperly balanced gate propagation delay, RC parasitics, temperature/voltage variation, and anything else that can impact timing can throw an asynchronous feedback design into unexpected and/or oscillating output if they are not accounted for in the design. Synchronous design doesn't have this challenge, which is why it is more common. - Greg, Mar 20 '17 at 01:25
Thank you. But I see `Div` is also asynchronous; will that be a problem? How do I design a synchronous divider? - xtt, Mar 20 '17 at 02:55
`Div` on its own is a linear chain; it doesn't feed back onto itself. It does take time to resolve. That time may be longer than a clock cycle, which is something to look out for and will show up in your static timing analysis. - Greg, Mar 20 '17 at 05:10

score 0 · Answer 1 · answered Oct 07 '17 at 16:20

This is late, but I just came across it. I'm not sure about the three-state warnings, but I just ran into your FFGEN error myself. The synthesizer compiles your code into a list of gates using only the parts available in the target library. When your HDL describes behavior that no library part can implement (in my case, a flip-flop (FF) with an asynchronous reset), the synthesizer doesn't know which part to use when it is going through and (GEN)erating parts, hence the error FFGEN. The synthesizer will, however, put down a placeholder for that register describing the inputs, output, and clock signal of that element, which you can see if you look through your netlist. Mine looks like this:

\**FFGEN** \inst_clk_divider/cnt_reg[1] ( .next_state(n299), .clocked_on(clk), .force_00(1'b0), .force_01(rst), .force_10(1'b0), .force_11(1'b0), .Q(\inst_clk_divider/cnt[1] ) );
