
I have designed a matrix-vector multiplier with a systolic array architecture, and I finally got the simulation to work. Now that I want to synthesize the design, it seems that the data-flow control block (the always block) is not synthesizable. I think this is because it uses for loops with a variable number of iterations. Could you please give me some tips to make it synthesizable?

My design has 64 (8 by 8, fixed) processing elements (PEs) that perform the multiply-accumulate (MAC) operations for the matrix multiplication, and I use a subset or all of them depending on the input dimensions (up to 64 * 64). It gets the dimensions of the inputs (the matrix and the vector) from an external CPU. For example, with an (M * N) matrix where M = 5 and N = 5 and an (N * 1) vector, the result is an (M * 1) vector and M_val PEs are used, that is, 5. I use an activation signal to activate M_val PEs.

EDIT: With the following code I get the message below from Design Compiler:

Warning: /local/home/synth/systolic.sv:46: Out of bounds bit select W_reg[64], valid bounds are [0:63]. (ELAB-312)
Error: /local/home/synth/systolic.sv:45: Loop exceeded maximum iteration limit. (ELAB-900)
*** Presto compilation terminated with 1 errors. ***

When I change the M_val upper bounds of the for loops to M and change m=cycle-N_val+1 to m = 1, it synthesizes, but it doesn't simulate correctly (it does not produce the right result for the multiplication).

Here is my code:

module systolic #(parameter DW = 8,

         // fixed! not meant to be change from outside
             parameter M = 8,
             parameter N = 8)
             (
                input clk,
                input reset,
                input reg[15:0] M_val,     // number of rows 
                input reg [15:0] N_val,    // number of columns
                input start_mult,

                input  [DW-1:0] W_i [0:M*N-1],
                input  [DW-1:0] X_i [0:N-1],
                
                output [2*DW:0] Y_o [0:M-1],
                output reg mult_done 
             );
              


reg [7:0] cycle;    // counts the cycles of multiplication process

reg  [DW-1:0] W_reg[0:M-1]; // regfile to hold W matrix elements
reg  [DW-1:0] X_reg;        // register to hold X vector elements at each cycle
integer m;

// ROW-MAJOR w_i
always @(posedge clk) begin
  if (!reset) begin
    cycle <= 8'd0;
    mult_done <= 1'b0;
    X_reg <= 8'd0;
    for(m = 0; m < M; m = m + 1) begin
      W_reg[m] <= 8'd0;
    end
  end
  else if (start_mult) begin
    if (cycle == (M_val + N_val)) begin   // the number of cycles needed for the multiplication
      mult_done <= 1'b1;
      cycle <= 8'd0;
      for(m = 0; m < M_val; m = m + 1) begin // if change to M_val -> M
        W_reg[m] <= 8'd0;
      end
    end else if (cycle < N_val) begin        // N_val is the number of times we have to shift X values
      X_reg <= X_i[cycle];
      for(m = 0; m < M_val; m = m + 1) begin // if change to M_val -> M
        W_reg[m] <= W_i[(cycle-m) + m*N_val];

      end
    end else begin // if (cycle >= N) X will get zeros, its elements has been shifted to the last PE 
      X_reg <= 8'd0;
      
      // after N cycles the first PE is done processing, so the m index starts from 1, 
      // or we are feeding W elements to the PEs other than the first one.
      
       for (m=cycle-N_val+1; m < M_val; m = m + 1) begin //if change M_val -> M && m=cycle-N_val+1 it it will synthesize
        W_reg[m] <= W_i[(cycle-m) + m*N_val];
      end
    end
    cycle <= cycle + 8'd1;
  end
end
  

wire  [DW-1:0] Ws[1:0][0:M-1];
wire  [DW-1:0] Xs [0:M];


wire [2*DW:0] Ys [0:M-1];

reg [M-1:0] activate_reg;
wire activate_pe [M-1:0];

// at first all the PEs are activated.
always@(posedge reset or posedge start_mult) begin
    if(!reset)
      activate_reg = 64'hFFFF_FFFF_FFFF_FFFF;
    else if (M_val != 64)
      activate_reg  = (activate_reg >> M - M_val);
    else 
      activate_reg = 64'hFFFF_FFFF_FFFF_FFFF;
end




genvar i,j;

generate
  
  for (i = 0; i < M; i = i + 1) begin: PE_activators
    assign activate_pe[i] = activate_reg[i];
  end 
  
  for (i = 0; i < M; i = i + 1) begin: Weights
    assign Ws[0][i] = W_reg[i];
  end 

  assign Xs[0] = X_reg;

  
  for (i = 0; i < 8; i = i + 1) begin: ROWs
    for (j = 0; j < 8; j = j + 1) begin: COLs
     PE #(DW)
         pe (
            .clk(clk),
            .reset(reset),
            .activate(activate_pe[i*8+j]),
            .w_i(Ws[0][i*8+j]),
            .x_i(Xs[i*8+j]),
            .w_o(),            // This can be float
            .x_o(Xs[i*8+j+1]),
            .mac(Ys[i*8+j])
         );
    end
  end
  
endgenerate

assign Y_o = Ys;

endmodule

module PE #(parameter DW = 8)
       (
        input clk,
        input reset,
        input activate,
        
        input  [DW-1:0]w_i,
        input  [DW-1:0]x_i,
        
        output reg  [DW-1:0] w_o,
        output reg  [DW-1:0] x_o,

        output reg [2*DW:0] mac
       );

wire  [2*DW:0] multiply = w_i * x_i;
always @(posedge clk) begin
  if(!reset) begin
    w_o <= {DW{1'b0}};
    x_o <= {DW{1'b0}};
    mac <= {(2*DW +1){1'b0}};
  end
  else begin
    if(activate == 1) begin
      w_o <= w_i;
      x_o <= x_i;
      mac <= mac + multiply;

    end
  end

end

endmodule

So for a (4*4) matrix W and a (4*1) vector X,

W = {w44,w43,w42,w41, w34,w33,w32,w31, w24,w23,w22,w21, w14,w13,w12,w11};

X = {x4, x3, x2, x1};

the dataflow/Timing for 4 PEs would be like this (please let me know if you need more info):

[Image: dataflow/timing diagram for 4 PEs]

engineer1155
  • *it seems that the data_flow control block is not synthesizable* -- what does it mean? Did you get synthesis tool errors? warning? what do they say? – Serge Mar 29 '23 at 23:10
  • As you stated, the tools don't want a variable number of loop iterations. The code loops over W_reg a maximum of M times. In the case where you want to loop less than M times, what do you want to happen with the other values that are not getting assigned in the loop? For example do you want them to stay the same? – Mikef Mar 30 '23 at 03:17
  • @Mikef Thanks for your time. The other values can be zero or left floating. Basically, I want hardware generated for 'M' iterations but to use only 'M_val' of them; if M_val is equal to M, then all of them are used and values are assigned to all of them. – engineer1155 Mar 30 '23 at 03:36

1 Answer


Change your loops to be a constant number of iterations and use an if statement to control what happens. E.g. change this:

        for (m=cycle-N_val+1; m < M_val; m = m + 1)
            W_reg[m] <= W_i[(cycle-m) + m*N_val];

to this:

        for (m = 0; m < M; m = m + 1)
            if ((m >= cycle-N_val+1) && (m < M_val))
                W_reg[m] <= W_i[(cycle-m) + m*N_val];

Or if you don't need the W_reg values to remain unchanged then just take the if out completely.
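
To show the same idea in isolation, here is a minimal sketch. It is not the asker's module: the module name `guarded_load`, the ports `m_first` and `w_next`, and the active-low `rst_n` are all made up for illustration, and the W_i index arithmetic is assumed to have been done already, outside this block. The loop bound is the parameter M, so the tool can unroll it, and the run-time window only decides which registers are enabled.

```systemverilog
// Hypothetical stand-alone module; names and widths are illustrative only.
module guarded_load #(
  parameter int DW = 8,
  parameter int M  = 8                        // compile-time maximum row count
) (
  input  logic                   clk,
  input  logic                   rst_n,       // active-low synchronous reset (assumption)
  input  logic [$clog2(M+1)-1:0] m_val,       // run-time row count, assumed <= M
  input  logic [$clog2(M+1)-1:0] m_first,     // first row to update in this cycle
  input  logic [DW-1:0]          w_next [0:M-1],
  output logic [DW-1:0]          w_reg  [0:M-1]
);
  always_ff @(posedge clk) begin
    // Constant bound: the loop unrolls into exactly M copies of the body.
    for (int m = 0; m < M; m++) begin
      if (!rst_n)
        w_reg[m] <= '0;
      else if (m >= m_first && m < m_val)     // run-time window acts as a write enable
        w_reg[m] <= w_next[m];
      // Rows outside the window keep their previous value.
    end
  end
endmodule
```

Each of the M registers gets its own enable comparison, which is the hardware the original variable-bound loop was describing, just written in a form the elaborator can unroll.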

Also note that while this code looks simple, it's creating lots of multiplexers, multipliers and other things that need to complete in a single clock cycle. You might need to break this down into a longer pipeline to get good performance out of it.

Justin N
  • Thank you for your answer. Could you please give some suggestions on how to make the pipeline longer? Design an FSM? – engineer1155 Mar 30 '23 at 03:48
  • The general idea would be to do the multiply and subtract, save the results in registers, then do the add, save that in a register, then do the mux, etc. For things like the multiply, you may want to use primitive blocks that do that, or write your code in a way that they can be inferred. – Justin N Mar 30 '23 at 05:16
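
As a rough illustration of the "register between steps" idea applied to the multiply and accumulate in the PE, the sketch below splits them into two stages. The module name `pe_pipelined`, the signals `prod_q` and `valid_q`, and the active-low `rst_n` are assumptions for this example and do not appear in the question. The `w_o` output is left out because the question leaves it unconnected.

```systemverilog
// Hypothetical two-stage variant of the PE; a sketch, not a drop-in replacement.
module pe_pipelined #(parameter int DW = 8) (
  input  logic          clk,
  input  logic          rst_n,      // active-low synchronous reset (assumption)
  input  logic          activate,
  input  logic [DW-1:0] w_i,
  input  logic [DW-1:0] x_i,
  output logic [DW-1:0] x_o,
  output logic [2*DW:0] mac
);
  logic [2*DW-1:0] prod_q;          // stage 1 register: holds the product
  logic            valid_q;         // high when prod_q holds a product to accumulate

  always_ff @(posedge clk) begin
    if (!rst_n) begin
      x_o     <= '0;
      prod_q  <= '0;
      valid_q <= 1'b0;
      mac     <= '0;
    end else begin
      valid_q <= activate;
      if (activate) begin
        x_o    <= x_i;              // pass x on to the next PE, as in the original
        prod_q <= w_i * x_i;        // stage 1: multiply only
      end
      if (valid_q)
        mac <= mac + prod_q;        // stage 2: accumulate the registered product
    end
  end
endmodule
```

The result in `mac` arrives one cycle later than in the single-stage PE, so a done flag such as `mult_done` would need to be delayed by one cycle as well.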