I have designed a matrix-vector multiplier with a systolic array architecture, and I finally got the simulation to work. Now that I want to synthesize the design, it seems that the data-flow control block (the always block) is not synthesizable. I think this is because it uses for loops with a variable number of iterations. Could you please give me some tips to make it synthesizable?
My design has 64 processing elements (PEs), fixed as 8 by 8, that perform the multiply-accumulate (MAC) operations for the matrix multiplication, and it uses a subset or all of them depending on the input dimensions (up to 64 * 64). It gets the dimensions of the inputs (the matrix and the vector) from an external CPU. For example, for an (M * N) matrix with M = 5 and N = 5 and an (N * 1) vector, the result is an (M * 1) vector and M_val PEs are used, that is, 5. I use an activation signal to activate M_val PEs.
EDIT: With the following code I get the message below from Design Compiler:
Warning: /local/home/synth/systolic.sv:46: Out of bounds bit select W_reg[64], valid bounds are [0:63]. (ELAB-312)
Error: /local/home/synth/systolic.sv:45: Loop exceeded maximum iteration limit. (ELAB-900)
*** Presto compilation terminated with 1 errors. ***
When I change the M_val upper bounds of the for loops to M and change m = cycle-N_val+1 to m = 1, it synthesizes, but it does not simulate correctly (it does not produce the right result for the multiplication).
Here is my code:
module systolic #(parameter DW = 8,
                  // fixed! not meant to be changed from outside
                  parameter M = 8,
                  parameter N = 8)
(
    input clk,
    input reset,
    input reg [15:0] M_val, // number of rows
    input reg [15:0] N_val, // number of columns
    input start_mult,
    input [DW-1:0] W_i [0:M*N-1],
    input [DW-1:0] X_i [0:N-1],
    output [2*DW:0] Y_o [0:M-1],
    output reg mult_done
);
    reg [7:0] cycle;            // counts the cycles of the multiplication process
    reg [DW-1:0] W_reg[0:M-1];  // regfile to hold W matrix elements
    reg [DW-1:0] X_reg;         // register to hold the X vector element for each cycle
    integer m;
    // W_i is stored in ROW-MAJOR order
    always @(posedge clk) begin
        if (!reset) begin
            cycle <= 8'd0;
            mult_done <= 1'b0;
            X_reg <= 8'd0;
            for (m = 0; m < M; m = m + 1) begin
                W_reg[m] <= 8'd0;
            end
        end
        else if (start_mult) begin
            if (cycle == (M_val + N_val)) begin // the number of cycles needed for the multiplication
                mult_done <= 1'b1;
                cycle <= 8'd0;
                for (m = 0; m < M_val; m = m + 1) begin // tried changing M_val -> M
                    W_reg[m] <= 8'd0;
                end
            end else if (cycle < N_val) begin // N_val is the number of times we have to shift X values
                X_reg <= X_i[cycle];
                for (m = 0; m < M_val; m = m + 1) begin // tried changing M_val -> M
                    W_reg[m] <= W_i[(cycle-m) + m*N_val];
                end
            end else begin // cycle >= N_val: X gets zeros, its elements have been shifted to the last PE
                X_reg <= 8'd0;
                // after N cycles the first PE is done processing, so the m index starts from 1,
                // i.e. we are feeding W elements to the PEs other than the first one.
                for (m = cycle-N_val+1; m < M_val; m = m + 1) begin // synthesizes if M_val -> M and m = cycle-N_val+1 -> m = 1
                    W_reg[m] <= W_i[(cycle-m) + m*N_val];
                end
            end
            cycle <= cycle + 8'd1;
        end
    end
    wire [DW-1:0] Ws[1:0][0:M-1];
    wire [DW-1:0] Xs [0:M];
    wire [2*DW:0] Ys [0:M-1];
    reg [M-1:0] activate_reg;
    wire activate_pe [M-1:0];
    // at first all the PEs are activated.
    always@(posedge reset or posedge start_mult) begin
        if(!reset)
            activate_reg = 64'hFFFF_FFFF_FFFF_FFFF;
        else if (M_val != 64)
            activate_reg = (activate_reg >> M - M_val);
        else
            activate_reg = 64'hFFFF_FFFF_FFFF_FFFF;
    end
    genvar i,j;
    generate
        for (i = 0; i < M; i = i + 1) begin: PE_activators
            assign activate_pe[i] = activate_reg[i];
        end
        for (i = 0; i < M; i = i + 1) begin: Weights
            assign Ws[0][i] = W_reg[i];
        end
        assign Xs[0] = X_reg;
        for (i = 0; i < 8; i = i + 1) begin: ROWs
            for (j = 0; j < 8; j = j + 1) begin: COLs
                PE #(DW)
                pe (
                    .clk(clk),
                    .reset(reset),
                    .activate(activate_pe[i*8+j]),
                    .w_i(Ws[0][i*8+j]),
                    .x_i(Xs[i*8+j]),
                    .w_o(), // This can be float
                    .x_o(Xs[i*8+j+1]),
                    .mac(Ys[i*8+j])
                );
            end
        end
    endgenerate
    assign Y_o = Ys;
endmodule
module PE #(parameter DW = 8)
(
    input clk,
    input reset,
    input activate,
    input [DW-1:0] w_i,
    input [DW-1:0] x_i,
    output reg [DW-1:0] w_o,
    output reg [DW-1:0] x_o,
    output reg [2*DW:0] mac
);
    wire [2*DW:0] multiply = w_i * x_i;
    always @(posedge clk) begin
        if(!reset) begin
            w_o <= {DW{1'b0}};
            x_o <= {DW{1'b0}};
            mac <= {(2*DW +1){1'b0}};
        end
        else begin
            if(activate == 1) begin
                w_o <= w_i;
                x_o <= x_i;
                mac <= mac + multiply;
            end
        end
    end
endmodule
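For comparison, here is a minimal sketch of a constant-bound form of the W_reg loops in the `systolic` module above. It assumes the common rule that synthesis tools unroll for loops at elaboration and therefore need bounds that are constants: every loop runs over 0..M-1, and the runtime limits (M_val, N_val, cycle) become `if` conditions inside the loop body. The module name `w_feed_sketch`, taking `cycle` as an input, and the extra range check on the W_i index are simplifications made for illustration, not part of the original design, and the sketch has not been run through Design Compiler.

```systemverilog
// Hypothetical sketch: only the W_reg feeding logic, with constant loop bounds.
module w_feed_sketch #(parameter DW = 8,
                       parameter M  = 8,
                       parameter N  = 8)
(
    input               clk,
    input               reset,            // active low, as in the original code
    input               start_mult,
    input      [7:0]    cycle,            // assumed to be driven by the original cycle counter
    input      [15:0]   M_val,            // number of rows in use
    input      [15:0]   N_val,            // number of columns in use
    input      [DW-1:0] W_i   [0:M*N-1],  // row-major W
    output reg [DW-1:0] W_reg [0:M-1]
);
    integer m;

    always @(posedge clk) begin
        if (!reset) begin
            for (m = 0; m < M; m = m + 1)
                W_reg[m] <= {DW{1'b0}};
        end
        else if (start_mult) begin
            if (cycle == (M_val + N_val)) begin
                // Same clearing as the original, but the bound is the constant M
                // and the runtime limit M_val is a guard.
                for (m = 0; m < M; m = m + 1)
                    if (m < M_val)
                        W_reg[m] <= {DW{1'b0}};
            end
            else begin
                // "m + N_val > cycle" covers both original branches:
                //   cycle <  N_val : true for every m >= 0 (loop started at m = 0)
                //   cycle >= N_val : same as m >= cycle - N_val + 1
                // The last condition keeps the W_i index inside [0 : M*N-1].
                for (m = 0; m < M; m = m + 1)
                    if ((m < M_val) &&
                        (m + N_val > cycle) &&
                        ((m * N_val + cycle - m) < M*N))
                        W_reg[m] <= W_i[m * N_val + cycle - m];
            end
        end
    end
endmodule
```

The guards are what separate this from simply replacing M_val with M and starting the last loop at m = 1. Without them, rows that should keep their old value (m >= M_val, or m below cycle-N_val+1) are also overwritten, which may explain why that variant synthesizes but simulates incorrectly.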
So for a (4*4) matrix W and a (4*1) vector X,
W = {w44,w43,w42,w41, w34,w33,w32,w31, w24,w23,w22,w21, w14,w13,w12,w11};
X = {x4, x3, x2, x1};
the dataflow/Timing for 4 PEs would be like this (please let me know if you need more info):