I have an application where I continuously write to a block RAM at a slow clock rate (clk_a). Within each slow clock cycle I need to read three locations from the block RAM at a fast clock rate (clk_b) and use the values as operands in a math module; the result is written back to the block RAM on the next slow clock. The three locations are the address written at the posedge of the slow clock plus its two immediate neighbours (addr_a - 1 and addr_a + 1).

What is an efficient way to synthesize this? My best attempt so far uses a small counter (triplet) running at the fast clock rate that steps through the addresses, but I end up running out of logic; it looks like Yosys does not infer the RAM properly. What is a good strategy for this?

Here is what I have:

module myRam2 (
  input clk_a,
  input clk_b,
  input we_a,
  input re_a,
  input [10:0] addr_a,
  input [10:0] addr_b,
  input [11:0] din_a,
  output [11:0] leftNeighbor,
  output [11:0] currentX,
  output [11:0] rightNeighbor
);
  parameter MEM_INIT_FILE2 = "";
  initial
    if (MEM_INIT_FILE2 != "")
      $readmemh(MEM_INIT_FILE2, ram2);

  reg [11:0] ram2 [0:2047];
  reg [1:0] triplet = 3;
  reg [10:0] old_addr_a;
  reg [11:0] temp;

  always @(posedge clk_a) begin
    ram2[addr_a] <= din_a;
  end

  always @(posedge clk_b)
    if (old_addr_a != addr_a) begin
      triplet <= 0;
      old_addr_a <= addr_a;
    end
    else if (triplet < 3) begin
      triplet <= triplet + 1;
    end

  always @(posedge clk_b) begin
    temp <= ram2[addr_a + (triplet - 1)];
  end

  always @(posedge clk_b) begin
    case (triplet)
      0: leftN <= temp;
      1: X <= temp;
      2: rightN <= temp;
    endcase
  end

  reg signed [11:0] leftN;
  reg signed [11:0] X;
  reg signed [11:0] rightN;

  assign leftNeighbor = leftN;
  assign currentX = X;
  assign rightNeighbor = rightN;

endmodule
ke10g
  • I tested this code and it infers memory fine using the Yosys I have here, and I don't see any obvious inference-related issues either (haven't checked that the logic is actually OK, though). – gatecat Jul 17 '20 at 09:24
  • The logic utilisation I see is `SB_CARRY 11`, `SB_DFFE 49`, `SB_LUT4 32`, `SB_RAM40_4K 6`, which looks very reasonable. – gatecat Jul 17 '20 at 09:24
  • Thanks David. I'm not sure how to interpret `SB_CARRY 11`, `SB_DFFE 49`, `SB_LUT4 32`, `SB_RAM40_4K 6`, or why it looks reasonable. Can you explain further? I guess I really am running out of logic with this design. I have two other similar RAM modules for different clock domains: would there be a way of doing all these reads at different rates with the same module? Would this be more economical logic-wise? – ke10g Jul 17 '20 at 09:34
  • For example, I can imagine scheduling all the reads for the different clock rates at the fast clock rate and distributing them to registers that then get read off at those slower rates... but as far as I can tell that might save some memory but not logic. – ke10g Jul 17 '20 at 09:36
  • 6 SB_RAM40_4Ks is the expected number of RAMs. 32 LUT4s and 49 flipflops is a fairly small amount of logic given the various storage elements and control here. – gatecat Jul 17 '20 at 10:00
  • The problem is likely elsewhere in the design. – gatecat Jul 17 '20 at 10:00
  • ok. Am I on the right track doing it this way? What else should I be watching out for (elsewhere in the design) to make this more efficient? Feel free to refer me to online resources that might help. – ke10g Jul 17 '20 at 10:16
  • variables should be defined before you use them. `leftN`, `X` and `rightN` are not. – Serge Jul 17 '20 at 11:02
  • thanks @Serge. You are right of course. I've tried initializing them to zero. But unfortunately that doesn't seem to make a difference for this case. – ke10g Jul 17 '20 at 12:16
  • Regarding the "most efficient" part of the question: If I am not mistaken you could drop the faster clock completely. Rearrange the ram (`reg ram2 [12*2048-1:0];`) and adapt the access parts (`ram2[addr_a*12+:12] <= din_a;` and `{leftN,X,rightN} <= ram2[addr_a*12+:3*12];`). This is based on the assumption that you will always read out the values next to the address and will fail if addr_a equals the end address. – Christian B. Jul 18 '20 at 10:27
  • @christian b. That is a very interesting idea and exactly the kind of thing I was fishing for. I'll give this a try. Thanks. – ke10g Jul 18 '20 at 13:05
  • what about using sub-banks in order to be accessed in the same clock cycle separately? – m4j0rt0m Jul 19 '20 at 19:32
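
For readers puzzled by the resource report in the comments above, the RAM count can be checked by hand. This is only a sketch, and it assumes that SB_RAM40_4K is the 4096-bit iCE40 block RAM primitive, a detail the thread does not state explicitly:

```latex
\[
\frac{2048 \times 12\ \text{bits}}{4096\ \text{bits per block}}
  = \frac{24576}{4096}
  = 6\ \text{blocks}
\]
```

Under that assumption, a 2048 x 12 memory fills exactly six blocks, which matches the reported `SB_RAM40_4K 6`.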

1 Answer

Regarding efficiency, the following approach should work, and it removes the need for a faster clock:

module myRam2 (
 input wire clk,
 input wire we,
 input wire re,
 input wire [10:0] addr_a,
 input wire [10:0] addr_b,
 input wire [11:0] din_a,
 output reg [11:0] leftNeighbor,
 output reg [11:0] currentX,
 output reg [11:0] rightNeighbor
);

reg [11:0] ram2 [2047:0] /* synthesis syn_ramstyle = "no_rw_check" */;

always @(posedge clk) begin
    if(we)  ram2[addr_a]                            <= din_a;
    if(re)  {leftNeighbor,currentX,rightNeighbor}   <= {ram2[addr_b-1],ram2[addr_b],ram2[addr_b+1]};
end

endmodule

The synthesis keyword has helped me in the past to increase the likelihood of correctly inferred RAM.

EDIT: removed a second example suggesting a 1D mapping. It turned out that at least Lattice LSE cannot deal with that approach. However, the first code snippet should work according to Active-HDL and Lattice LSE.
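
To check the single-clock module in simulation, a small testbench along the following lines may help. This is a minimal sketch, not part of the original answer: the testbench name `myRam2_tb`, the test pattern (address + 100) and the probed address 4 are all made up for illustration, and it assumes the `myRam2` module from this answer compiles in your simulator.

```verilog
`timescale 1ns/1ps
// Hypothetical testbench for the single-clock myRam2 module from this answer.
module myRam2_tb;
  reg         clk = 0;
  reg         we = 0, re = 0;
  reg  [10:0] addr_a = 0, addr_b = 0;
  reg  [11:0] din_a = 0;
  wire [11:0] leftNeighbor, currentX, rightNeighbor;
  integer     i;

  myRam2 dut (
    .clk(clk), .we(we), .re(re),
    .addr_a(addr_a), .addr_b(addr_b), .din_a(din_a),
    .leftNeighbor(leftNeighbor), .currentX(currentX),
    .rightNeighbor(rightNeighbor)
  );

  always #5 clk = ~clk;   // free-running clock, period 10 ns

  initial begin
    // Fill addresses 0..7 with (address + 100), one write per clock edge.
    we = 1;
    for (i = 0; i < 8; i = i + 1) begin
      addr_a = i;
      din_a  = i + 100;
      @(posedge clk); #1;
    end
    we = 0;

    // Read the triplet centred on address 4; outputs update on the next edge.
    re = 1;
    addr_b = 4;
    @(posedge clk); #1;
    $display("%0d %0d %0d", leftNeighbor, currentX, rightNeighbor); // 103 104 105
    $finish;
  end
endmodule
```

The `#1` after each clock edge simply moves the stimulus away from the sampling edge so that the simulation has no race between the testbench and the module.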

Christian B.
  • I have not been able to get it working in 1D, unfortunately. I'm on Yosys, and was getting an error about my read address width not being constant. I got around the verification error by writing the address * 12 to a register, using that reg - 1 as the read address and stating 36 as the width. But the resulting code fails to build. It stalls at PROC_RMDEAD. – ke10g Jul 19 '20 at 12:17
  • I've actually now started having better results (saving a lot of logic resources) with a "multipumped" design, where I use just one block RAM instead of three mirrored ones (one for each clock domain) and time-multiplex the read ports at a fast clock rate. This builds, and with far fewer resources, but I am running into timing issues. I am just using a fast clock that iterates through the read addresses in a case statement, checks whether the given address has changed, and if so updates a register with the value at that RAM address. – ke10g Jul 19 '20 at 12:36
  • ...but I think these registers get updated too late or something. Perhaps I should set up some kind of request handshake between the modules? I am not sure how to overcome these timing issues. – ke10g Jul 19 '20 at 12:39
  • Thank you for the feedback. First, I forgot a plus sign in the second code example, and after testing I realized that at least Lattice LSE does not like big 1D arrays and copes better with the 2D, memory-like mapping. However, the first example works according to the tests I have done. Could you elaborate on the "three mirrored" part? I was under the impression that the first example should fit your requirements, but maybe I am unaware of the full set of requirements. – Christian B. Jul 19 '20 at 13:13
  • Thanks for helping Christian. The initial problem with the code was not that it was not building, but that it was not building in my design due to overrunning logic resource budget. In my initial design I was writing to two buffers simultaneously: one for an lcd and another for a dac, read off at different clock rates. I then added a third module (the one I posted above) to do some math at a fast clock rate which would then be used to feed the results to both other buffers. It is here that things broke down (running out of logic). – ke10g Jul 19 '20 at 13:47
  • So now I'm trying to run all of these reads from the same block RAM. I have 5 in total: leftN, X, rightN (at the math clock speed), lcd_read, and dac read, both happening at their own rates. I have a fast clock iterating through these 5 address requests, only updating the register when the address in question has changed. But, timing issues... the test signals I'm using end up out of phase. Still, I have a feeling that this is promising, if I can only get the timing right. I've tried various double-flopping of input addresses etc., but to no avail. – ke10g Jul 19 '20 at 13:51
  • Do you really need to run DAC, logic (math) and LCD on different clocks, or can't you unify them into one? Or at least use derived clocks (e.g. multiplied or divided) to avoid clock domain crossing? Is there the possibility of reducing the size of the buffers to free up resources? – Christian B. Jul 19 '20 at 13:55
  • With the current multi-pumped setup, my resources are around 60 percent logic 55 percent ram. So it looks good in that regard. I just need to figure out how to get the timing right. The lcd is at a constant fraction of the speed of the buffer write, however the stuff going to the dac is on an independent counter that is variable length. – ke10g Jul 19 '20 at 14:04
  • and definitely the math needs to run at a much higher clock rate in order to work. There is a bunch of math that needs to happen between write clock cycles. – ke10g Jul 19 '20 at 14:05
  • To be clear though, in this multipumped setup the reads are all being performed at a fifth of the fast clock speed, since my counter cycles through them, effectively creating time-multiplexed read ports... I just don't understand why the reads come out mistimed like this, since they should only update when the requested address changes. – ke10g Jul 19 '20 at 14:07
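
One possible cause of the out-of-phase reads described in these comments is the one-cycle latency of a synchronous block RAM read: the data on the RAM output in a given fast cycle belongs to the address presented in the previous cycle, so it has to be steered by a delayed copy of the slot counter. The thread does not confirm that this is the actual problem, so treat the following as a sketch under that assumption. The module name `muxed_read`, all port names and the flattened address and data buses are hypothetical, the write is assumed to be on (or already synchronized to) the fast clock, and the request addresses are assumed to be stable in that domain.

```verilog
// Sketch only: names and port layout are hypothetical, not from the thread.
// One synchronous read port shared round-robin between N_REQ requesters.
module muxed_read #(
  parameter N_REQ = 5                      // must be <= 8 (slot is 3 bits)
) (
  input  wire                clk_fast,
  input  wire                we,
  input  wire [10:0]         waddr,
  input  wire [11:0]         wdata,
  input  wire [N_REQ*11-1:0] raddr_flat,   // requester i: bits [i*11 +: 11]
  output reg  [N_REQ*12-1:0] rdata_flat    // result i:    bits [i*12 +: 12]
);
  reg [11:0] mem [0:2047];
  reg [11:0] rd_q;           // registered RAM output (synchronous read)
  reg [2:0]  slot    = 0;    // slot whose address is presented this cycle
  reg [2:0]  slot_q  = 0;    // slot whose data is sitting in rd_q
  reg        valid_q = 0;    // low only until the first read has completed

  always @(posedge clk_fast) begin
    if (we) mem[waddr] <= wdata;

    // Present the address of the current slot to the RAM.
    rd_q    <= mem[raddr_flat[slot*11 +: 11]];
    slot_q  <= slot;         // remember which slot this read belongs to
    valid_q <= 1'b1;
    slot    <= (slot == N_REQ-1) ? 3'd0 : slot + 3'd1;

    // rd_q now holds the data for slot_q, not for slot.
    if (valid_q) rdata_flat[slot_q*12 +: 12] <= rd_q;
  end
endmodule
```

The design choice here is to update every result register once per round instead of only when an address changes; that avoids the change-detect logic at the cost of a worst-case delay of N_REQ + 1 fast cycles before a new address is reflected in its result.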