
I have determined that the following is the critical path in my SHA-1 implementation:

h0 <= h0 + A;
h1 <= h1 + B;
h2 <= h2 + C;
h3 <= h3 + D;
h4 <= h4 + E;

When I comment this section of the code out, my Fmax is around 300 MHz, but with it, it drops to around 100 MHz. Am I correct to assume the feedback loop is causing the Fmax to drop, or could there be other reasons? What are some strategies to mitigate this problem, and how can I implement them?

Sugihara
  • Is this in an `always @(posedge clk)` block? I would be surprised if an addition was the limiting factor. – Morgan Mar 08 '15 at 09:22

2 Answers


More code, as well as specific information about the architecture you are synthesizing to, is needed for a more accurate answer.

As for your last question, the standard strategy is to minimize the amount of combinational logic between registers, that is, to pipeline the design. For your code, that would look like this:

t0 <= h0 + A;
t1 <= h1 + B;
t2 <= h2 + C;
t3 <= h3 + D;
t4 <= h4 + E;
h0 <= t0;
h1 <= t1;
h2 <= t2;
h3 <= t3;
h4 <= t4;

But for just an addition, I doubt there is any improvement here: the combinational block (that is, the adder) is still there between registers.

Let's assume that your target architecture cannot implement fast large adders, only fast small ones, and that your registers are very wide. Then you could split each large addition into smaller parallel additions that fit the available resources, although the synthesizer probably already does that by itself.

reg [127:0] a,b,c;

always @(posedge clk)
  a <= b + c;    // one 128-bit carry chain in a single cycle

Becomes:

reg [63:0] ah,al;
reg cy;
reg [127:0] a,b,c;

always @(posedge clk) begin
  {cy,al} <= b[63:0] + c[63:0];    // low half, carry registered
  ah <= b[127:64] + c[127:64];     // high half, computed in parallel
  a <= {ah+cy, al};                // assembled one clock later: 'a' now lags by an extra cycle
end
mcleod_ideafix

Looks like you are implementing this algorithm, which generates a, b, c, d, and e in a loop that iterates 80 times. If so, you need to be aware that synthesis tools unroll loops. That means if you try to compute a, ..., e in a combinational block, all 80 iterations of the loop will be unrolled and synthesized into a single clock cycle, which creates a lot of adders, a very long datapath, and a very large area. I suspect this is the reason your frequency degrades.
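
As a hypothetical illustration of what unrolling does (not your actual code; the acc register and w array are just stand-ins for the round state and message schedule), consider:

integer i;
reg [31:0] acc;
reg [31:0] w [0:79];

always @(posedge clk) begin : unrolled
  reg [31:0] tmp;
  tmp = acc;
  for (i = 0; i < 80; i = i + 1)
    tmp = tmp + w[i];   // unrolled into 80 cascaded 32-bit adders
  acc <= tmp;           // all 80 adders sit between acc's output and its input
end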

This can be solved by pipelining, as mcleod_ideafix mentioned, but a simpler solution without pipelining is to create a state machine that executes the loop partially over multiple clock cycles. For example, just by introducing one state bit, you can compute iterations 0 to 39 in one clock, store the partial results, and then compute 40 to 79 in the next clock; a sketch of this is shown below. This way the entire algorithm takes 2 clock cycles, and you use half of the adder area.
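
A minimal sketch of that idea, with the round body again abstracted to a simple accumulation (phase, acc, and w are hypothetical names):

integer i;
reg        phase;         // 0: iterations 0-39, 1: iterations 40-79
reg [31:0] acc;
reg [31:0] w [0:79];

always @(posedge clk) begin : half_per_clock
  reg [31:0] tmp;
  tmp = acc;
  for (i = 0; i < 40; i = i + 1)
    tmp = tmp + w[40*phase + i];   // only 40 adders per clock now
  acc   <= tmp;                    // partial result stored between phases
  phase <= ~phase;                 // do the second half on the next clock
end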

You can extend this by computing each iteration of the loop in its own clock cycle. Use a counter to count the iterations and compute the partial a, ..., e; when the counter hits 80, compute h0 to h4. This way the entire algorithm takes 80 clock cycles: the classic trade-off between latency, area, and frequency.
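
A minimal sketch of the counter-based version (the module and port names are made up, and the h0..h4 initialization, the real message schedule, and the per-round f/k selection are omitted for brevity):

module sha1_rounds (
    input  wire        clk,
    input  wire        start,
    input  wire [31:0] w,       // message-schedule word for the current round
    output reg         done
);
  reg [6:0]  round;             // 0..80
  reg [31:0] a, b, c, d, e;
  reg [31:0] h0, h1, h2, h3, h4;

  // simplified round logic: the real f and k depend on 'round'
  wire [31:0] f = (b & c) | (~b & d);
  wire [31:0] k = 32'h5A827999;
  wire [31:0] t = {a[26:0], a[31:27]} + f + e + k + w;   // rotl(a,5) + ...

  always @(posedge clk) begin
    done <= 1'b0;
    if (start) begin
      round <= 7'd0;
      a <= h0; b <= h1; c <= h2; d <= h3; e <= h4;
    end else if (round < 7'd80) begin
      // one round per clock: only a few adders between registers
      e <= d;
      d <= c;
      c <= {b[1:0], b[31:2]};   // rotl(b,30)
      b <= a;
      a <= t;
      round <= round + 7'd1;
    end else if (round == 7'd80) begin
      // the accumulation from the question, done once per block
      h0 <= h0 + a;  h1 <= h1 + b;  h2 <= h2 + c;
      h3 <= h3 + d;  h4 <= h4 + e;
      done  <= 1'b1;
      round <= round + 7'd1;    // park until the next 'start'
    end
  end
endmodule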

Ari