I was wondering which of the following designs is faster, i.e., can operate at a higher Fmax:

    -- Pipelined
    if crd_h = scan_end_h(vt)-1 then
      rst_h <= '1';
    end if;

    if crd_v = scan_end_v(vt) then
      rst_v <= '1';
    end if;

    if rst_h = '1' then
      crd_h <= 0;
      rst_h <= '0';

      if rst_v = '1' then
        crd_v <= 0;
        rst_v <= '0';
      else
        crd_v <= crd_v + 1;
      end if;
    else
      crd_h <= crd_h + 1;
    end if;

Here the loop-end conditions are checked in the "previous" cycle and applied in the following one through the rst feedback signals.

Compared to the less pipelined approach:

    -- NOT Pipelined
    if crd_h = scan_end_h(vt) then
      crd_h <= 0;

      if crd_v = scan_end_v(vt) then
        crd_v <= 0;
      else
        crd_v <= crd_v + 1;
      end if;
    else
      crd_h <= crd_h + 1;
    end if;

The idea in the first implementation is to decouple the arithmetic in the comparison from the arithmetic in the increment. On the other hand, in the second implementation both operations can be done in parallel, and the result of the comparison will MUX the result of the increment. Will that be as fast as having the MUX control bit ready from the previous cycle (as in the first implementation)?

Thanks!

Ran
  • If by "faster" you mean higher Fmax, pipelining will give you a better result. However, the tradeoff is increased resource usage. For the example given, I would be surprised if pipelining gave you any tangible improvement in Fmax, so it is probably not worth it. The easiest way to confirm would be to synthesise the code using the two methods you have listed. – gsm Feb 13 '17 at 10:09
  • See `RTL Hardware Design Using VHDL`, Pong Chu, 9.4 PIPELINED DESIGN: *Pipeline is an important technique to increase the performance of a system. The basic idea is to overlap the processing of several tasks so that more tasks can be completed in the same amount of time. If a combinational circuit can be divided into stages, we can insert buffers (i.e., registers) at proper places and convert the circuit into a pipelined design.* Where is your pipeline? High pixel clock rates typically result in coarser timing (syncs on powers of 2 clock counts, ...) to overcome Fmax limits. – Feb 13 '17 at 13:15
  • Assuming that your code above is in a clocked process, your pipelined code will be faster and, assuming a LUT-based design, not any bigger. If you are looking to squeeze out the last 1 to 3% of hardware area, you might try a down counter and detect the carry out of the counters as your rst_v. – Jim Lewis Feb 13 '17 at 16:47

1 Answer

To start with, 'faster' is not the best word to use, because it could be interpreted as 'throughput', 'latency', or 'Fmax'. These three goals might require different approaches.

Ultimately, whether you need to implement more pipelining or not should be driven by your design specification and constraints. If you only need to run at 20 MHz, set up constraints for this, and see if your design passes timing. If it does, then there's not much point putting the effort into optimising the design.

Assuming your design does not meet timing, your FPGA implementation tool should be able to produce a timing report, and this should tell you which parts of your design are the limiting factor. You can then focus on optimising these sections of your design.

More generally, to understand whether a process will benefit from pipelining from an Fmax perspective, you need to understand the underlying building block, often known as a 'slice', that the FPGA tools are going to use to implement your design. In general, if a sequential function cannot fit inside one slice, it could benefit from pipelining. Whether or not the process 'fits' will largely be determined by the number of inputs it has. Note that for a process operating with n-bit data, it may be possible to describe it as n processes that each work with 1-bit data, reducing the number of inputs for the purposes of this analysis. Also note that some types of process, for example adders, can efficiently spread over several slices by making use of dedicated interconnect between the carry chains in two or more slices. Again, you need to understand in detail the building blocks available in your FPGA device.

You have not included any signal definitions, but it looks like your process has as inputs two counters, a reset, and two parameters in the form of scan_end_h and scan_end_v. I have no way to know how wide these are, but let's assume as an example that these are 12-bit values. Your process then has 4 * 12 = 48 inputs from the counters and parameters. I would not expect a function of this many inputs to fit into one slice, therefore you could probably achieve a higher Fmax using pipelining. Your idea of pipelining the counter comparisons looks like a good one; as pointed out in the comments, your best bet is to try this out, and see what the result is by looking at the implementation timing report.
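To make the comparison concrete, here is a minimal sketch of the pipelined version wrapped in a complete entity so that it can be synthesised on its own and checked against the timing report. The entity name, the port names, the 12-bit ranges, and the use of plain ports in place of the `scan_end_h(vt)` / `scan_end_v(vt)` lookups are all assumptions for illustration, since the question does not show the signal declarations. It also assumes the end values are at least 1 horizontally and stay constant while the counters run.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical wrapper: names and 12-bit ranges are assumptions,
-- not taken from the original design.
entity scan_counter_pipelined is
  port (
    clk        : in  std_logic;
    scan_end_h : in  natural range 1 to 4095;  -- last horizontal coordinate
    scan_end_v : in  natural range 0 to 4095;  -- last vertical coordinate
    crd_h_out  : out natural range 0 to 4095;
    crd_v_out  : out natural range 0 to 4095
  );
end entity;

architecture rtl of scan_counter_pipelined is
  signal crd_h : natural range 0 to 4095 := 0;
  signal crd_v : natural range 0 to 4095 := 0;
  signal rst_h : std_logic := '0';
  signal rst_v : std_logic := '0';
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Comparisons run one cycle early and only set registered flags.
      if crd_h = scan_end_h - 1 then
        rst_h <= '1';
      end if;

      if crd_v = scan_end_v then
        rst_v <= '1';
      end if;

      -- The wrap/increment choice depends only on the registered flags,
      -- so the comparators are not in the same path as the incrementers.
      if rst_h = '1' then
        crd_h <= 0;
        rst_h <= '0';

        if rst_v = '1' then
          crd_v <= 0;
          rst_v <= '0';  -- last assignment wins over the set above
        else
          crd_v <= crd_v + 1;
        end if;
      else
        crd_h <= crd_h + 1;
      end if;
    end if;
  end process;

  crd_h_out <= crd_h;
  crd_v_out <= crd_v;
end architecture;
```

Building the non-pipelined process into an entity with the same ports and running both through the implementation tool under the same clock constraint should show directly whether the registered flags shorten the critical path on your device.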

scary_jeff
  • Sure, I meant the highest Fmax achievable. And you're correct, it's around 12 bits, so one cannot assume a single slice. I guess I can check this by increasing the clock frequency again and again and seeing which design fails first (otherwise, I think, I have no guarantee that the implementation tools do their best). – Ran Feb 13 '17 at 12:34
  • @Ran see my second paragraph. Work out what frequency you need to support your screen resolution, set a constraint, and work towards it. 'highest achievable' might not be a useful goal. – scary_jeff Feb 14 '17 at 13:15