I am trying to write Verilog code to implement the census transform on a 640x480-pixel image. I wrote the complete code in behavioral form, but it is taking too long to synthesize. I understand that the reason might be the long register arrays and loops, but I am not sure how to handle that.
Here is my code:

module test(in,clk,out
    );
    input clk;
    input [7:0] in;
    output  [119:0]out;
    reg [7:0]matrix[0:639][0:479];
    //reg [119:0]win[0:10][0:10];
    reg [9:0] i = 0;
    reg [8:0] j = 0;
    reg [12:0] count = 0;
    integer p,q = 6;
    integer a,b = -6;
    reg [119:0]censusTransformedImage;
    reg [119:0]census=0;
    always@ (posedge clk)
    begin
        if(count<=6411)
            count = count+1;
    end
    always @ (posedge clk )
    begin
        if(i<=639)
        begin
            matrix[i][j]=in;
            i=i+1;
        end
        else if(i==639 && j<=479)
        begin
            i=0;
            j=j+1;
        end
        //end
    end

    always @ (posedge clk)
    begin
        if(count > 6411)
        begin
            if(p<=634)
            begin
                if(q<=479)
                begin
                    //census = 0;
                    if(a<=6)
                    begin
                        if(b<=6)
                        begin
                            if(~(a==0 && b==0))
                                census=census<<1;
                            if (matrix[p+a][q+b] > matrix[p][q])
                                census=census+1;
                            b = b+1;
                        end
                        else
                        begin
                            b=-6;
                            a=a+1;
                        end 
                    end
                    else
                    begin
                        censusTransformedImage=census;
                        census=0;
                        a=-6;
                        q=q+1;
                    end
                end
                else
                begin
                    q=0;
                    p=p+1;
                end
            end
        end
    end
   assign out = censusTransformedImage;
endmodule
Greg
  • The window size of the census transform is 11x11. – Utkarsh Jain Nov 02 '15 at 18:47
  • Have you actually simulated this code? There is no way that it actually does what you want. So many problems, starting with: every element of `matrix` is going to be equal to `in` on every clock tick. – nguthrie Nov 02 '15 at 22:07
  • Thanks @nguthrie, that was a blunder of mine and I have edited the code accordingly. But the problem is still there: it is taking too long to synthesize. – Utkarsh Jain Nov 03 '15 at 07:05
  • You've probably violated some coding guidelines in your synthesis tool. Please refer to http://stackoverflow.com/questions/7565095/how-can-i-know-if-my-code-is-synthesizable-verilog – e19293001 Nov 03 '15 at 08:07
  • Read the documentation that comes with whatever synthesis tool you are going to be using. - Martin Thompson – e19293001 Nov 03 '15 at 08:11
  • The code works fine when I reduce the size of the image considerably, so I am assuming the main trouble is the size of `[7:0]matrix[0:639][0:479]`. But I have no idea how to handle this. I am using Xilinx ISE 14.7. @e19293001 – Utkarsh Jain Nov 03 '15 at 08:46
  • You might need to separate your code into two parts: memory unit and processing unit. – e19293001 Nov 03 '15 at 08:53
  • @e19293001 I tried running the code with processing unit commented but the problem still persists. Actually it is unable to synthesize the **matrix** of this size. – Utkarsh Jain Nov 03 '15 at 11:53
  • You should be using non-blocking assignments (`<=`) in your `always@(posedge)` block, not blocking assignments (`=`). – wilcroft Nov 03 '15 at 19:05

1 Answer

The synthesizer is likely to try to implement your matrix as distributed memory, that is, out of the LUTs and flip-flops in the slices of the FPGA. This has to be avoided: the matrix holds 640 x 480 x 8 = 2,457,600 bits, so you would exhaust nearly all the slice resources of your FPGA device just to implement that piece of memory.

Instead, design your matrix memory as an independent module, with one input address (coordinates i,j), one 8-bit output data, and one 8-bit input data. Something like:

module matrix (
  input clk,
  input wire [9:0] i,
  input wire [8:0] j,
  input write_enable,
  input wire [7:0] din,
  output reg [7:0] dout
  );

  reg [7:0] M[0:307199]; // your 640x480 matrix
  wire [18:0] addr;

  assign addr = j*640+i; // row j (0..479) times the row length, plus
                         // column i (0..639), so addr stays within
                         // 0..307199; the synthesizer should not need
                         // an actual multiplier for the constant 640
  always @(posedge clk) begin
    if (write_enable == 1'b1)
      M[addr] <= din;
    dout <= M[addr];
  end
endmodule

The key point here is that on every clock cycle there is only one access to the matrix register (M), and both input and output data are registered. This way, the synthesizer will be able to implement this huge register with block RAM instead of distributed RAM, freeing up a large number of slices and speeding up the synthesis process.

Of course, this also means that your controller has to be written so that, on every clock cycle, it performs only one operation on the matrix, either a read or a write. You are not allowed to, for instance, read two different elements in the same clock cycle. If two different elements are needed in the same clock cycle (as your current code seems to require), rewrite this module so that two sets of input coordinates are available, along with two output data ports. Hopefully, the synthesizer will infer a dual-port memory block for it.
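A two-port variant of the memory module could be sketched as follows. This is only an illustration and has not been tested on hardware: the module name `matrix_dp` and the port names `i_a`, `j_a`, `i_b`, `j_b`, `dout_a` and `dout_b` are made up for the example, and it assumes a row-major layout in which `i` is the column (0..639) and `j` is the row (0..479), so that the address `j*640 + i` never exceeds 307199.

```verilog
// Hypothetical dual-port version of the matrix memory (a sketch only).
// Port A can write and read; port B only reads.
module matrix_dp (
  input  wire       clk,
  // Port A: read/write
  input  wire [9:0] i_a,          // column, 0..639 (assumed layout)
  input  wire [8:0] j_a,          // row, 0..479 (assumed layout)
  input  wire       write_enable,
  input  wire [7:0] din,
  output reg  [7:0] dout_a,
  // Port B: read only
  input  wire [9:0] i_b,
  input  wire [8:0] j_b,
  output reg  [7:0] dout_b
  );

  reg [7:0] M[0:307199];          // 640x480 pixels, 8 bits each

  // Row-major addresses, one per port
  wire [18:0] addr_a = j_a*640 + i_a;
  wire [18:0] addr_b = j_b*640 + i_b;

  // Port A: synchronous write, registered read
  always @(posedge clk) begin
    if (write_enable)
      M[addr_a] <= din;
    dout_a <= M[addr_a];
  end

  // Port B: registered read only
  always @(posedge clk) begin
    dout_b <= M[addr_b];
  end
endmodule
```

With this arrangement the controller could fetch the centre pixel on one port and a neighbour on the other, both arriving one clock after the coordinates are presented. Whether the tool actually maps it to dual-port block RAM should be checked in the synthesis report.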

As a test, instruct the synthesizer to synthesize only the matrix module and watch for synthesis messages regarding M being implemented using block RAM, absorbing this and that register, etc, to make sure it won't be implemented using distributed RAM again.
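If the report shows that the tool still picks distributed RAM, a synthesis attribute can be used to request block RAM explicitly. The sketch below uses the `ram_style` attribute that Xilinx synthesis tools accept. The module name `matrix_bram_hint` and its flat `addr` port are assumptions made for the example, and how the attribute is honoured depends on the tool version and target device.

```verilog
// Hypothetical stand-alone memory used only to check RAM inference.
module matrix_bram_hint (
  input  wire        clk,
  input  wire [18:0] addr,         // flat address, 0..307199
  input  wire        write_enable,
  input  wire [7:0]  din,
  output reg  [7:0]  dout
  );

  // Ask the synthesizer to map this array to block RAM
  (* ram_style = "block" *)
  reg [7:0] M[0:307199];

  always @(posedge clk) begin
    if (write_enable)
      M[addr] <= din;
    dout <= M[addr];               // registered (synchronous) read
  end
endmodule
```

Note that the array needs 307,200 x 8 = 2,457,600 bits, so the hint can only succeed on a device with at least that much block RAM.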

mcleod_ideafix
  • If the synthesizer cannot optimally handle multiplying with a constant, then you can do it manually. `i*640+j` can be written as `{ i, 9'h000 } + { i, 7'h00 } + j`. – Greg Nov 04 '15 at 21:13
  • It is not working directly but I changed the RAMs to Block RAMs manually. Now it's working but unfortunately it is taking more than 100% resources. I guess I have to think of something else. – Utkarsh Jain Nov 05 '15 at 14:40
  • You might have to leave some parallelism out of your algorithm and try first to implement it using a more traditional (e.g. sequential) approach. See this SO question for details on how to write an algorithm as a hardware module: http://stackoverflow.com/questions/32993428/best-way-to-convert-for-loops-into-an-fpga/33066594#33066594 – mcleod_ideafix Nov 05 '15 at 14:47
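
As a first step toward such a sequential design, the image-loading part of the question could be rewritten as a small controller that performs exactly one memory write per clock and uses non-blocking assignments throughout. This is a minimal sketch under stated assumptions: the module name `pixel_loader`, the synchronous reset `rst` and the `done` flag are hypothetical, and one pixel is assumed to arrive on `in` every clock, row by row.

```verilog
// Hypothetical pixel loader: drives the coordinate and write ports of a
// single-port matrix memory, one pixel per clock.
module pixel_loader (
  input  wire       clk,
  input  wire       rst,           // synchronous reset (assumed)
  input  wire [7:0] in,            // incoming pixel, one per clock (assumed)
  output reg  [9:0] i,             // column, 0..639
  output reg  [8:0] j,             // row, 0..479
  output wire       write_enable,
  output wire [7:0] din,
  output reg        done           // high once all 307,200 pixels are stored
  );

  // Write on every clock until the whole frame has been stored
  assign write_enable = ~done & ~rst;
  assign din          = in;

  always @(posedge clk) begin
    if (rst) begin
      i    <= 10'd0;
      j    <= 9'd0;
      done <= 1'b0;
    end else if (!done) begin
      if (i == 10'd639) begin
        i <= 10'd0;                // end of row: wrap the column counter
        if (j == 9'd479)
          done <= 1'b1;            // last pixel of the last row was written
        else
          j <= j + 9'd1;
      end else begin
        i <= i + 10'd1;
      end
    end
  end
endmodule
```

Once `done` is high, a second state machine can take over the coordinate ports and read one window pixel per clock to build the census bits.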