
I am trying to implement an image processing algorithm for a gamut mapping filter in hardware using Vivado HLS. I have created a synthesizable version from Halide code, but it is taking way too long: for a 256x512 image it takes around 135 seconds, which shouldn't be the case. I have applied some optimizations, such as pipelining the innermost loop; with pipelining I set a target initiation interval of II=1 for the innermost loop, but the achieved II is 6. From the warnings thrown by the compiler, I understood that this is caused by the accesses to the weight arrays (ctrl_pts and weights), and from the tutorials I have seen that array partitioning and array reshaping should help with faster access to the weights. I have shared the code I used for synthesis below:

//header
include "hls_stream.h"
#include <ap_fixed.h>
//#include <ap_int.h>
#include "ap_int.h"
typedef ap_ufixed<24,24> bit_24;
typedef ap_fixed<11,8> fix;
typedef unsigned char uc;
typedef ap_uint<24> stream_width;
//typedef hls::stream<uc> Stream_t;

typedef hls::stream<stream_width> Stream_t;
struct pixel_f
{
    float r;
    float g;
    float b;
};

struct pixel_8
{
    uc r;
    uc g;
    uc b;
};
void gamut_transform(int rows,int cols,Stream_t& in,Stream_t& out, float ctrl_pts[3702][3],float weights[3702][3],float coefs[4][3],float num_ctrl_pts);


//core
//include the header
#include "gamut_header.h"
#include "hls_math.h"
void gamut_transform(int rows,int cols, Stream_t& in,Stream_t& out, float ctrl_pts[3702][3],float weights[3702][3],float coefs[4][3],float num_ctrl_pts)
{
#pragma HLS INTERFACE axis port=in
#pragma HLS INTERFACE axis port=out
//#pragma HLS INTERFACE fifo port=out
#pragma HLS dataflow
pixel_8 input;
pixel_8 new_pix;
bit_24 temp_in,temp_out;
pixel_f buff_1,buff_2,buff_3,buff_4,buff_5;
float dist;

for (int i = 0; i < 256; i++)
{
    for (int j = 0; i < 512; i++)
    {
        temp_in = in.read();
        input.r = (temp_in & 0xFF0000)>>16;
        input.g = (temp_in & 0x00FF00)>>8;
        input.b = (temp_in & 0x0000FF);

        buff_1.r = ((float)input.r)/256.0;
        buff_1.g = ((float)input.g)/256.0;
        buff_1.b = ((float)input.b)/256.0;
        
        for(int idx =0; idx < 3702; idx++)
        {
                
        buff_2.r = buff_1.r - ctrl_pts[idx][0];
        buff_2.g = buff_1.g - ctrl_pts[idx][1];
        buff_2.b = buff_1.b - ctrl_pts[idx][2];
        
        dist = sqrt((buff_2.r*buff_2.r)+(buff_2.g*buff_2.g)+(buff_2.b*buff_2.b));
        
        buff_3.r = buff_2.r + (weights[idx][0] * dist);
        buff_3.g = buff_2.g + (weights[idx][1] * dist);
        buff_3.b = buff_2.b + (weights[idx][2] * dist);
        }
    buff_4.r = buff_3.r + coefs[0][0] + buff_1.r* coefs[1][0] + buff_1.g * coefs[2][0] + buff_1.b* coefs[3][0];
    buff_4.g = buff_3.g + coefs[0][1] + buff_1.r* coefs[1][1] + buff_1.g * coefs[2][1] + buff_1.b* coefs[3][1];
    buff_4.b = buff_3.b + coefs[0][2] + buff_1.r* coefs[1][2] + buff_1.g * coefs[2][2] + buff_1.b* coefs[3][2];
    
    buff_5.r = fmin(fmax((float)buff_4.r, 0.0), 255.0);
    buff_5.g = fmin(fmax((float)buff_4.g, 0.0), 255.0);
    buff_5.b = fmin(fmax((float)buff_4.b, 0.0), 255.0);
    
    new_pix.r = (uc)buff_4.r;
    new_pix.g = (uc)buff_4.g;
    new_pix.b = (uc)buff_4.b;

    temp_out = ((uc)new_pix.r << 16 | (uc)new_pix.g << 8 | (uc)new_pix.b);

    out<<temp_out;
    }
}
}

Even with the achieved II=6, the time taken is around 6 seconds; the target is to bring the processing time down to milliseconds. I tried pipelining the second innermost loop instead, but then I run out of resources on my board because the innermost loop gets fully unrolled. I am using a Zynq UltraScale board, which has a fair amount of resources. Any suggestions on optimizing the code will be highly appreciated.

Also, can anyone suggest what type of interface would be best suited for ctrl_pts, weights, and coefs? For reading the image I understood that a streaming interface helps, and for reading small values like the number of rows and columns AXI-Lite is preferred. Is there a type of interface I can use for the mentioned arrays that goes hand in hand with array partitioning and array reshaping?

Any suggestions will be highly appreciated,

Thanks in advance

Edit: I understand that a fixed-point representation can bring the latency down further, but my first goal is to get the best result with the floating-point representation and then analyze the performance of a fixed-point representation.

1 Answer


There are some steps you can take to optimize your design, but bear in mind that if you really need a floating-point square root operation, it will most likely incur a huge latency penalty (unless it is properly pipelined, of course).

Your code might have a typo in the second inner loop: the index should be j, right?

Data Locality

First off: ctrl_pts is read multiple times from main memory (I assume). Since it's reused 256x512 times, it would be better to store it in a local buffer on the FPGA (which can be inferred as BRAM), like so:

  float ctrl_pts_local[3702][3]; // on-chip copy, inferred as BRAM

  for (int i = 0; i < 3702; i++) {
    for (int j = 0; j < 3; ++j) {
#pragma HLS PIPELINE II=1
      ctrl_pts_local[i][j] = ctrl_pts[i][j];
    }
  }

  for (int i = 0; i < 256; i++) {
    for (int j = 0; j < 512; j++) {
      // ...
      buff_2.r = buff_1.r - ctrl_pts_local[idx][0];
      // ...

Same reasoning goes for coefs and weights: just store them in local buffers before running the rest of the code. To access the arguments you can use an AXI4 master interface (m_axi) and configure it accordingly. Once the algorithm works on the local buffers, HLS should be able to partition them automatically. If not, you can place ARRAY_PARTITION complete dim=0 pragmas to force it.
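As a reference, here is a minimal sketch of how the interface and partitioning directives could look. The port names match the function signature above, but the m_axi depths, the s_axilite assignments and the chosen partition dimensions are assumptions that you would tune for your own system:

void gamut_transform(int rows, int cols, Stream_t& in, Stream_t& out,
                     float ctrl_pts[3702][3], float weights[3702][3],
                     float coefs[4][3], float num_ctrl_pts)
{
#pragma HLS INTERFACE axis      port=in
#pragma HLS INTERFACE axis      port=out
#pragma HLS INTERFACE m_axi     port=ctrl_pts offset=slave depth=11106
#pragma HLS INTERFACE m_axi     port=weights  offset=slave depth=11106
#pragma HLS INTERFACE m_axi     port=coefs    offset=slave depth=12
#pragma HLS INTERFACE s_axilite port=rows
#pragma HLS INTERFACE s_axilite port=cols
#pragma HLS INTERFACE s_axilite port=return

  float ctrl_pts_local[3702][3];
  float weights_local[3702][3];
  float coefs_local[4][3];
  // Partitioning the second (r/g/b) dimension lets a pipelined iteration
  // read all three components in the same cycle.
#pragma HLS ARRAY_PARTITION variable=ctrl_pts_local complete dim=2
#pragma HLS ARRAY_PARTITION variable=weights_local  complete dim=2
#pragma HLS ARRAY_PARTITION variable=coefs_local    complete dim=0

  // ... copy loops and the processing loops go here ...
}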

Dataflow

Because of the way your algorithm works, another thing you could try is to break down the main loop nest (256x512) into three smaller processes running in dataflow, and thus in parallel (+3 if you include the setup ones).

The whole code will look something like this (I hope it renders correctly):

[Compute buff_1]-->[FIFO1]-->[compute buff_3]-->[FIFO2a]-->[compute buff_4 and buff_5 + stream out]
              L-------------------------------->[FIFO2b]----^

One tricky thing would be to stream buff_1 to both the next processes.
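As a rough sketch of the plumbing, the FIFOs between the processes can be declared as hls::stream objects inside a DATAFLOW region. The names follow the diagram above; the depths are my own guesses and must be validated in cosimulation (fifo_2b in particular has to buffer buff_1 for roughly the latency of the middle process):

#pragma HLS DATAFLOW
  hls::stream<pixel_f> fifo_1("fifo_1");    // buff_1: Process_1 -> Process_2
  hls::stream<pixel_f> fifo_2a("fifo_2a");  // buff_3: Process_2 -> Process_3
  hls::stream<pixel_f> fifo_2b("fifo_2b");  // buff_1 bypass: Process_1 -> Process_3
#pragma HLS STREAM variable=fifo_1  depth=8
#pragma HLS STREAM variable=fifo_2a depth=8
#pragma HLS STREAM variable=fifo_2b depth=4096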

Possible Code

I won't try this code, so there might be compilation errors along the way, but the whole accelerator code would look something like this:

  for(int i =0; i < 3702; i++) {
    for (int j = 0; j < 3; ++j) {
#pragma HLS PIPELINE II=1
      ctrl_pts_local[i][j] = ctrl_pts[i][j];
      weights_local[i][j] = weights[i][j];
    }
  }

  for(int i =0; i < 4; i++) {
    for (int j = 0; j < 3; ++j) {
#pragma HLS PIPELINE II=1
      coefs_local[i][j] = coefs[i][j];
    }
  }

  Process_1:
  for (int i = 0; i < 256; i++) {
    for (int j = 0; j < 512; j++) {
#pragma HLS PIPELINE II=1
      temp_in = in.read();
      input.r = (temp_in & 0xFF0000)>>16;
      input.g = (temp_in & 0x00FF00)>>8;
      input.b = (temp_in & 0x0000FF);

      buff_1.r = ((float)input.r)/256.0;
      buff_1.g = ((float)input.g)/256.0;
      buff_1.b = ((float)input.b)/256.0;
      fifo_1.write(buff_1); // <--- WRITE TO FIFOs
      fifo_2b.write(buff_1);
    }
  }

  Process_2:
  for (int i = 0; i < 256; i++) {
    for (int j = 0; j < 512; j++) {
      for(int idx =0; idx < 3702; idx++) {
#pragma HLS LOOP_FLATTEN // <-- It shouldn't be necessary, since the if statements already help
#pragma HLS PIPELINE II=1 // <-- The PIPELINE directive can go here
        if (idx == 0) {
          buff_1 = fifo_1.read(); // <--- READ FROM FIFO
        }
        buff_2.r = buff_1.r - ctrl_pts_local[idx][0];
        buff_2.g = buff_1.g - ctrl_pts_local[idx][1];
        buff_2.b = buff_1.b - ctrl_pts_local[idx][2];
        
        dist = sqrt((buff_2.r*buff_2.r)+(buff_2.g*buff_2.g)+(buff_2.b*buff_2.b));
        
        buff_3.r = buff_2.r + (weights_local[idx][0] * dist);
        buff_3.g = buff_2.g + (weights_local[idx][1] * dist);
        buff_3.b = buff_2.b + (weights_local[idx][2] * dist);
        if (idx == 3702 - 1) {
          fifo_2a.write(buff_3); // <-- WRITE TO FIFO
        }
      }
    }
  }

  Process_3:
  for (int i = 0; i < 256; i++) {
    for (int j = 0; j < 512; j++) {
#pragma HLS PIPELINE II=1
      buff_3 = fifo_2a.read(); // <--- READ FROM FIFO
      buff_1 = fifo_2b.read(); // <--- READ FROM FIFO
      buff_4.r = buff_3.r + coefs_local[0][0] + buff_1.r* coefs_local[1][0] + buff_1.g * coefs_local[2][0] + buff_1.b* coefs_local[3][0];
      buff_4.g = buff_3.g + coefs_local[0][1] + buff_1.r* coefs_local[1][1] + buff_1.g * coefs_local[2][1] + buff_1.b* coefs_local[3][1];
      buff_4.b = buff_3.b + coefs_local[0][2] + buff_1.r* coefs_local[1][2] + buff_1.g * coefs_local[2][2] + buff_1.b* coefs_local[3][2];
      
      buff_5.r = fmin(fmax((float)buff_4.r, 0.0), 255.0);
      buff_5.g = fmin(fmax((float)buff_4.g, 0.0), 255.0);
      buff_5.b = fmin(fmax((float)buff_4.b, 0.0), 255.0);
      
      // Use the clamped values from buff_5 when converting back to 8 bits.
      new_pix.r = (uc)buff_5.r;
      new_pix.g = (uc)buff_5.g;
      new_pix.b = (uc)buff_5.b;

      temp_out = ((uc)new_pix.r << 16 | (uc)new_pix.g << 8 | (uc)new_pix.b);

      out<<temp_out;
    }
  }

Be extremely careful in sizing the depth of the FIFOs, since Process_2 (the one with the sqrt operation) will have slower data consumption and production rates! FIFO 2b in particular needs to account for that delay: if the rates don't match, there will be deadlocks. Make sure to have a meaningful testbench and to cosimulate your design. (A FIFO's depth can be changed with the pragma #pragma HLS STREAM variable=fifo_1 depth=N.)

Final Thoughts

There might be further smaller/more detailed optimizations that can be performed along the way, but I would start from the ones above, as they are the heaviest. Just bear in mind that floating-point processing is not optimal on FPGAs (as you noted) and is usually avoided.

EDIT: I tried the code with the modifications above and achieved II=1 with decent resource usage.

Since the II is now one, the ideal number of cycles the accelerator would take is 256x512x3702 = 485,228,544, and I'm very close to that (mine takes 485,228,587). One crazy idea I can now propose is to split the Process_2 innermost loop into two parallel branches (or even more than two), each feeding its own FIFO. Process_1 would supply the two branches, while an additional process/loop alternately reads the 256x512 elements from the two FIFOs and supplies them to Process_3 in the correct order. In this way, the total number of cycles required should roughly halve, since Process_2 is the slowest process in the dataflow (and so improving it improves the whole design). One possible drawback of this approach is a higher amount of area/resources required on the FPGA.
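To make the idea a bit more concrete, here is an untested sketch of the splitter and merger around the two Process_2 copies. All FIFO names here (fifo_1a, fifo_1b, fifo_3a, fifo_3b) are hypothetical, and Process_2a/Process_2b are simply two copies of the Process_2 loop above, each reading from its own input FIFO and handling half of the pixels:

  // In Process_1, alternate pixels between the two branches:
  if (j % 2 == 0) fifo_1a.write(buff_1);
  else            fifo_1b.write(buff_1);
  fifo_2b.write(buff_1); // bypass to Process_3 stays as before

  // Process_2a reads fifo_1a and writes fifo_3a; Process_2b reads fifo_1b
  // and writes fifo_3b; each iterates over half of the 256x512 pixels.

  Merge:
  for (int i = 0; i < 256; i++) {
    for (int j = 0; j < 512; j++) {
#pragma HLS PIPELINE II=1
      // Restore the original pixel order before feeding Process_3.
      pixel_f merged = (j % 2 == 0) ? fifo_3a.read() : fifo_3b.read();
      fifo_2a.write(merged);
    }
  }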

Good luck.

Stefano Ribes
  • Thanks a lot for the detailed response. Apologies for not responding earlier, as I was out of station. I will try to implement what you suggested in the edit. Thank you very much. – mutyala mahesh Mar 29 '21 at 07:02
  • Hello Stefano, I am relatively new to programming in Vivado HLS. I have understood what you explained theoretically. Can you explain how you initiated the FIFOs in your case? I have looked at examples of how to do it, and the difference between streams and FIFOs is a bit confusing. If you still have the code with the above-mentioned ideas where you achieved II=1, is it possible for you to share it so that I can run it on my side and understand this better? – mahesh mutyala Aug 05 '21 at 07:07
  • What do you mean by "initiate a FIFO"? A FIFO is initially empty and the consumer cannot read/pop from the FIFO before a producer has written/pushed data into it. A FIFO can be implemented in hardware in many ways (shift registers, LUTs, BRAMs, etc.), and Vivado HLS abstracts the hardware push/pop methods with a C++ class named hls::stream<>. – Stefano Ribes Aug 05 '21 at 15:02
  • Hello @StefanoRibes, while trying to implement the data locality part you suggested for caching, it's giving me the error described at https://forums.xilinx.com/t5/High-Level-Synthesis-HLS/Call-parameter-does-not-match-function-signature/td-p/1152464. Did you run into this issue? Unfortunately no answer was given on the forum where the question was asked. Also, can you please tell me how you chose the depths of the FIFOs? Whenever I try to break it into three processes it runs into an infinite loop with the streams; I think that's the issue you warned me about with the production and consumption rates. – mutyala mahesh Sep 10 '21 at 10:33
  • I looked for some standard examples of such an implementation but couldn't find any. Is it possible for me to have a look at the synthesizable version of the code you tried that achieved II=1, for a better understanding? Thank you – mutyala mahesh Sep 10 '21 at 10:42
  • For the FIFO depth, you can roughly estimate it based on the latency of the dataflow processes and their II. Think about how many cycles pass from when the data exits [Compute buff_1] to when it enters [compute buff_4 and buff_5 + stream out], passing through FIFO2a. That number of cycles is, roughly, the latency of [compute buff_3] (call it L) times the II. So FIFO2b should be able to accept and accommodate at least roughly L*II elements before they are consumed by the [compute buff_4 and buff_5 + stream out] process. – Stefano Ribes Sep 10 '21 at 14:16