I am trying to implement a gamut mapping filter (an image processing algorithm) in hardware using Vivado HLS. I created a synthesizable version from Halide code, but it is far too slow: a 256x512 image took around 135 seconds, which shouldn't be the case. I have applied some optimizations, such as pipelining the innermost loop. I set a target initiation interval of II=1 for that loop, but the achieved II is 6. From the compiler warnings, I understand that this is caused by the accesses to the ctrl_pts and weights arrays. From the tutorials I have seen, array partitioning and array reshaping should help speed up these accesses. The code I am synthesizing is below:
//header
#include "hls_stream.h"
#include <ap_fixed.h>
//#include <ap_int.h>
#include "ap_int.h"
typedef ap_ufixed<24,24> bit_24;
typedef ap_fixed<11,8> fix;
typedef unsigned char uc;
typedef ap_uint<24> stream_width;
//typedef hls::stream<uc> Stream_t;
typedef hls::stream<stream_width> Stream_t;
struct pixel_f
{
float r;
float g;
float b;
};
struct pixel_8
{
uc r;
uc g;
uc b;
};
void gamut_transform(int rows,int cols,Stream_t& in,Stream_t& out, float ctrl_pts[3702][3],float weights[3702][3],float coefs[4][3],float num_ctrl_pts);
//core
//include the header
#include "gamut_header.h"
#include "hls_math.h"
void gamut_transform(int rows,int cols, Stream_t& in,Stream_t& out, float ctrl_pts[3702][3],float weights[3702][3],float coefs[4][3],float num_ctrl_pts)
{
#pragma HLS INTERFACE axis port=in
#pragma HLS INTERFACE axis port=out
//#pragma HLS INTERFACE fifo port=out
#pragma HLS dataflow
pixel_8 input;
pixel_8 new_pix;
bit_24 temp_in,temp_out;
pixel_f buff_1,buff_2,buff_3,buff_4,buff_5;
float dist;
for (int i = 0; i < 256; i++)
{
for (int j = 0; j < 512; j++)
{
temp_in = in.read();
input.r = (temp_in & 0xFF0000)>>16;
input.g = (temp_in & 0x00FF00)>>8;
input.b = (temp_in & 0x0000FF);
buff_1.r = ((float)input.r)/256.0;
buff_1.g = ((float)input.g)/256.0;
buff_1.b = ((float)input.b)/256.0;
for(int idx =0; idx < 3702; idx++)
{
buff_2.r = buff_1.r - ctrl_pts[idx][0];
buff_2.g = buff_1.g - ctrl_pts[idx][1];
buff_2.b = buff_1.b - ctrl_pts[idx][2];
dist = sqrt((buff_2.r*buff_2.r)+(buff_2.g*buff_2.g)+(buff_2.b*buff_2.b));
buff_3.r = buff_2.r + (weights[idx][0] * dist);
buff_3.g = buff_2.g + (weights[idx][1] * dist);
buff_3.b = buff_2.b + (weights[idx][2] * dist);
}
buff_4.r = buff_3.r + coefs[0][0] + buff_1.r* coefs[1][0] + buff_1.g * coefs[2][0] + buff_1.b* coefs[3][0];
buff_4.g = buff_3.g + coefs[0][1] + buff_1.r* coefs[1][1] + buff_1.g * coefs[2][1] + buff_1.b* coefs[3][1];
buff_4.b = buff_3.b + coefs[0][2] + buff_1.r* coefs[1][2] + buff_1.g * coefs[2][2] + buff_1.b* coefs[3][2];
buff_5.r = fmin(fmax((float)buff_4.r, 0.0), 255.0);
buff_5.g = fmin(fmax((float)buff_4.g, 0.0), 255.0);
buff_5.b = fmin(fmax((float)buff_4.b, 0.0), 255.0);
// use the clamped values computed above (buff_5), not the unclamped buff_4
new_pix.r = (uc)buff_5.r;
new_pix.g = (uc)buff_5.g;
new_pix.b = (uc)buff_5.b;
temp_out = ((uc)new_pix.r << 16 | (uc)new_pix.g << 8 | (uc)new_pix.b);
out<<temp_out;
}
}
}
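The pragma placement described at the top of the post can be sketched in isolation as follows. This is a minimal sketch, not a verified fix: the helper name `apply_ctrl_pts` is hypothetical, the loop body mirrors the posted innermost loop as written (each iteration overwrites the result), and whether II=1 is actually achieved depends on the tool and device. The `#pragma HLS` lines are ignored by an ordinary C++ compiler, so the function can also be checked in plain C simulation.

```cpp
#include <cmath>

const int NUM_CTRL = 3702;  // control-point count from the posted signature

struct pixel_f { float r, g, b; };

// Hypothetical helper that isolates the posted innermost loop.
pixel_f apply_ctrl_pts(pixel_f in,
                       float ctrl_pts[NUM_CTRL][3],
                       float weights[NUM_CTRL][3])
{
// Split the 3-wide second dimension so the r, g and b entries of one
// control point can be read in the same cycle.
#pragma HLS ARRAY_PARTITION variable=ctrl_pts complete dim=2
#pragma HLS ARRAY_PARTITION variable=weights complete dim=2
    pixel_f out = {0.0f, 0.0f, 0.0f};
    for (int idx = 0; idx < NUM_CTRL; idx++) {
#pragma HLS PIPELINE II=1
        float dr = in.r - ctrl_pts[idx][0];
        float dg = in.g - ctrl_pts[idx][1];
        float db = in.b - ctrl_pts[idx][2];
        float dist = std::sqrt(dr * dr + dg * dg + db * db);
        // Same as the posted code: the result is overwritten every iteration.
        out.r = dr + weights[idx][0] * dist;
        out.g = dg + weights[idx][1] * dist;
        out.b = db + weights[idx][2] * dist;
    }
    return out;
}
```

Partitioning only `dim=2` keeps the 3702-deep dimension in memory and exposes three parallel read ports per array, which is the access pattern the loop body needs.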
Even with the achieved II=6, the run takes around 6 seconds, whereas the target is a run time in the millisecond range. I tried pipelining the second innermost loop, but then the loop nested inside it (the idx loop) is fully unrolled and I run out of resources on my board. I am using a Zynq UltraScale+ board, which has a fair amount of resources. Any suggestions on optimizing the code will be highly appreciated.
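For a sense of scale, the cycle count implied by the loop bounds can be worked out directly. The sketch below is a back-of-the-envelope estimate only: the function names are hypothetical, the clock frequency is an input you must supply for your own design, and pipeline fill/flush overhead is ignored.

```cpp
#include <cstdint>

// Total trips through the innermost loop for one frame.
std::uint64_t inner_iterations(std::uint64_t rows,
                               std::uint64_t cols,
                               std::uint64_t num_ctrl_pts)
{
    return rows * cols * num_ctrl_pts;
}

// Approximate frame time: one new iteration starts every `ii` cycles.
double frame_seconds(std::uint64_t iterations, unsigned ii, double clock_hz)
{
    return static_cast<double>(iterations * ii) / clock_hz;
}
```

With the posted bounds, `inner_iterations(256, 512, 3702)` is 485,228,544. At II=6 that is about 2.9 billion cycles, and even at II=1 it is still about 485 million cycles, which at a hypothetical 300 MHz clock is roughly 1.6 seconds. So reaching milliseconds would need many control points processed per cycle, not just II=1 on a single-lane loop.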
Also, can anyone suggest which type of interface would be best suited for ctrl_pts, weights and coefs? For reading the image, I understand that a streaming interface helps, and for small values such as the number of rows and columns, AXI4-Lite is preferred. Is there an interface type I can use for these arrays that works hand in hand with array partitioning and array reshaping?
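To make the interface question concrete, here is a minimal sketch of one possible arrangement, not verified on hardware: scalars and the small coefs table on AXI4-Lite, the two large tables on AXI master ports, copied once into local arrays that can then be partitioned freely. The names `load_table`, `gamut_tables_sketch` and `probe` are hypothetical, and the `probe` output is only a placeholder so the sketch has an observable result; the per-pixel loop would take its place.

```cpp
const int NUM_CTRL = 3702;  // control-point count from the posted signature

// Hypothetical helper: copy a flat table from external memory into a
// local 2D array once, before any per-pixel processing starts.
static void load_table(const float *src, float dst[NUM_CTRL][3])
{
    for (int i = 0; i < NUM_CTRL; i++) {
        for (int c = 0; c < 3; c++) {
#pragma HLS PIPELINE II=1
            dst[i][c] = src[i * 3 + c];
        }
    }
}

// Hypothetical top level showing interface choices only.
void gamut_tables_sketch(int rows, int cols,
                         const float *ctrl_pts, const float *weights,
                         float coefs[4][3], float *probe)
{
// Small, rarely written values: AXI4-Lite registers.
#pragma HLS INTERFACE s_axilite port=rows
#pragma HLS INTERFACE s_axilite port=cols
#pragma HLS INTERFACE s_axilite port=coefs
#pragma HLS INTERFACE s_axilite port=probe
#pragma HLS INTERFACE s_axilite port=return
// Large tables: AXI master ports (3702 * 3 = 11106 floats each).
#pragma HLS INTERFACE m_axi port=ctrl_pts offset=slave depth=11106
#pragma HLS INTERFACE m_axi port=weights offset=slave depth=11106

    // Local copies; partitioning applies to these, not to the ports.
    static float ctrl_local[NUM_CTRL][3];
    static float weights_local[NUM_CTRL][3];
#pragma HLS ARRAY_PARTITION variable=ctrl_local complete dim=2
#pragma HLS ARRAY_PARTITION variable=weights_local complete dim=2

    (void)rows;  // unused in this sketch
    (void)cols;  // unused in this sketch

    load_table(ctrl_pts, ctrl_local);
    load_table(weights, weights_local);

    // Placeholder output; the per-pixel processing would go here instead.
    *probe = ctrl_local[NUM_CTRL - 1][2] + weights_local[NUM_CTRL - 1][2]
             + coefs[0][0];
}
```

The design choice illustrated is that partitioning or reshaping is applied to the on-chip copies, so the external port width and the internal memory layout can be chosen independently.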
Any suggestions will be highly appreciated.
Thanks in advance.
Edit: I understand that a fixed-point representation could bring the latency down further, but my first goal is to get the best possible result with the floating-point representation and then analyze the performance with a fixed-point representation.