I have a RISC-V processor and a programmable extension processor. In other words, the extension has its own unique ISA.
A program running on the RISC-V processor issues instructions to this extension to perform convolution. For example, when I run the following code, the RISC-V processor issues one lw instruction to the extension on every iteration. Here k is the register number and kernel_pointer is an SRAM address.
for(int k = 0; k < input_channel_size; k++){
    lw_encode(k, kernel_pointer);
    kernel_pointer++;
}
The extension has its own DRAM and SRAM, both of which have to be managed manually, and I am having difficulty with this memory management. In my situation, DRAM capacity is effectively unlimited and SRAM capacity is 1024 words. DRAM can only be accessed at 32-byte-aligned addresses, and every access must transfer a multiple of 32 bytes.
Each pixel is 32 bits. In almost every case the whole image and kernel cannot fit into the SRAM, so I have to fetch the image a small piece at a time and compute iteratively. But because DRAM only permits accesses at 32-byte-aligned addresses (not 32-bit-aligned ones), and because each access has to fetch at least 32 bytes (8 pixels) rather than 32 bits, in many cases I end up fetching pixels that are either unnecessary or belong to the next operand.
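To make the alignment constraint concrete, here is a minimal sketch of how a request for an arbitrary run of pixels could be widened to a legal DRAM transfer. The names dram_window_t and dram_align_window are hypothetical, and the sketch assumes DRAM addresses are counted in 32-bit words, so that a 32-byte burst is 8 words.

```c
#include <assert.h>
#include <stdint.h>

#define WORD_BYTES  4u                          /* one pixel is 32 bits */
#define BURST_BYTES 32u                         /* DRAM alignment and granularity */
#define BURST_WORDS (BURST_BYTES / WORD_BYTES)  /* 8 words per burst */

/* Hypothetical descriptor of the aligned transfer that covers a request. */
typedef struct {
    uint32_t start_word;  /* aligned DRAM word address where the transfer begins */
    uint32_t num_words;   /* transfer length in words, a multiple of BURST_WORDS */
    uint32_t skip_words;  /* leading words fetched only to satisfy alignment */
} dram_window_t;

/* Widen the request [first_word, first_word + count) to burst boundaries.
 * Assumption: DRAM is addressed in 32-bit words. */
static dram_window_t dram_align_window(uint32_t first_word, uint32_t count)
{
    dram_window_t w;
    uint32_t end_word;
    uint32_t aligned_end;

    assert(count > 0);
    end_word = first_word + count;  /* one past the last wanted word */

    /* Round the start down and the end up to a multiple of BURST_WORDS. */
    w.start_word = first_word - (first_word % BURST_WORDS);
    aligned_end = (end_word + BURST_WORDS - 1u) / BURST_WORDS * BURST_WORDS;

    w.num_words = aligned_end - w.start_word;
    w.skip_words = first_word - w.start_word;
    return w;
}
```

For a 6-channel pixel starting at word 6, this yields a 16-word transfer starting at word 0 with 6 leading words to skip; the words after the requested range are the "next operand" pixels mentioned above.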
For the SRAM layout, I will assign addresses 0 ~ 255 to the input image, 256 ~ 511 to the filter, and 512 ~ 755 to the output, and I want to reserve 756 ~ 1023 for future extension. Of course this assignment is arbitrary and can be changed.
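The layout above could be written down as named constants so that the thresholds in the code come from one place. This is only a sketch; all macro names and the helper sram_output_is_full are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the SRAM map described above (units: 32-bit words). */
#define SRAM_TOTAL_WORDS   1024u
#define SRAM_INPUT_BASE    0u     /* 0 ~ 255: input image */
#define SRAM_INPUT_SIZE    256u
#define SRAM_FILTER_BASE   256u   /* 256 ~ 511: filter */
#define SRAM_FILTER_SIZE   256u
#define SRAM_OUTPUT_BASE   512u   /* 512 ~ 755: output */
#define SRAM_OUTPUT_SIZE   244u
#define SRAM_RESERVED_BASE 756u   /* 756 ~ 1023: reserved */
#define SRAM_RESERVED_SIZE 268u

/* Compile-time checks that the regions are contiguous and fill the SRAM. */
_Static_assert(SRAM_INPUT_BASE + SRAM_INPUT_SIZE == SRAM_FILTER_BASE, "input/filter gap");
_Static_assert(SRAM_FILTER_BASE + SRAM_FILTER_SIZE == SRAM_OUTPUT_BASE, "filter/output gap");
_Static_assert(SRAM_OUTPUT_BASE + SRAM_OUTPUT_SIZE == SRAM_RESERVED_BASE, "output/reserved gap");
_Static_assert(SRAM_RESERVED_BASE + SRAM_RESERVED_SIZE == SRAM_TOTAL_WORDS, "map != SRAM size");

/* Returns 1 when the next output word would fall outside the output region. */
static int sram_output_is_full(uint32_t destination_end)
{
    assert(destination_end >= SRAM_OUTPUT_BASE);
    return destination_end >= SRAM_OUTPUT_BASE + SRAM_OUTPUT_SIZE;
}
```

Note that 0x300 is 768, which lies past the 512 ~ 755 output region in this map, so a flush threshold derived from the constants may be safer than a literal.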
For example, take image[32][32][6] and filter[3][3][6]. To do the multiplications and additions for filter[0][0][0:6], I have to fetch image[0:30][0:30][0:6]. Since the smallest amount that can be fetched is 8 pixels (32 bytes), it is impossible to fetch image[0][0] by itself: it has only 6 pixels (one per channel), which means 2 of the fetched pixels are unnecessary. I'm trying to resolve this issue by introducing a circular queue, but I am not sure how to handle this situation. And what if pixels outside the needed range are fetched? For example, image[31][31][0:6] might be fetched because of the DRAM access granularity.
Here is a code snippet to help explain my situation. I think this code won't work, since it tries to access DRAM addresses that are not 32-byte aligned.
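One way the circular queue idea could look is sketched below, assuming the input region is the 256-word block at SRAM address 0 and that DRAM data always arrives in whole 8-word bursts. With 6 channels a pixel group is 24 bytes, so group boundaries and 32-byte bursts line up again every 96 bytes (4 groups, 3 bursts); in between, the extra words of a burst simply stay in the queue as the start of the next group. All names here (ring_t, ring_push, ring_discard, ring_pop_word) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define RING_BASE   0u    /* assumed start of the input region in SRAM */
#define RING_WORDS  256u  /* assumed size of the input region, in words */
#define BURST_WORDS 8u    /* one 32-byte DRAM burst */

typedef struct {
    uint32_t head;   /* offset of the oldest unread word within the region */
    uint32_t count;  /* number of valid words currently buffered */
} ring_t;

/* SRAM address where the next DRAM burst should be written. */
static uint32_t ring_tail(const ring_t *r)
{
    return RING_BASE + (r->head + r->count) % RING_WORDS;
}

/* Returns 1 if `words` more words fit without overwriting unread data. */
static int ring_can_push(const ring_t *r, uint32_t words)
{
    return r->count + words <= RING_WORDS;
}

/* Record that `words` words were written starting at ring_tail(). */
static void ring_push(ring_t *r, uint32_t words)
{
    assert(ring_can_push(r, words));
    r->count += words;
}

/* Drop words that were fetched only because of alignment, or that lie
 * outside the needed range (for example, past the last row). */
static void ring_discard(ring_t *r, uint32_t words)
{
    assert(words <= r->count);
    r->head = (r->head + words) % RING_WORDS;
    r->count -= words;
}

/* Return the SRAM address of the next word to load and consume it. */
static uint32_t ring_pop_word(ring_t *r)
{
    uint32_t addr;

    assert(r->count > 0);
    addr = RING_BASE + r->head;
    r->head = (r->head + 1u) % RING_WORDS;
    r->count--;
    return addr;
}
```

Because RING_WORDS is a multiple of BURST_WORDS, a queue that starts empty at offset 0 and is only ever filled in whole bursts keeps its tail burst-aligned, so a single burst never wraps around the end of the region. Over-fetched words that are out of range can be removed with ring_discard, provided the DRAM buffer itself is padded to a multiple of 32 bytes so the final burst is a legal read.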
if(input_channel_size < 16){ // weight_stationary_case
    // fetch the whole 3x3 kernel (9 taps, input_channel_size words each) into SRAM
    dram_read_wrapping(kernel_pointer, dram_kernel, 9 * input_channel_size, k_corner, false);
    for(int m = 0; m < 9; m++){ // one pass per kernel tap
        // load the weights of tap m into registers 0 .. input_channel_size - 1
        for(int k = 0; k < input_channel_size; k++){
            lw_encode(k, kernel_pointer);
            kernel_pointer++;
        }
        for(int i = 0; i < size_of_output * size_of_output;){
            f_num = dram_read_wrapping(feature_pointer, dram_feature, input_channel_size, f_corner, true);
            for(int j = 0; j < f_num; j++){
                if(((j + 1) % size_of_output) == 0){
                    // end of an output row: skip the two border pixels
                    feature_pointer += 2 * input_channel_size;
                    feature_pointer = feature_pointer % 0x100;
                    destination_end += 2;
                    if(destination_end >= 0x300) dram_write_wrapping(destination_start, dram_result, destination_end, d_corner);
                    j++;
                    continue;
                }
                for(int k = 0; k < input_channel_size; k++){
                    lw_encode(k, feature_pointer);
                    feature_pointer++;
                    feature_pointer = feature_pointer % 0x100; // wrap within the input region
                }
                // mac_reserve();
                for(int k = 0; k < 3; k++){
                    mac_encode(31, 1 + k, 4 + k);
                }
                // mac_reserve();
                sw_encode(31, destination_end++);
                if(destination_end == 0x300) dram_write_wrapping(destination_start, dram_result, destination_end, d_corner);
            }
            i += f_num;
        }
        if(m == 2 || m == 5) dram_feature += 2 * input_channel_size;
        else dram_feature++;
    }
}
Are there any ideas that would help in this situation? Thanks. Also, if any of the context is unclear, please let me know so that I can update the description.