
I have a RISC-V processor and a programmable extension processor. In other words, the extension has its own unique ISA.

A program running on the RISC-V processor inserts instructions into this extension to perform convolution. For example, if I run the following code, the RISC-V processor inserts an lw instruction into the extension at every iteration. Here k corresponds to the register number and kernel_pointer is an address in the SRAM.

    for (int k = 0; k < input_channel_size; k++) {
        lw_encode(k, kernel_pointer);
        kernel_pointer++;
    }

This extension has its own DRAM and SRAM, both of which have to be managed manually, and I am having difficulty with that memory management. In my situation, DRAM capacity is effectively infinite and SRAM capacity is 1024 words. DRAM can only be accessed at 32-byte-aligned addresses, and each access has to transfer a multiple of 32 bytes.
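
For concreteness, this is the alignment arithmetic that the constraint implies (plain C, nothing extension-specific; the helper names are mine):

    #include <stdint.h>

    #define DRAM_BLOCK 32  /* bytes per DRAM transfer unit */

    /* Round a byte address down / up to a 32-byte boundary. */
    static inline uint32_t align_down(uint32_t addr) { return addr & ~(uint32_t)(DRAM_BLOCK - 1); }
    static inline uint32_t align_up(uint32_t addr)   { return (addr + DRAM_BLOCK - 1) & ~(uint32_t)(DRAM_BLOCK - 1); }

    /* Reading `len` bytes at `addr` actually transfers every block in
       [align_down(addr), align_up(addr + len)). */
    static inline uint32_t covering_blocks(uint32_t addr, uint32_t len) {
        return (align_up(addr + len) - align_down(addr)) / DRAM_BLOCK;
    }

So a 6-pixel (24-byte) request starting at byte offset 0 costs one block and over-fetches 8 bytes, while the same request starting at offset 24 straddles a boundary and costs two blocks.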

Each pixel is 32 bits. In almost all situations the whole image and kernel cannot fit into the SRAM, so I have to fetch the image in small pieces and compute iteratively. But since DRAM only permits 32-byte-aligned (not 32-bit-aligned) accesses, and each access transfers at least 32 bytes (not 32 bits), in many cases I have to fetch unnecessary pixels, or pixels that belong to the next operand.

For the SRAM addresses, I assign words 0~255 to the input image, 256~511 to the filter, and 512~755 to the destination, and I want to reserve 756~1023 for further extension. Of course this addressing is arbitrary and tentative; it can be changed.
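
In code, that map is just a handful of constants (names are mine, purely illustrative):

    /* SRAM word-address map (tentative, as described above). */
    enum {
        SRAM_IMAGE_BASE  = 0,    /*   0..255  input image tile   */
        SRAM_FILTER_BASE = 256,  /* 256..511  filter weights     */
        SRAM_DEST_BASE   = 512,  /* 512..755  destination buffer */
        SRAM_RESERVED    = 756,  /* 756..1023 reserved for later */
        SRAM_WORDS       = 1024,
    };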

In this situation, for example, let image[32][32][6] and filter[3][3][6]. If I consider the multiplications and additions in terms of filter[0][0][0:6], I have to fetch image[0:30][0:30][0:6]. Since the smallest DRAM fetch is 8 pixels (32 bytes at 32 bits per pixel), it is impossible to fetch exactly image[0][0][0:6], which is only 6 pixels; 2 of the fetched pixels are unnecessary. I am trying to resolve this by introducing a circular-queue concept, but I am not sure how to handle the details. And what about pixels out of scope that get fetched anyway? For example, image[31][31][0:6] might be fetched purely because of the DRAM access granularity.
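
Here is a minimal sketch of the ring bookkeeping I have in mind for the 256-word image region (all names are mine and purely illustrative). The idea is that over-fetched words are not an error: they sit in the ring and are either consumed later or skipped.

    #include <stdint.h>

    #define IMG_RING_BASE 0u    /* image region starts at SRAM word 0 */
    #define IMG_RING_SIZE 256u  /* and is 256 words long */

    typedef struct {
        uint32_t head;  /* next free word (where DRAM reads deposit data) */
        uint32_t tail;  /* next word to feed into lw_encode */
    } img_ring_t;

    /* Number of buffered words, including over-fetched ones. */
    static inline uint32_t ring_count(const img_ring_t *q) {
        return (q->head + IMG_RING_SIZE - q->tail) % IMG_RING_SIZE;
    }

    /* Consume n words without issuing lw: used to drop out-of-scope
       pixels such as image[31][31][0:6]. */
    static inline void ring_skip(img_ring_t *q, uint32_t n) {
        q->tail = IMG_RING_BASE + (q->tail - IMG_RING_BASE + n) % IMG_RING_SIZE;
    }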

Here is a code snippet for ease of understanding my situation. I think this code won't work as written, since it tries to access DRAM addresses that are not 32-byte aligned.

    if (input_channel_size < 16) { // weight-stationary case
        // The kernel is fetched once and kept resident in SRAM.
        dram_read_wrapping(kernel_pointer, dram_kernel, 9 * input_channel_size, k_corner, false);
        for (int m = 0; m < 9; m++) { // one iteration per kernel position
            // Load all channels of this kernel position into extension registers.
            for (int k = 0; k < input_channel_size; k++) {
                lw_encode(k, kernel_pointer);
                kernel_pointer++;
            }
            for (int i = 0; i < size_of_output * size_of_output;) {
                f_num = dram_read_wrapping(feature_pointer, dram_feature, input_channel_size, f_corner, true);
                for (int j = 0; j < f_num; j++) {
                    if (((j + 1) % size_of_output) == 0) {
                        // End of an output row: skip the two boundary columns.
                        feature_pointer += 2 * input_channel_size;
                        feature_pointer %= 0x100; // wrap inside the image region (words 0..255)
                        destination_end += 2;
                        if (destination_end >= 0x300)
                            dram_write_wrapping(destination_start, dram_result, destination_end, d_corner);
                        j++;
                        continue;
                    }
                    for (int k = 0; k < input_channel_size; k++) {
                        lw_encode(k, feature_pointer);
                        feature_pointer++;
                        feature_pointer %= 0x100; // wrap inside the image region
                    }
                    // mac_reserve();
                    for (int k = 0; k < 3; k++) {
                        mac_encode(31, 1 + k, 4 + k);
                    }
                    // mac_reserve();
                    sw_encode(31, destination_end++);
                    // was `destination_address`, which is never defined; `destination_end` is meant
                    if (destination_end == 0x300)
                        dram_write_wrapping(destination_start, dram_result, destination_end, d_corner);
                }
                i += f_num;
            }
            // Move the DRAM feature base to the next kernel position
            // (larger step at the end of each kernel row).
            if (m == 2 || m == 5) dram_feature += 2 * input_channel_size;
            else dram_feature++;
        }
    }

Is there any idea that would help in this situation? Thanks. Also, if any context is unclear, please let me know so that I can update the description.

laurent01

  • Can you be more specific? Are you supporting a call stack? What algorithm are you writing in assembly? What is your goal? What do you mean by corner? – Erik Eidt Jul 18 '20 at 14:44
  • I have updated the description to be more specific. Can you please read it again? Thank you. – laurent01 Jul 19 '20 at 03:26
  • I would experiment with your access patterns, with the following goals: (1) determine what should be cached in SRAM vs. what should not, and (2) work out how the algorithm might be modified so that it works on tiles that promote caching in SRAM, rather than as it is now. In regard to (1), you should be able to take the algorithm you have now and superimpose caching strategies on it so you can measure the differences between them; in regard to (2), try to identify changes in the ordering of the algorithm's progression such that it works more on things already available in SRAM. – Erik Eidt Jul 19 '20 at 04:07
  • You can google "cache blocking" aka "loop tiling" for more about such techniques. – Peter Cordes Jul 19 '20 at 04:34
  • @PeterCordes This article, which comes up when googling cache blocking, is really helpful for me: https://software.intel.com/content/www/us/en/develop/articles/cache-blocking-techniques.html I love this kind of comment. – laurent01 Jul 20 '20 at 00:17

1 Answer


Basically, I dealt with this problem with the method described below.

For the DRAM read case: if there is an unaligned access, I read everything across the block boundaries, which means I may fetch up to three blocks into the SRAM, and then consume the data through a cut-off pointer that skips the unwanted leading words.
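
As a sketch (dram_block_read is a placeholder for the actual block-transfer primitive, not a real API): fetch every block the request touches, then return the word offset of the first wanted word as the cut-off.

    #include <stdint.h>

    #define BLOCK_BYTES 32u
    #define WORD_BYTES  4u

    /* Placeholder: transfers `nblocks` whole 32-byte blocks from an
       aligned DRAM byte address into SRAM at sram_word_addr. */
    extern void dram_block_read(uint32_t sram_word_addr,
                                uint32_t dram_byte_addr, uint32_t nblocks);

    /* Read `nwords` words from the unaligned DRAM byte address `dram_addr`.
       Returns the cut-off: how many leading SRAM words to skip. */
    uint32_t dram_read_unaligned(uint32_t sram_base, uint32_t dram_addr, uint32_t nwords)
    {
        uint32_t first = dram_addr / BLOCK_BYTES;
        uint32_t last  = (dram_addr + nwords * WORD_BYTES - 1) / BLOCK_BYTES;

        dram_block_read(sram_base, first * BLOCK_BYTES, last - first + 1);
        return (dram_addr % BLOCK_BYTES) / WORD_BYTES;
    }

For up to 15 channels (60 bytes) a request can straddle two block boundaries, which is where "up to three blocks" comes from.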

For the DRAM write case: if there is an unaligned access, I first read the corresponding block(s) into the SRAM, write the data I want contiguously into that SRAM copy, and then write the whole block(s) back to the DRAM.
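
The write side is a read-modify-write, continuing the previous sketch (same BLOCK_BYTES / WORD_BYTES constants and dram_block_read; the extra primitives are also placeholders). The scratch area can live in the reserved 756~1023 region.

    /* Placeholder: the inverse of dram_block_read. */
    extern void dram_block_write(uint32_t sram_word_addr,
                                 uint32_t dram_byte_addr, uint32_t nblocks);
    /* Placeholder: SRAM-to-SRAM word copy. */
    extern void sram_copy_words(uint32_t dst, uint32_t src, uint32_t nwords);

    /* Write `nwords` words at SRAM `src` to the unaligned DRAM byte
       address `dram_addr`, merging into whole 32-byte blocks. */
    void dram_write_unaligned(uint32_t dram_addr, uint32_t src, uint32_t nwords,
                              uint32_t scratch)
    {
        uint32_t first   = dram_addr / BLOCK_BYTES;
        uint32_t last    = (dram_addr + nwords * WORD_BYTES - 1) / BLOCK_BYTES;
        uint32_t nblocks = last - first + 1;

        dram_block_read(scratch, first * BLOCK_BYTES, nblocks);   /* fetch covering blocks */
        sram_copy_words(scratch + (dram_addr % BLOCK_BYTES) / WORD_BYTES, src, nwords);
        dram_block_write(scratch, first * BLOCK_BYTES, nblocks);  /* write whole blocks back */
    }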

This technique makes the program work, and 'cache blocking' aka 'loop tiling', mentioned by @PeterCordes, makes it faster.
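
For reference, the tiling shape that helped looks roughly like this (TILE is whatever fits the 256-word image region, i.e. (TILE+2)^2 * input_channel_size <= 256; the loop body is the lw/mac/sw encoding from the question):

    for (int ty = 0; ty < size_of_output; ty += TILE)
        for (int tx = 0; tx < size_of_output; tx += TILE) {
            /* fetch the (TILE+2) x (TILE+2) input patch once ... */
            for (int m = 0; m < 9; m++)                  /* ... then reuse it  */
                for (int y = ty; y < ty + TILE; y++)     /* for all nine       */
                    for (int x = tx; x < tx + TILE; x++) /* kernel positions   */
                        ; /* lw/mac/sw encoding as in the question */
        }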

laurent01