I am looking for a very fast convolution function for the Raspberry Pi 2, written for ARMv7 with NEON (either a .s assembly file or intrinsics).

If this doesn't exist (I have searched for several days), any help writing it is welcome. I have started reading the NEON Programmer's Guide, but it is very hard going...

I wrote a basic function in ARM assembly that takes an argument and returns a value; I can call it from C++, so that part works.

I also ran a basic test that moves data into a NEON register with vld1_u8 and reads it back with vst1_u8, so the header and compiler setup are OK...

The hardest part for me is designing the function and choosing the right instructions to implement it:

Data: 320x240 grey scale image (signed 8 bits per pixel)
Rate: 20 fps
Matrix: float values from -1 to 1 (no scaling factor, coefficients sum to 0, size 7x7 but can be zero-padded to 8x8).

What I am trying to do:

  1. Load the data from memory into a 64-bit register:

    uint8x8_t ui88Line1 = vld1_u8 ( Data + 8*0 );
    
  2. Move the data from the 64-bit register to a 128-bit one, converting from 8 bits to 16 bits (signed):

    uint16x8_t ui816Kernel1 = vmovl_u8 ( ui88Kernel1 );
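
A minimal sketch of steps 1 and 2 with NEON intrinsics from `arm_neon.h`. The helper name `widen8_u8_to_s16` is hypothetical, and the scalar fallback is only there so the snippet also builds on a non-ARM machine. It assumes the pixels are stored unsigned (0 to 255); if they really are signed, `vld1_s8` and `vmovl_s8` are the signed counterparts.

```c
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: load 8 unsigned pixels and widen them to signed
 * 16-bit lanes. Values 0..255 always fit in int16_t, so reinterpreting
 * the zero-extended u16 lanes as s16 does not change any value. */
static void widen8_u8_to_s16(const uint8_t *src, int16_t *dst)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint8x8_t  px8  = vld1_u8(src);        /* step 1: 64-bit load (D register) */
    uint16x8_t px16 = vmovl_u8(px8);       /* step 2: zero-extend to 8 x u16   */
    vst1q_s16(dst, vreinterpretq_s16_u16(px16)); /* same bits, viewed as s16   */
#else
    /* Scalar fallback with the same result, for non-NEON builds. */
    for (int i = 0; i < 8; ++i)
        dst[i] = (int16_t)src[i];
#endif
}
```

Once the pixels sit in signed 16-bit lanes they can be multiplied directly by signed kernel coefficients, which is one way to avoid mixing unsigned and signed operands later on.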
    

For the rest, these are my open questions:

  • Do I need to add 255 to my data instead of dealing with negative values, or should I convert from u16 to s16?
  • Do I need to apply a shift of 7 (× 128) to preserve the fractional precision in fixed point, or should I use the NEON floating-point instructions?
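
To make the fixed-point option concrete, here is a scalar reference sketch, not a tuned NEON routine. It assumes Q7 weights (each float coefficient scaled by 128 and rounded to `int16_t`), unsigned 8-bit pixels, and no kernel flip (correlation, as is usual in image filtering). The names `quantize_kernel`, `convolve_at`, `KSIZE` and `Q_SHIFT` are hypothetical.

```c
#include <stdint.h>

#define KSIZE   7   /* 7x7 kernel, as in the question          */
#define Q_SHIFT 7   /* assumption: Q7 weights, i.e. scale = 128 */

/* Quantize float weights in [-1, 1] to Q7, rounding to nearest. */
static void quantize_kernel(const float *kf, int16_t *kq)
{
    for (int i = 0; i < KSIZE * KSIZE; ++i) {
        float s = kf[i] * (float)(1 << Q_SHIFT);
        kq[i] = (int16_t)(s >= 0.0f ? s + 0.5f : s - 0.5f);
    }
}

/* One output pixel at (x, y). The caller must keep a 3-pixel margin
 * from every image border; boundary handling is left out on purpose.
 * Worst case |acc| = 49 * 255 * 128 = 1,599,360, so a 32-bit
 * accumulator is needed; a 16-bit one would overflow. */
static int32_t convolve_at(const uint8_t *img, int stride,
                           int x, int y, const int16_t *kq)
{
    int32_t acc = 0;
    for (int ky = 0; ky < KSIZE; ++ky)
        for (int kx = 0; kx < KSIZE; ++kx)
            acc += (int32_t)img[(y + ky - 3) * stride + (x + kx - 3)]
                 * kq[ky * KSIZE + kx];
    /* Rounding shift back to pixel scale. Right-shifting a negative
     * value is implementation-defined in C; GCC and Clang perform an
     * arithmetic shift, which this sketch relies on. */
    return (acc + (1 << (Q_SHIFT - 1))) >> Q_SHIFT;
}
```

Under these assumptions no offset has to be added to the pixels: once they are widened to signed 16 bits, negative weights are handled by the signed multiply, and the NEON counterpart of the inner accumulation would be a widening multiply-accumulate such as `vmlal_s16`. Whether 7 fractional bits are accurate enough for a given kernel is a separate question that this sketch does not settle.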

I really need expert help to make the best choice.

Note: I have already implemented it in C/C++; the OpenCV version is not optimized for this platform.

escdr
  • At least show the C code you want to vectorise - that would also help implicitly answer some of the missing details, like what the output format needs to be and how you want to handle boundary conditions. Float vs. fixed point really depends on how much you value accuracy vs. speed. There is no right answer. – Notlikethat Sep 05 '16 at 09:42

0 Answers