I am looking for a very fast convolution function for the Raspberry Pi 2, written for ARMv7 with NEON (either a .s assembly file or intrinsics).
If this doesn't exist (I have searched for several days), any help writing it is welcome. I have begun reading the NEON Programmer's Guide, but it is very hard going...
I wrote a basic function in ARM assembly that takes an argument and returns a value, and I can call it from C++, so that part works.
I also ran a basic test that loads data into a NEON register with vld1_u8 and reads it back with vst1_u8, so the header and the compiler are OK...
The hardest part for me is designing the function and choosing the right instructions to implement it:
Data: 320x240 greyscale image (signed 8 bits per pixel)
Rate: 20 fps
Matrix: contains float values from -1 to 1 (no scaling factor, the coefficients sum to 0, size 7x7 but it can be zero-padded to 8x8).
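The zero padding to 8x8 mentioned above could be sketched as follows, so that each kernel row fills exactly eight lanes. This is only an illustration: the function name pad_kernel_7x7_to_8x8 and the row-major layout are assumptions, not something taken from an existing library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { K_SRC = 7, K_DST = 8 };  /* source and padded kernel widths */

/* Hypothetical helper: copy a row-major 7x7 float kernel into a row-major
 * 8x8 buffer. The extra column and the extra row are left at 0.0f, so they
 * contribute nothing to the convolution sum. */
void pad_kernel_7x7_to_8x8(const float *src, float *dst)
{
    assert(src != NULL && dst != NULL);

    /* All-zero bytes represent 0.0f for IEEE 754 floats. */
    memset(dst, 0, K_DST * K_DST * sizeof(float));

    for (int r = 0; r < K_SRC; ++r) {
        memcpy(dst + r * K_DST, src + r * K_SRC, K_SRC * sizeof(float));
    }
}
```

With this layout, row r of the padded kernel starts at index r * 8, and the eighth coefficient of every row is zero.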
What I have so far:
Load 8 pixels from memory into a 64-bit register:
uint8x8_t ui88Line1 = vld1_u8 ( Data + 8*0 );
Widen the 64-bit register to a 128-bit one, converting 8 bits to 16 bits (vmovl_u8 zero-extends; the signed variant is vmovl_s8):
uint16x8_t ui816Line1 = vmovl_u8 ( ui88Line1 );
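The two steps above can be put together in a small sketch. The helper name widen8_u8_to_s16 is hypothetical, and the scalar branch is only there so the code also builds on a machine without NEON. It assumes the pixels are stored as unsigned bytes; for signed bytes, vld1_s8 and vmovl_s8 would be the matching pair.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: load 8 unsigned 8-bit pixels and store them as
 * 8 signed 16-bit values. */
void widen8_u8_to_s16(const uint8_t *src, int16_t *dst)
{
    assert(src != NULL && dst != NULL);

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint8x8_t  px8  = vld1_u8(src);                 /* 8 x u8 in a 64-bit D register      */
    uint16x8_t px16 = vmovl_u8(px8);                /* zero-extend to 8 x u16 (128-bit Q) */
    int16x8_t  s16  = vreinterpretq_s16_u16(px16);  /* 0..255 fits in s16, so no bits change */
    vst1q_s16(dst, s16);
#else
    /* Scalar fallback with the same result, for non-NEON builds. */
    for (int i = 0; i < 8; ++i) {
        dst[i] = (int16_t)src[i];
    }
#endif
}
```

Because 0..255 is representable in a signed 16-bit lane, the reinterpret from u16 to s16 costs nothing and the values can then be multiplied by signed coefficients.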
For the rest, these are my open questions:
- Do I need to add 255 to my data instead of dealing with negative values, or should I convert u16 to s16?
- Do I need to apply a shift of 7 ( * 128 ) to the coefficients to preserve their fractional precision, or should I use the NEON float implementation?
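To make the fixed-point option concrete, here is a minimal sketch, assuming a shift of 7 (coefficients multiplied by 128) and unsigned 8-bit pixels. The names Q_SHIFT, quantize_kernel_q7 and dot8_q7 are hypothetical, and the scalar branch exists only so the code also builds without NEON. It is one possible arrangement, not a tuned implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

#define Q_SHIFT 7  /* assumed scale: coefficient * 128 */

/* Convert float coefficients in [-1, 1] to 16-bit fixed point,
 * rounding to nearest (halves round away from zero). */
void quantize_kernel_q7(const float *kf, int16_t *kq, size_t n)
{
    assert(kf != NULL && kq != NULL);
    for (size_t i = 0; i < n; ++i) {
        float v = kf[i] * (float)(1 << Q_SHIFT);
        kq[i] = (int16_t)(v >= 0.0f ? v + 0.5f : v - 0.5f);
    }
}

/* Multiply 8 pixels by 8 fixed-point coefficients and return the sum,
 * still scaled by 128. 32-bit accumulators are used because a single
 * product can reach 255 * 128 = 32640, so a 16-bit sum would overflow. */
int32_t dot8_q7(const uint8_t *px, const int16_t *kq)
{
    assert(px != NULL && kq != NULL);

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    int16x8_t p   = vreinterpretq_s16_u16(vmovl_u8(vld1_u8(px)));
    int16x8_t k   = vld1q_s16(kq);
    int32x4_t acc = vmull_s16(vget_low_s16(p), vget_low_s16(k));      /* lanes 0..3 */
    acc           = vmlal_s16(acc, vget_high_s16(p), vget_high_s16(k)); /* + lanes 4..7 */
    int32x2_t s   = vadd_s32(vget_low_s32(acc), vget_high_s32(acc));  /* 4 -> 2 partial sums */
    s             = vpadd_s32(s, s);                                  /* 2 -> 1 */
    return vget_lane_s32(s, 0);
#else
    int32_t acc = 0;
    for (int i = 0; i < 8; ++i) {
        acc += (int32_t)px[i] * (int32_t)kq[i];
    }
    return acc;
#endif
}
```

For a full 8x8 window the eight row sums would be added together first, and the total shifted right by Q_SHIFT once at the end. The worst case, 64 * 255 * 128, is about 2.1 million, which fits comfortably in 32 bits.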
I really need expert help to make the best choice.
Note: I have already implemented it in C/C++, and the OpenCV version is not optimized for this platform.