I am trying to get familiar with Neon instructions. Both assembly and intrinsics. I usee gcc V4.8.2 hardfp I would like to use the NEON memcpy with preload accordindg to :
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka13544.html
I have also found this topic : ARM memcpy and alignment but this is slightly different from the official ARM page implementation.
Unfortunately I have never used .s with .c files at the same time so I need some help. My .c file looks like this:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <time.h>
#include <stdint.h>
#include <arm_neon.h>
int main()
{
clock_t start, end; // timer variables
uint32_t i,X=100;
size_t size = 2048*32/* arbitrary */;
size_t offset = 1;
char* src = malloc(sizeof(char)*(size + offset));
char* dst = malloc(sizeof(char)*(size));
NEONCopyPLD( dst, src + offset, size );
memcpy( dst, src + offset, size );
return(0);
}
and the assembly.s file is the following:
.global NEONCopyPLD
NEONCopyPLD:
PLD [r1, #0xC0]
VLDM r1!,{d0-d7}
VSTM r0!,{d0-d7}
SUBS r2,r2,#0x40
BGE NEONCopyPLD
I compile the following program by using the instruction :
arm-linux-gnueabihf-gcc -mthumb -march=armv7-a -mtune=cortex-a9 -mcpu=cortex-a9 -mfloat-abi=hard -mfpu=neon -Ofast -fprefetch-loop-arrays assembly.s asm_pr.c -o output
and I get the following error:
potentially unexpected fatal signal 11.
CPU: 0 PID: 670 Comm: out_asm Not tainted 3.10.9-rt5+ #2
task: bf907c00 ti: bef4a000 task.ti: bef4a000
PC is at 0x4c90ce LR is at 0x852d
pc : [<004c90ce>] lr : [<0000852d>] psr: 40030030
sp : 7e958cb0 ip : 00000107 fp : 00000000
r10: 76f91000 r9 : 00000000 r8 : 00000000
r7 : 00001017 r6 : 00e85010 r5 : 00e75009 r4 : 00010001
r3 : 000f4240 r2 : 00010000 r1 : 00e75009 r0 : 00e85010
Flags: nZcv IRQs on FIQs on Mode USER_32 ISA Thumb Segment user
Control: 10c5387d Table: 4ef7404a DAC: 00000015
CPU: 0 PID: 670 Comm: out_asm Not tainted 3.10.9-rt5+ #2
Backtrace:
[<800120a4>] (dump_backtrace+0x0/0x118) from [<80012318>] (show_stack+0x20/0x24)
[<800122f8>] (show_stack+0x0/0x24) from [<804fab0c>] (dump_stack+0x24/0x28)
[<804faae8>] (dump_stack+0x0/0x28) from [<8000f560>] (show_regs+0x30/0x34)
[<8000f530>] (show_regs+0x0/0x34) from [<8003349c>](get_signal_to_deliver+0x318/0x668)
[<80033184>] (get_signal_to_deliver+0x0/0x668) from [<80011664>] (do_signal+0x11c/0x450)
[<80011548>] (do_signal+0x0/0x450) from [<80011b20>] (do_work_pending+0x74/0xac)
[<80011aac>] (do_work_pending+0x0/0xac) from [<8000e500>] (work_pending+0xc/0x20)
Segmentation fault
Another question I have is if we can we use SIMD instructions (intrinsics or autovectorization) to speed up the initialization of an array with 0? I have noticed that the following code cannot be autovectorized :
for (i=0;i<N;i++)
*(a++)=0;
however this block of code can be autovectorized:
for (i=0;i<N;i++)
a[i]=i;
My ultimate goal is to investigate if I can have a NEON function that runs faster than memset()
.
Finally i would like to ask something on unvectorizable loops. According to : http://gcc.gnu.org/projects/tree-ssa/vectorization.html#unvectoriz the following code cannot be autovectorized:
while (*p != NULL) {
*q++ = *p++;
}
However is it possible to use intrinsics or assembly to develop a faster version of this loop? If you have done something similar could you please post it here?