NEON memcpy , memset and using .c with .s files

Question

I am trying to get familiar with Neon instructions. Both assembly and intrinsics. I usee gcc V4.8.2 hardfp I would like to use the NEON memcpy with preload accordindg to :

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka13544.html

I have also found this topic : ARM memcpy and alignment but this is slightly different from the official ARM page implementation.

Unfortunately I have never used .s with .c files at the same time so I need some help. My .c file looks like this:

       #include <stdlib.h>
       #include <stdio.h>
       #include <string.h>
       #include <math.h>
       #include <time.h>
       #include <stdint.h>
       #include <arm_neon.h> 

       int main()
       {

           clock_t start, end;           // timer variables
           uint32_t i,X=100;

           size_t size = 2048*32/* arbitrary */;
           size_t offset = 1;
           char* src = malloc(sizeof(char)*(size + offset));
           char* dst = malloc(sizeof(char)*(size));

           NEONCopyPLD( dst, src + offset, size );
           memcpy( dst, src + offset, size );
           return(0);
       }

and the assembly.s file is the following:

       .global NEONCopyPLD
       NEONCopyPLD:
             PLD [r1, #0xC0]
             VLDM r1!,{d0-d7}
             VSTM r0!,{d0-d7}
             SUBS r2,r2,#0x40
             BGE NEONCopyPLD

I compile the following program by using the instruction :

arm-linux-gnueabihf-gcc -mthumb -march=armv7-a -mtune=cortex-a9 -mcpu=cortex-a9 -mfloat-abi=hard -mfpu=neon -Ofast -fprefetch-loop-arrays assembly.s asm_pr.c -o output

and I get the following error:

 potentially unexpected fatal signal 11.

 CPU: 0 PID: 670 Comm: out_asm Not tainted 3.10.9-rt5+ #2
 task: bf907c00 ti: bef4a000 task.ti: bef4a000
 PC is at 0x4c90ce LR is at 0x852d
 pc : [<004c90ce>]    lr : [<0000852d>]    psr: 40030030
 sp : 7e958cb0  ip : 00000107  fp : 00000000
 r10: 76f91000  r9 : 00000000  r8 : 00000000
 r7 : 00001017  r6 : 00e85010  r5 : 00e75009  r4 : 00010001
 r3 : 000f4240  r2 : 00010000  r1 : 00e75009  r0 : 00e85010
 Flags: nZcv  IRQs on  FIQs on  Mode USER_32  ISA Thumb  Segment user
 Control: 10c5387d  Table: 4ef7404a  DAC: 00000015
 CPU: 0 PID: 670 Comm: out_asm Not tainted 3.10.9-rt5+ #2
 Backtrace:
 [<800120a4>] (dump_backtrace+0x0/0x118) from [<80012318>] (show_stack+0x20/0x24)
 [<800122f8>] (show_stack+0x0/0x24) from [<804fab0c>] (dump_stack+0x24/0x28)
 [<804faae8>] (dump_stack+0x0/0x28) from [<8000f560>] (show_regs+0x30/0x34)
 [<8000f530>] (show_regs+0x0/0x34) from [<8003349c>](get_signal_to_deliver+0x318/0x668)   
 [<80033184>] (get_signal_to_deliver+0x0/0x668) from [<80011664>] (do_signal+0x11c/0x450)
 [<80011548>] (do_signal+0x0/0x450) from [<80011b20>] (do_work_pending+0x74/0xac)
 [<80011aac>] (do_work_pending+0x0/0xac) from [<8000e500>] (work_pending+0xc/0x20)
 Segmentation fault

Another question I have is if we can we use SIMD instructions (intrinsics or autovectorization) to speed up the initialization of an array with 0? I have noticed that the following code cannot be autovectorized :

   for (i=0;i<N;i++)
        *(a++)=0;

however this block of code can be autovectorized:

   for (i=0;i<N;i++)
       a[i]=i;

My ultimate goal is to investigate if I can have a NEON function that runs faster than memset().

Finally i would like to ask something on unvectorizable loops. According to : http://gcc.gnu.org/projects/tree-ssa/vectorization.html#unvectoriz the following code cannot be autovectorized:

           while (*p != NULL) {
              *q++ = *p++;
           }

However is it possible to use intrinsics or assembly to develop a faster version of this loop? If you have done something similar could you please post it here?

NEONCopyPLD -> "NEONCopyPLD:" notice the semicolon. don't put many (three?) questions in same post. thing about neon is, it is a SIMD unit it should be independent of memory speeds however most actual implementations give SIMD its own memory port thus NEON unit might have faster access to memory. Bottom line is if you are a vendor you would know if memset should use NEON, but in general it shouldn't be slower than regular core. — auselen, Apr 24 '14 at 10:59
Thanks for your comment. I changed some other things as well as inserting a semi column and I was able to compile. Although I am getting a segmentation fault! — Nick, Apr 24 '14 at 12:52
You should ask one thing at a time. Do a clean up, and ask one thing. Then if you have another question, either post another one or wait for the first one to see if you'll get what you want anyway. — auselen, Apr 24 '14 at 13:23

score 2 · Answer 1 · answered May 09 '14 at 11:55

Unrelated to your question as such, but your code sample as shown cannot work correctly. That's because you seem to have alignment traps active, and are hitting one:

       [ ... ]
       size_t offset = 1;
       char* src = malloc(sizeof(char)*(size + offset));
       [ ... ]
       NEONCopyPLD( dst, src + offset, size );

r7 : 00001017  r6 : 00e85010  r5 : 00e75009  r4 : 00010001
r3 : 000f4240  r2 : 00010000  r1 : 00e75009  r0 : 00e85010
                                   ^^^^^^^^

You're using a misaligned pointer with VLDM (src is never aligned due to offset == 1).

From the reg dump, since your Neon asm function doesn't use R5 on its own, the fact you're seeing R1 == R5 makes me conclude you're running with alignment traps enabled, and get the SIGSEGV the very first time you hit that VLDM.
That's because you're not using R5 in your assembly, so the value being there has been used before by the C function; therefore R1 and R5 not being different means R1 hasn't changed before the trap was taken, and that means the VLDM R1!,... cannot have executed even once.

score 1 · Answer 2 · answered Apr 27 '14 at 11:23

You never return from your assembler functions. Therefore whatever code is stored below the assembler function will get executed. This will lead to a crash sooner or later.

Exit your functions with:

mov pc, lr

This will very likely fix your problems. You should also check which registers (neon and general purpose registers) you have to preserve during assembler function calls.

This page here is a useful resource that shows examples how to do this: http://omappedia.org/wiki/Writing_ARM_Assembly

score 1 · Answer 3 · answered May 08 '14 at 05:38

You may google for "aosp bionic memcpy".

It's not a perfect, but quite a decent implementation.

I suggest you starting with memset instead though since memcpy is much more complicated than you might think.

Analyze bionic memset, try to understand the flow, and ask if you don't understand why the author did something in that particular way.

And I also don't understand why you are talking about auto-vectorization which is utterly useless IMO.

Please, do some study on your own first, and ask if you get stuck.

To answer this particular question, it would take a whole tutorial consisting of multiple chapters, starting with basic ARM instructions.

Thanks for your answer. i will look at bionic memcpy and memset ! — Nick, May 08 '14 at 11:50

NEON memcpy , memset and using .c with .s files

3 Answers3