
SHORT VERSION: Performance metrics of the memcpy that gets pulled from the GNU ARM toolchain seem to vary wildly on ARM Cortex-M7 for different copy sizes, even though the code that copies the data always stays the same. What could be the cause of this?

LONG VERSION:

I am part of a team developing on an stm32f765 microcontroller with the GNU Arm toolchain 11.2, linking the newlib-nano implementation of the stdlib into our code.

Recently, memcpy performance became a bottleneck in our project, and we discovered that the memcpy implementation that gets pulled into our code from newlib-nano was a simple byte-wise copy, which in hindsight should not have been surprising given that the newlib-nano library is code-size optimized (compiled with -Os).

Looking at the source code of cygwin-newlib, I managed to track down the exact memcpy implementation that gets compiled and packaged with the nano library for ARMv7-M:

void *
__inhibit_loop_to_libcall
memcpy (void *__restrict dst0,
    const void *__restrict src0,
    size_t len0)
{
#if defined(PREFER_SIZE_OVER_SPEED) || defined(__OPTIMIZE_SIZE__)
  char *dst = (char *) dst0;
  char *src = (char *) src0;

  void *save = dst0;

  while (len0--)
    {
      *dst++ = *src++;
    }

  return save;
#else
(...)
#endif
}

We decided to replace the newlib-nano memcpy implementation in our code with our own, while sticking with newlib-nano for everything else. In the process, we decided to gather some performance metrics to compare the new implementation with the old one.

However, making sense of the obtained metrics proved to be a challenge for me.

Measurement results: Performance metrics obtained from profiling different memcpy implementations on ARM Cortex-M7

All the results in the table are cycle counts, obtained from reading DWT-CYCCNT values (more info on the actual measurement setup will be given below).

In the table, 3 different memcpy implementations were compared. The first one is the default one that gets linked from the newlib-nano library, as suggested by the label memcpy_nano. The second and third ones are the most naive data copy implementations in C: one copies the data byte by byte, the other word by word:

void *
memcpy_naive_bytewise(void *restrict dest, void *restrict src, size_t size)
{
    uint8_t *restrict u8_src = src,
            *restrict u8_dest = dest;

    for (size_t idx = 0; idx < size; idx++) {
        *u8_dest++ = *u8_src++;
    }

    return dest;
}
void *
memcpy_naive_wordwise(void *restrict dest, void *restrict src, size_t size)
{
    uintptr_t upt_dest = (uintptr_t)dest;

    uint8_t *restrict u8_dest = dest,
            *restrict u8_src  = src;

    while (upt_dest++ & !ALIGN_MASK) {
        *u8_dest++ = *u8_src++;
        size--;
    }

    word *restrict word_dest = (void *)u8_dest,
         *restrict word_src  = (void *)u8_src;

    while (size >= sizeof *word_dest) {
        *word_dest++ = *word_src++;
        size -= sizeof *word_dest;
    }

    u8_dest = (void *)word_dest;
    u8_src  = (void *)word_src;

    while (size--) {
        *u8_dest++ = *u8_src++;
    }

    return dest;
}
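
Note: word and ALIGN_MASK are not shown above; for the purposes of this question, assume definitions along these lines:

/* assumed definitions: the copy unit is one 32-bit word, and ALIGN_MASK
   holds the low address bits that must be zero for word alignment */
typedef uint32_t word;
#define ALIGN_MASK ((uintptr_t)(sizeof(word) - 1))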

I am unable, for the life of me, to figure out why the performance of memcpy_nano resembles that of the naive word-by-word copy implementation at first (up to 256-byte copies), only to start resembling the performance of the naive byte-by-byte copy implementation from 256-byte copies upwards.

I have triple-checked that the expected memcpy implementation is indeed linked with my code for every copy size that was measured. For example, this is the memcpy disassembly obtained for the code measuring the performance of a 16-byte copy vs a 256-byte copy (where the discrepancy first arises):

  • memcpy definition linked for the 16 byte-sized copy (newlib-nano memcpy):
08007a74 <memcpy>:
 8007a74:   440a        add r2, r1
 8007a76:   4291        cmp r1, r2
 8007a78:   f100 33ff   add.w   r3, r0, #4294967295
 8007a7c:   d100        bne.n   8007a80 <memcpy+0xc>
 8007a7e:   4770        bx  lr
 8007a80:   b510        push    {r4, lr}
 8007a82:   f811 4b01   ldrb.w  r4, [r1], #1
 8007a86:   f803 4f01   strb.w  r4, [r3, #1]!
 8007a8a:   4291        cmp r1, r2
 8007a8c:   d1f9        bne.n   8007a82 <memcpy+0xe>
 8007a8e:   bd10        pop {r4, pc}
  • memcpy definition linked for the 256 byte-sized copy (newlib-nano memcpy):
08007a88 <memcpy>:
 8007a88:   440a        add r2, r1
 8007a8a:   4291        cmp r1, r2
 8007a8c:   f100 33ff   add.w   r3, r0, #4294967295
 8007a90:   d100        bne.n   8007a94 <memcpy+0xc>
 8007a92:   4770        bx  lr
 8007a94:   b510        push    {r4, lr}
 8007a96:   f811 4b01   ldrb.w  r4, [r1], #1
 8007a9a:   f803 4f01   strb.w  r4, [r3, #1]!
 8007a9e:   4291        cmp r1, r2
 8007aa0:   d1f9        bne.n   8007a96 <memcpy+0xe>
 8007aa2:   bd10        pop {r4, pc}

As you can see, apart from the difference in the function's address, there is no change in the actual copy logic.

Measurement setup:

  • Ensure the data and instruction caches are disabled, IRQs are disabled, and the DWT is enabled:
SCB->CSSELR = (0UL << 1) | 0UL;         // Level 1 data cache
    __DSB();

    SCB->CCR &= ~(uint32_t)SCB_CCR_DC_Msk;  // disable D-Cache
    __DSB();
    __ISB();

    SCB_DisableICache();

    if(DWT->CTRL & DWT_CTRL_NOCYCCNT_Msk)
    {
        //panic
        while(1);
    }

    /* Enable DWT unit */
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
    __DSB();

    /* Unlock DWT registers */
    DWT->LAR = 0xC5ACCE55;
    __DSB();

    /* Reset CYCCNT */
    DWT->CYCCNT = 0;

    /* Enable CYCCNT */
    DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;

    __disable_irq();

    __DSB();
    __ISB();
  • Link a single memcpy version under test into the code, along with a single byte-size step. Compile the code with -O0. Then measure the execution time as follows (note: the addresses of au8_dst and au8_src are always aligned):
uint8_t volatile au8_dst[MAX_BYTE_SIZE];
uint8_t volatile au8_src[MAX_BYTE_SIZE];

    __DSB();
    __ISB();

    u32_cyccntStart = DWT->CYCCNT;

    __DSB();
    __ISB();

    memcpy(au8_dst, au8_src, u32_size);

    __DSB();
    __ISB();

    u32_cyccntEnd = DWT->CYCCNT;

    __DSB();
    __ISB();

    *u32_cyccnt = u32_cyccntEnd - u32_cyccntStart;
  • Repeat this procedure for every combination of byte-size and memcpy version

Main question: How is it possible for the execution time of the newlib-nano memcpy to follow that of a naive word-wise copy implementation up to a copy size of 256 bytes, after which it performs similarly to a naive byte-wise copy implementation? Please keep in mind that the definition of the newlib-nano memcpy that gets pulled into the code is the same for every byte-size measurement, as demonstrated by the disassembly provided above. Is my measurement setup flawed in some obvious way that I have failed to recognize?

Any thoughts on this would be highly, highly appreciated!

Nuwanda
  • Why don't you write a multi-word version? Your implementation suffers from load delay and cache-miss penalty. – Jake 'Alquimista' LEE Sep 12 '22 at 15:08
  • Best to get functionality correct first. `while (upt_dest++ & !ALIGN_MASK) { *u8_dest++ = *u8_src++; size--; }` is UB if it loops more than original `size`. `upt_dest` and `size` are independent. – chux - Reinstate Monica Sep 12 '22 at 15:43
  • The address of the function can affect the performance, especially on high performance cores like arm. Where the tight loops land within the prefetch boundaries will determine performance. Now, having said that, at least the lesser of the cortex-ms fetch on a halfword or word, so with the word you can see the alignment performance, and for some of those cores that fetch size is determined by the chip company and not arm as one of their implementation choices. I don't remember about fetching on the m7 though. – old_timer Sep 12 '22 at 15:51
  • you do have chip side effects with respect to the flash, companies like st have a cache that you cannot turn off, others may as well. You would want to have both the code under test and the data under test in sram, as well as of course vary the alignments on both the source and destination addresses. also where and how you are reading the timer can affect the results, I suspect you are timing in C outside the code under test instead of as part of the code under test. – old_timer Sep 12 '22 at 15:54
  • for most of these the dwt timer gives you the same results as the systick timer, just possibly more work to get at the debug timer vs a simple ldr to get the systick timer. I gave up using the debug timer because it added no value, so I don't remember exactly the amount of work it takes to sample. – old_timer Sep 12 '22 at 15:55
  • Nuwanda, `upt_dest++ & !ALIGN_MASK` is suspicious. I'd expect `upt_dest++ & ~ALIGN_MASK`? `!` vs. `~`. What is the value of `ALIGN_MASK` in your tests? – chux - Reinstate Monica Sep 12 '22 at 16:01
  • See my answer: [memcpy completes after segfault](https://stackoverflow.com/a/73522003/5382650) for some tips about `memcpy` on arm. – Craig Estey Sep 12 '22 at 16:01
  • "Instruction fetches, identified by ARPROT[2], are always a 64 bit transfer size, and never locked or exclusive" from the trm, not sure if that is out of context, but if it is that large, then it would be trivial to test this on a cortex-m7, time permitting I may demonstrate that later. – old_timer Sep 12 '22 at 16:05
  • @chux-ReinstateMonica You are absolutely correct, this is a nasty error on my part. I will correct it ASAP. However, this error wasn't responsible for the behaviour I have observed, because au8_dst and au8_src always ended up being aligned, and since !ALIGN_MASK always evaluates to false, this alignment part of the code was never taken. But thanks for pointing this out to me! – Nuwanda Sep 13 '22 at 12:47
  • @old_timer Thank you very much for a ton of valuable insight! I have a lot to unpack & try things for myself based on all the insight you have provided me with, so please be a bit patient with my response. – Nuwanda Sep 13 '22 at 12:50
  • @Nuwanda `restrict` may be used incorrectly here. `memcpy_naive_wordwise(void *restrict dest, void *restrict src, size_t size) ... uint8_t *restrict u8_dest = dest,`. `restrict` implies that the pointer to data (and pointers derived from it and data indexed) do not overlap other pointed to data. Yet `dest` and `u8_dest`, both with `restrict`, point to the same data. At best, the `restrict` in `uint8_t *restrict u8_dest = dest,` is simply not needed (also in 3 more places). At worst, it may confuse the compiler and prevent it from emitting the most efficient code. I suspect the former. – chux - Reinstate Monica Sep 13 '22 at 13:04
  • @Nuwanda yeah, if you were not expecting this, this may be a shock. Benchmarks are benchmarks, their value is limited. What matters is how fast your code runs on your machine for your product. So while turning things on and off to tune the code is sometimes the right thing to do, its performance with all the normal settings for the product is what matters. – old_timer Sep 13 '22 at 16:06
  • I think what you are going to need to do is go backward and understand why these memcpy's are there, whether they needed to be there, and why they are a bottleneck. If they are still a problem after doing the system engineering and the analysis, then do they have to be C memcpy(), or can we add bytes to the length, manipulate the base address of the pointers, and use our own custom copy routine that takes advantage of the alignment and aligned length? If it is bad enough, do we have to run this code from sram? All of this should fall out of doing the system engineering. – old_timer Sep 13 '22 at 16:08
  • as far as generic benchmarks like this go, your spreadsheet needs to be much wider and deeper to accommodate alignment of the input pointers, alignment of the length, and if desired various caching, flash wait states, sram vs flash, etc. One number per copy length will not do for this platform, as the performance literally varies by large percentages for the same exact machine code, particularly if that machine code's location is determined by the toolchain and varies when the code changes. Compiler version and command line options can then amplify that with different machine code. – old_timer Sep 13 '22 at 16:10

1 Answer


As mentioned in the comments, it may be your alignment, which you need to take into account for performance tests. It can be the case that one memcpy solution vs another is hitting these fetch lines, as I call them.

An stm32 cortex-m7 part.

Code under test:

/* r0 count */
/* r1 timer address */
.thumb_func
.globl TEST
TEST:
    push {r4,r5}
    ldr r4,[r1]

loop:
    sub r0,#1
    bne loop

    ldr r5,[r1]
    sub r0,r4,r5
    pop {r4,r5}
    bx lr

Original alignment

08000100 <TEST>:
 8000100:   b430        push    {r4, r5}
 8000102:   680c        ldr r4, [r1, #0]

08000104 <loop>:
 8000104:   3801        subs    r0, #1
 8000106:   d1fd        bne.n   8000104 <loop>
 8000108:   680d        ldr r5, [r1, #0]
 800010a:   1b60        subs    r0, r4, r5
 800010c:   bc30        pop {r4, r5}
 800010e:   4770        bx  lr

The systick timer is used; no reason to use the debug timer, it adds no value.

ra=TEST(0x1000,STK_CVR);  hexstring(ra%0x00FFFFFF);
ra=TEST(0x1000,STK_CVR);  hexstring(ra%0x00FFFFFF);
ra=TEST(0x1000,STK_CVR);  hexstring(ra%0x00FFFFFF);
ra=TEST(0x1000,STK_CVR);  hexstring(ra%0x00FFFFFF);

First run

00001029 
00001006 
00001006 
00001006 

This is an stm32, so there is a flash cache that you cannot disable; you can see its effect above in the first run.

The loop is aligned as such

 8000104:   3801        subs    r0, #1
 8000106:   d1fd        bne.n   8000104 <loop>

Add a nop to move the loop a half word

08000100 <TEST>:
 8000100:   46c0        nop         ; (mov r8, r8)
 8000102:   b430        push    {r4, r5}
 8000104:   680c        ldr r4, [r1, #0]

08000106 <loop>:
 8000106:   3801        subs    r0, #1
 8000108:   d1fd        bne.n   8000106 <loop>
 800010a:   680d        ldr r5, [r1, #0]
 800010c:   1b60        subs    r0, r4, r5
 800010e:   bc30        pop {r4, r5}
 8000110:   4770        bx  lr

The whole test is the same machine code from one timer read to the other.

But the performance is dramatically different

00002013 
00002003 
00002003 
00002003 

Taking twice as long to execute.

If, as documented, the fetch is 64 bits, that is 4 instructions per fetch.

If I add one nop per test

00001028 
00001006 
00001006 
00001006 

00001027 
00001006 
00001006 
00001006 

00001026 
00001006 
00001006 
00001006 

I get three more that return 0x1000 and then...

08000100 <TEST>:
 8000100:   46c0        nop         ; (mov r8, r8)
 8000102:   46c0        nop         ; (mov r8, r8)
 8000104:   46c0        nop         ; (mov r8, r8)
 8000106:   46c0        nop         ; (mov r8, r8)
 8000108:   46c0        nop         ; (mov r8, r8)
 800010a:   b430        push    {r4, r5}
 800010c:   680c        ldr r4, [r1, #0]

0800010e <loop>:
 800010e:   3801        subs    r0, #1
 8000110:   d1fd        bne.n   800010e <loop>
 8000112:   680d        ldr r5, [r1, #0]
 8000114:   1b60        subs    r0, r4, r5
 8000116:   bc30        pop {r4, r5}
 8000118:   4770        bx  lr
 
00002010 
00002001 
00002001 
00002001 

You can run this in sram to avoid the cache, and do other things, but I expect that you will see the same effect as you hit boundaries that add an extra fetch to the loop. Clearly this is the best case, with one fetch for the whole loop and then sometimes two. Make the loop longer and it becomes N and then N+1 fetches, with a less severe ratio.

I also assume the systick here is the arm clock divided by two, which is perfectly fine for this kind of performance testing.

So it is quite possible that, due to the alignment of the two different functions, one is getting a performance hit from extra fetches and the other is not.

What I tend to do, as I did here, is turn the code under test into asm and put it in the bootstrap up near the front of the binary, so that any other code I add or remove does not affect the alignment. I can also wrap the timer around it and the loops in a very controlled manner, adding nops outside the timed area to move the alignment of the loops. If you have more than one loop in the code under test, you can add nops in the middle of the code under test to control the alignment of each of the loops.
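
If you prefer to stay in C rather than hand-placed asm, a rough sketch of the same idea (gcc specific, and only a sketch; memcpy_under_test is just a placeholder name) is to pin the function under test to a generous alignment so that unrelated code changes cannot move its loops relative to the fetch boundaries:

#include <stddef.h>
#include <stdint.h>

/* sketch only, gcc attribute: force the code under test onto a 64 byte
   boundary so adding or removing other code does not move the copy loop
   across a fetch line */
__attribute__((noinline, aligned(64)))
void *memcpy_under_test(void *restrict dest, const void *restrict src, size_t size)
{
    uint8_t *d = dest;
    const uint8_t *s = src;

    while (size--)
        *d++ = *s++;

    return dest;
}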

You will also want to play with the alignment of the data. I do not remember offhand how the cortex-ms handle unaligned accesses, if they support them at all; I assume they do, with a performance penalty.

I have demonstrated something similar to the above on MCUs, and it affects you here as well. The srams (normal sram memory, or cache memory for that matter) are not organized as bytes; they are at least 32 bits wide (or wider with ecc/parity). So a single byte write requires a read-modify-write, same for a halfword, but an aligned word write does not require that read. Often this is buried in the noise because you are not doing enough writes back to back to get back pressure from the sram control logic. But at least one MCU did actually document that you could/would see that performance hit, and I posted that at some point here on SO. You should also see this with unaligned word writes, which now need two read-modify-writes.

Obviously four byte store instructions take more time than one word store instruction.

I will just do it, why not:

/* r0 address */
/* r1 count */
/* r2 timer address */
.thumb_func
.globl swtest
swtest:
    push {r4,r5}
    ldr r4,[r2]
    
swloop:
    str r3,[r0]
    str r3,[r0]
    str r3,[r0]
    str r3,[r0]

    str r3,[r0]
    str r3,[r0]
    str r3,[r0]
    str r3,[r0]

    str r3,[r0]
    str r3,[r0]
    str r3,[r0]
    str r3,[r0]

    str r3,[r0]
    str r3,[r0]
    str r3,[r0]
    str r3,[r0]
    sub r1,#1
    bne swloop
    
    ldr r5,[r2]
    sub r0,r4,r5
    pop {r4,r5}
    bx lr


ra=swtest(0x20002000,0x1000,STK_CVR);  hexstring(ra%0x00FFFFFF);
ra=swtest(0x20002000,0x1000,STK_CVR);  hexstring(ra%0x00FFFFFF);
ra=swtest(0x20002000,0x1000,STK_CVR);  hexstring(ra%0x00FFFFFF);
ra=swtest(0x20002000,0x1000,STK_CVR);  hexstring(ra%0x00FFFFFF);

ra=swtest(0x20002002,0x1000,STK_CVR);  hexstring(ra%0x00FFFFFF);
ra=swtest(0x20002002,0x1000,STK_CVR);  hexstring(ra%0x00FFFFFF);
ra=swtest(0x20002002,0x1000,STK_CVR);  hexstring(ra%0x00FFFFFF);
ra=swtest(0x20002002,0x1000,STK_CVR);  hexstring(ra%0x00FFFFFF);

00012012 
0001200A 
0001200A 
0001200A 
0002FFFD 
0002FFFD 
0002FFFD 
0002FFFD 

Unaligned takes more than twice as long to execute.

Unfortunately you cannot control the addresses for a generic memcpy, so the addresses could be 0x1000 and 0x2001 and it is just going to be slow. But if the exercise here is because you have data that you need to copy often (and there is no DMA mechanism in the chip that makes it faster; remember DMA is not free, sometimes it is just a lazy approach that uses less code but runs slower, so understand the architecture), and if you can control things so that the addresses are word aligned and the amount of data is at least a whole number of words, then make your own copy routine, do not call it memcpy, and hand tune it.
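
As a sketch of what such a hand tuned routine could look like (my sketch, not a drop-in replacement; it assumes both pointers are already word aligned and the length is a whole number of words):

#include <stddef.h>
#include <stdint.h>

/* sketch: copy assuming 4 byte aligned pointers and nbytes % 4 == 0,
   unrolled four words per iteration so the compiler is free to use
   ldm/stm or ldrd/strd */
void *copy_words_aligned(void *restrict dest, const void *restrict src, size_t nbytes)
{
    uint32_t *restrict d = dest;
    const uint32_t *restrict s = src;
    size_t nwords = nbytes / 4;

    while (nwords >= 4) {
        d[0] = s[0];
        d[1] = s[1];
        d[2] = s[2];
        d[3] = s[3];
        d += 4;
        s += 4;
        nwords -= 4;
    }
    while (nwords--)
        *d++ = *s++;

    return dest;
}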


Edit, running from SRAM

for(rd=0;rd<8;rd++)
{
    rb=0x20002000;
    for(rc=0;rc<rd;rc++)
    {
        PUT32(rb,0x46c0); rb+=2; //46c0         nop         ; (mov r8, r8)
    }

    PUT32(rb,0xb430); rb+=2; // 800010a:    b430        push    {r4, r5}
    PUT32(rb,0x680c); rb+=2; // 800010c:    680c        ldr r4, [r1, #0]
                             //0800010e <loop>:
    PUT32(rb,0x3801); rb+=2; // 800010e:    3801        subs    r0, #1
    PUT32(rb,0xd1fd); rb+=2; // 8000110:    d1fd        bne.n   800010e <loop>
    PUT32(rb,0x680d); rb+=2; // 8000112:    680d        ldr r5, [r1, #0]
    PUT32(rb,0x1b60); rb+=2; // 8000114:    1b60        subs    r0, r4, r5
    PUT32(rb,0xbc30); rb+=2; // 8000116:    bc30        pop {r4, r5}
    PUT32(rb,0x4770); rb+=2; // 8000118:    4770        bx  lr
    PUT32(rb,0x46c0); rb+=2;
    PUT32(rb,0x46c0); rb+=2;
    PUT32(rb,0x46c0); rb+=2;
    PUT32(rb,0x46c0); rb+=2;
    PUT32(rb,0x46c0); rb+=2;
    PUT32(rb,0x46c0); rb+=2;

    ra=HOP(0x1000,STK_CVR,0x20002001);  hexstrings(rd); hexstring(ra%0x00FFFFFF);
    ra=HOP(0x1000,STK_CVR,0x20002001);  hexstrings(rd); hexstring(ra%0x00FFFFFF);
    ra=HOP(0x1000,STK_CVR,0x20002001);  hexstrings(rd); hexstring(ra%0x00FFFFFF);
    ra=HOP(0x1000,STK_CVR,0x20002001);  hexstrings(rd); hexstring(ra%0x00FFFFFF);

}


00000000 00001011 
00000000 00001006 
00000000 00001006 
00000000 00001006 
00000001 00002010 
00000001 00002003 
00000001 00002003 
00000001 00002003 
00000002 00001014 
00000002 00001006 
00000002 00001006 
00000002 00001006 
00000003 00001014 
00000003 00001006 
00000003 00001006 
00000003 00001006 
00000004 00001014 
00000004 00001006 
00000004 00001006 
00000004 00001006 
00000005 00002010 
00000005 00002001 
00000005 00002002 
00000005 00002002 
00000006 00001012 
00000006 00001006 
00000006 00001006 
00000006 00001006 
00000007 00001014 
00000007 00001006 
00000007 00001006 
00000007 00001006 

Now we are still seeing that cache-like effect. I do see that my CCR is 0x00040200 and I cannot disable it; I believe the m7 documentation says that you cannot.

Okay, the BTAC was being used, but setting bit 13 in the ACTLR changes it to static branch prediction. Now the times actually make more sense, from sram:

00000000 00004003 
00000000 00004003 
00000000 00004003 
00000000 00004003 
00000001 00005002 
00000001 00005002 
00000001 00005002 
00000001 00005002 
00000002 00004003 
00000002 00004003 
00000002 00004003 
00000002 00004003 
00000003 00004003 
00000003 00004003 
00000003 00004003 
00000003 00004003 
00000004 00004003 
00000004 00004003 
00000004 00004003 
00000004 00004003 
00000005 00005002 
00000005 00005002 
00000005 00005002 
00000005 00005002 
00000006 00004003 
00000006 00004003 
00000006 00004003 
00000006 00004003 
00000007 00004003 
00000007 00004003 
00000007 00004003 
00000007 00004003 

We do see the extra fetch line, but each run is consistent from sram.
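
For reference, the BTAC change mentioned above can be made from C roughly like this (a sketch assuming the CMSIS headers for the cortex-m7, which expose this register as SCnSCB->ACTLR):

/* sketch, CMSIS assumed: bit 13 of the auxiliary control register disables
   BTAC use so the core falls back to static branch prediction */
SCnSCB->ACTLR |= (1UL << 13);
__DSB();
__ISB();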

Flash also showed no variation from one test to another even though I know that st has a cache thing.

00010FFC 
00010FFC 
00010FFC 
00010FFC 

This performance for flash also feels right relative to running from sram; flash is slow and there is not much you can do about it, so the numbers above did seem strange. And this demonstrates how many traps you can fall into in performance testing, and why all benchmarks are b......t.

And since I am having so much fun with this answer, also note that unaligned reads are expected to take a performance hit as well, assuming the sram is 32 bits wide: it takes two sram bus cycles to read unaligned vs one cycle for aligned, and that should create back pressure if you hit it hard enough.

With BTAC disabled

ra=swtest(0x20002000,0x1000,STK_CVR);  hexstring(ra%0x00FFFFFF);
ra=swtest(0x20002000,0x1000,STK_CVR);  hexstring(ra%0x00FFFFFF);
ra=swtest(0x20002002,0x1000,STK_CVR);  hexstring(ra%0x00FFFFFF);
ra=swtest(0x20002002,0x1000,STK_CVR);  hexstring(ra%0x00FFFFFF);

ra=lwtest(0x20002000,0x1000,STK_CVR);  hexstring(ra%0x00FFFFFF);
ra=lwtest(0x20002000,0x1000,STK_CVR);  hexstring(ra%0x00FFFFFF);
ra=lwtest(0x20002002,0x1000,STK_CVR);  hexstring(ra%0x00FFFFFF);
ra=lwtest(0x20002002,0x1000,STK_CVR);  hexstring(ra%0x00FFFFFF);

store word aligned
00019FFE 
00019FFE 
store word unaligned
00030007 
00030007 
load word aligned
00020001 
00020001 
load word unaligned
0002A00C 
0002A00C 

So if your memcpy is from 0x1000 to 0x2002, or from 0x1001 to 0x2002, then even if you align one pointer up front and then do word based copies, you still get a performance hit. Which is why I mention that you need to try different alignments.
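
A rough sketch of what sweeping the alignments could look like (my code; measure_cycles() and record_result() stand in for whatever timing wrapper and result logging you already have):

#include <stdint.h>

#define MAX_SIZE   1024u   /* largest copy size under test        */
#define MAX_OFFSET 4u      /* try every byte offset within a word */

static uint8_t src_buf[MAX_SIZE + MAX_OFFSET];
static uint8_t dst_buf[MAX_SIZE + MAX_OFFSET];

/* placeholders for your own timing wrapper and result logging */
extern uint32_t measure_cycles(void *dst, const void *src, uint32_t size);
extern void record_result(uint32_t src_off, uint32_t dst_off,
                          uint32_t size, uint32_t cycles);

/* measure one size at every (source offset, destination offset) combination */
void sweep_alignments(uint32_t size)
{
    for (uint32_t src_off = 0; src_off < MAX_OFFSET; src_off++) {
        for (uint32_t dst_off = 0; dst_off < MAX_OFFSET; dst_off++) {
            uint32_t cycles = measure_cycles(dst_buf + dst_off,
                                             src_buf + src_off, size);
            record_result(src_off, dst_off, size, cycles);
        }
    }
}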

On one of your questions too: I remember that the full-sized arm memcpy from years ago, I think in newlib, had a few performance steps. For example, if the amount to copy was less than x they would just do a byte loop, done. Otherwise they would at least try to align one of the pointers: if it started at 0x1001 they would do one byte, one halfword, then a bunch of words or multiple words, then based on the length an extra halfword or byte at the end to finish. But that only works... if both pointers are aligned or misaligned the same way.

From your table it did not seem to me that you were taking all of these factors into account. You fell into the "benchmarks are b......t" trap, with one benchmark representing one piece of source code even though that core/chip/system can run that code in a different number of clocks, sometimes strictly as a result of the C compiler and linker and no other factors.

And again

beg=get_timer();
for(i = 0;i<1000;i++)
{
  memcpy(a, b, size);
}
end=get_timer();

amplifies your measurement error. The for loop calling memcpy is itself also subject to fetching and branch prediction. I hope you are not testing like this.

old_timer
  • Michael Abrash, Zen of assembly language. You can find free versions available. The 8088 was obsolete when the book was published, some folks discount it for that, what I read in that book decades ago I use almost every day, even today to create this rambling answer to this question. He later went on to help some big name companies make their products better. – old_timer Sep 12 '22 at 20:42