The question is answerable as more than a matter of opinion only because you have been specific about the target and toolchain. It is not possible to generalise (and never has been).
The GNU ARM toolchain uses the Newlib C library. Newlib is designed to be architecture-agnostic and portable. As such it is written in C rather than assembler, so its performance is determined by the code generation of the compiler, and in turn by the compiler options applied when the library is built. It is possible to build for a very specific ARM architecture, or for a more generic ARM instruction subset; that will affect performance too.
Moreover, Newlib itself can be built with various conditional-compilation options such as `PREFER_SIZE_OVER_SPEED` and `__OPTIMIZE_SIZE__`.
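To illustrate the kind of trade-off those flags steer, here is a minimal sketch (not Newlib's actual source) of a `memcpy`-style routine that selects a tiny byte loop when built for size, and a word-at-a-time loop otherwise:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch only: shows how a size-over-speed option such as
 * PREFER_SIZE_OVER_SPEED typically selects between a minimal byte loop
 * and a larger but faster word-at-a-time loop. */
void *my_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
#if defined(PREFER_SIZE_OVER_SPEED) || defined(__OPTIMIZE_SIZE__)
    /* Smallest possible code: one byte per iteration. */
    while (n--)
        *d++ = *s++;
#else
    /* Faster path: copy a word at a time when both pointers are
     * word-aligned. */
    if ((((uintptr_t)d | (uintptr_t)s) & (sizeof(uint32_t) - 1)) == 0) {
        while (n >= sizeof(uint32_t)) {
            *(uint32_t *)(void *)d = *(const uint32_t *)(const void *)s;
            d += sizeof(uint32_t);
            s += sizeof(uint32_t);
            n -= sizeof(uint32_t);
        }
    }
    /* Tail bytes, and the unaligned case, fall back to byte copies. */
    while (n--)
        *d++ = *s++;
#endif
    return dst;
}
```

The byte loop compiles to a handful of instructions; the word loop is several times larger but moves four bytes per iteration on aligned buffers.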
Now, if you are able (and have the time) to generate better ARM assembler code than the compiler, that is great, but such kung-fu coding skills are increasingly rare and, frankly, increasingly unnecessary. Do you have sufficient assembler expertise to beat the compiler? Do you have the time? And do you really want to do that for every architecture you might use? It may be a premature optimisation, and rather unproductive.
In some circumstances, on targets with the capability, it may be worthwhile setting up a memory-to-memory DMA transfer. The GNU ARM compiler will not generate DMA code because that is chip-vendor dependent and not part of the ARM architecture. However, `memcpy` is general purpose with respect to copy size, alignment and thread safety. Where DMA is optimal for specific circumstances, it is perhaps better to define a new, differently named routine and use it where it is needed, rather than to redefine `memcpy` and risk it being sub-optimal for small copies (which may predominate) or in multi-threaded applications.
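A differently named routine might look like the sketch below. Everything here is an assumption for illustration: `bulk_copy`, `dma_copy_blocking` and `DMA_THRESHOLD` are hypothetical names, and the DMA call is a placeholder for whatever memory-to-memory driver your chip vendor provides; the break-even size must be measured on the actual target.

```c
#include <stddef.h>
#include <string.h>

/* Assumed break-even size below which a CPU copy beats DMA set-up
 * overhead; this must be tuned per target. */
#define DMA_THRESHOLD 512

/* Placeholder for a vendor memory-to-memory DMA driver call (this is
 * NOT a real API). Falling back to memcpy keeps the sketch runnable. */
static void dma_copy_blocking(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Separately named bulk-copy routine: uses DMA only where it is likely
 * to win, leaving memcpy itself untouched for general use. */
void *bulk_copy(void *dst, const void *src, size_t n)
{
    if (n >= DMA_THRESHOLD)
        dma_copy_blocking(dst, src, n);  /* large blocks: DMA pays off */
    else
        memcpy(dst, src, n);             /* small blocks: CPU copy wins */
    return dst;
}
```

Because the routine has its own name, callers opt in explicitly, and the standard `memcpy` keeps its guarantees for small and concurrent copies.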
The implementation of `memcpy()` in Newlib, for example, can be seen here. It is a reasonably idiomatic implementation and therefore sympathetic to a typical compiler optimiser, which generally works best on idiomatic code. An alternative implementation might perform better in an un-optimised compilation, but if it is "unusual" the optimiser may not do as well with it. If you are writing it in assembler, you just have to be better than the compiler; you would be a rare, though not necessarily commercially valuable, commodity. That said, looking at this specific implementation, it does look far less efficient for large unaligned blocks in the speed-over-size build. It would be possible to improve that, perhaps at some small expense to the more common aligned copies.
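One common technique for the large-unaligned case (a sketch of the general idea, not Newlib's code) is to byte-copy until the destination is word-aligned, then move words using unaligned reads from the source; on ARMv6 and later the compiler can lower the 4-byte `memcpy` below to a single unaligned load:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch: speed up large copies where src and dst are mutually
 * misaligned by aligning the destination and doing aligned stores,
 * with the source read via a small memcpy that the compiler turns
 * into an unaligned load where the ISA allows it. */
void *memcpy_align_dst(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Head: byte copies until the destination is word-aligned. */
    while (n && ((uintptr_t)d & (sizeof(uint32_t) - 1))) {
        *d++ = *s++;
        n--;
    }

    /* Body: one aligned store per word, regardless of src alignment. */
    while (n >= sizeof(uint32_t)) {
        uint32_t w;
        memcpy(&w, s, sizeof w);          /* possibly unaligned load */
        *(uint32_t *)(void *)d = w;       /* aligned store */
        d += sizeof w;
        s += sizeof w;
        n -= sizeof w;
    }

    /* Tail: remaining bytes. */
    while (n--)
        *d++ = *s++;
    return dst;
}
```

The cost is a slightly longer prologue for already-aligned copies, which is the "small expense to more common aligned copies" trade-off mentioned above.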