The question is answerable as more than a matter of opinion only because you have been specific about the target and toolchain. It is not possible to generalise (and never has been).
The GNU ARM toolchain uses the Newlib C library. Newlib is designed to be architecture-agnostic and portable. As such it is written in C rather than assembler, so its performance is determined by the code generation of the compiler, and in turn by the compiler options applied when the library is built. It is possible to build for a very specific ARM architecture, or for a more generic ARM instruction subset; that will affect performance too.
Moreover, Newlib itself can be built with various conditional-compilation options such as `PREFER_SIZE_OVER_SPEED` and `__OPTIMIZE_SIZE__`.
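To illustrate the kind of trade-off those flags steer, here is a minimal sketch (not Newlib's actual source) of a `memcpy`-style routine that selects a tiny byte loop when built for size, and a word-at-a-time loop otherwise:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch only: shows how a size-over-speed option such as
 * PREFER_SIZE_OVER_SPEED typically selects between a minimal byte loop
 * and a larger but faster word-at-a-time loop. */
void *my_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
#if defined(PREFER_SIZE_OVER_SPEED) || defined(__OPTIMIZE_SIZE__)
    /* Smallest possible code: one byte per iteration. */
    while (n--)
        *d++ = *s++;
#else
    /* Faster path: copy a word at a time when both pointers are
     * word-aligned. */
    if ((((uintptr_t)d | (uintptr_t)s) & (sizeof(uint32_t) - 1)) == 0) {
        while (n >= sizeof(uint32_t)) {
            *(uint32_t *)(void *)d = *(const uint32_t *)(const void *)s;
            d += sizeof(uint32_t);
            s += sizeof(uint32_t);
            n -= sizeof(uint32_t);
        }
    }
    /* Tail bytes, and the unaligned case, fall back to byte copies. */
    while (n--)
        *d++ = *s++;
#endif
    return dst;
}
```

The byte loop compiles to a handful of instructions; the word loop is several times larger but moves four bytes per iteration on aligned buffers.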
Now, if you are able (and have the time) to generate better ARM assembler code than the compiler, that is great, but such kung-fu coding skills are increasingly rare and, frankly, increasingly unnecessary. Do you have sufficient assembler expertise to beat the compiler? Do you have the time? And do you really want to do that for every architecture you might use? It may be a premature optimisation, and rather unproductive.
In some circumstances, on targets with the capability, it may be worthwhile setting up a memory-to-memory DMA transfer. The GNU ARM compiler will not generate DMA code because that is chip-vendor dependent and not part of the ARM architecture. However, `memcpy` is general purpose with respect to copy size, alignment and thread safety. Where DMA is optimal for specific circumstances, it is perhaps better to define a new, differently named routine and use it where it is needed, rather than to redefine `memcpy` and risk it being sub-optimal for small copies (which may predominate) or in multi-threaded applications.
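A differently named routine might look like the sketch below. Everything here is an assumption for illustration: `bulk_copy`, `dma_copy_blocking` and `DMA_THRESHOLD` are hypothetical names, and the DMA call is a placeholder for whatever memory-to-memory driver your chip vendor provides; the break-even size must be measured on the actual target.

```c
#include <stddef.h>
#include <string.h>

/* Assumed break-even size below which a CPU copy beats DMA set-up
 * overhead; this must be tuned per target. */
#define DMA_THRESHOLD 512

/* Placeholder for a vendor memory-to-memory DMA driver call (this is
 * NOT a real API). Falling back to memcpy keeps the sketch runnable. */
static void dma_copy_blocking(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Separately named bulk-copy routine: uses DMA only where it is likely
 * to win, leaving memcpy itself untouched for general use. */
void *bulk_copy(void *dst, const void *src, size_t n)
{
    if (n >= DMA_THRESHOLD)
        dma_copy_blocking(dst, src, n);  /* large blocks: DMA pays off */
    else
        memcpy(dst, src, n);             /* small blocks: CPU copy wins */
    return dst;
}
```

Because the routine has its own name, callers opt in explicitly, and the standard `memcpy` keeps its guarantees for small and concurrent copies.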
The implementation of `memcpy()` in Newlib, for example, can be seen here. It is a reasonably idiomatic implementation and therefore sympathetic to a typical compiler optimiser, which generally works best on idiomatic code. An alternative implementation might perform better in an un-optimised compilation, but if it is "unusual" the optimiser may not do as well with it. If you are writing it in assembler, you just have to be better than the compiler; you would be a rare, though not necessarily commercially valuable, commodity. That said, looking at this specific implementation, it does look far less efficient for large unaligned blocks in the speed-over-size build. It would be possible to improve that, perhaps at some small expense to the more common aligned copies.
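One common technique for the large-unaligned case (a sketch of the general idea, not Newlib's code) is to byte-copy until the destination is word-aligned, then move words using unaligned reads from the source; on ARMv6 and later the compiler can lower the 4-byte `memcpy` below to a single unaligned load:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch: speed up large copies where src and dst are mutually
 * misaligned by aligning the destination and doing aligned stores,
 * with the source read via a small memcpy that the compiler turns
 * into an unaligned load where the ISA allows it. */
void *memcpy_align_dst(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Head: byte copies until the destination is word-aligned. */
    while (n && ((uintptr_t)d & (sizeof(uint32_t) - 1))) {
        *d++ = *s++;
        n--;
    }

    /* Body: one aligned store per word, regardless of src alignment. */
    while (n >= sizeof(uint32_t)) {
        uint32_t w;
        memcpy(&w, s, sizeof w);          /* possibly unaligned load */
        *(uint32_t *)(void *)d = w;       /* aligned store */
        d += sizeof w;
        s += sizeof w;
        n -= sizeof w;
    }

    /* Tail: remaining bytes. */
    while (n--)
        *d++ = *s++;
    return dst;
}
```

The cost is a slightly longer prologue for already-aligned copies, which is the "small expense to more common aligned copies" trade-off mentioned above.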