This may seem like a stupid/obvious question to some of you, but I'm still learning so please be gentle haha.

I'm writing an application without the CRT, so I have to implement my own memcpy function. After getting everything working, I noticed the application was significantly slower than its CRT counterpart. After a while I tracked it down to my custom memcpy function.

void* _memcpy(void* destination, const void* source, size_t num)
{
    char* d = (char*)destination;
    const char* s = (const char*)source;  // source is only read, so keep it const
    // Copies one byte per iteration.
    while (num--)
        *d++ = *s++;
    return destination;
}

My friend told me this was a complete sh*t implementation, so I'm posting here to ask how I could at least improve it to match the performance of its CRT counterpart, and also to get an explanation of why it's so slow.

Meme Machine
  • 3
    @MemeMachine -- The maintainers of the library version spent a good amount of time making sure the memcpy is optimized, possibly writing it in assembly language and/or usage of intrinsics. There is little chance you're going to outdo or equal what is already provided. The reason why memcpy is "superoptimized" is that this function is literally the backbone in making software "fast", thus it is highly important to squeeze all the performance out of it. – PaulMcKenzie Feb 03 '21 at 19:55
  • Everything here that doesn't understand that the compiler knows this is a memcpy is wrong. @M.M has the only answer worth looking at. Optimizing for a non-optimized build is a waste of time unless you actively need to improve debugging performance (which is likely not the case here) – xaxxon Feb 03 '21 at 20:00
  • Look at one of the open-source implementations; it's a good bet they use compiler intrinsics. – Richard Critten Feb 03 '21 at 20:00
  • You need to look and see what your compiler actually produces in an optimized build to determine if your implementation is good/bad. You didn't even say what compiler you're using so we can't reproduce it ourselves. – xaxxon Feb 03 '21 at 20:01
  • 3
    @MemeMachine -- Just because you say you don't want to use the CRT doesn't mean you can't take a look and see how the CRT implements the function. Then you can use that as a guide in creating your own, instead of trying to do this from scratch and hoping whatever you coded is fast enough. If you had done that, your friend wouldn't have had anything to complain about. – PaulMcKenzie Feb 03 '21 at 20:04
  • @PaulMcKenzie Do you suggest building with the CRT memcpy, disassembling it and then implementing? – Meme Machine Feb 03 '21 at 20:06
  • Note that this implementation is going to be problematic if the ranges overlap. – Joseph Larson Feb 03 '21 at 20:06
  • 2
    @JosephLarson does memcpy deal with that? I didn't think so – xaxxon Feb 03 '21 at 20:07
  • @MemeMachine The memcpy source code should be available to you. But yes, one way to see what memcpy does is to write a two line program and step into the function using the debugger. – PaulMcKenzie Feb 03 '21 at 20:07
  • related: [How does the internal implementation of memcpy work?](https://stackoverflow.com/questions/17498743/how-does-the-internal-implementation-of-memcpy-work) – francesco Feb 03 '21 at 20:09
  • 1
    An optimal solution is also going to be CPU architecture dependent. See this article: http://www.danielvik.com/2010/02/fast-memcpy-in-c.html – Joseph Larson Feb 03 '21 at 20:10
  • @JosephLarson -- `memcpy` has a precondition that the two ranges do not overlap, so that’s not an issue for this implementation. `memmove` has to deal correctly with overlapping ranges. – Pete Becker Feb 03 '21 at 20:49
  • @M.M in my experiments, it does not happen. – SergeyA Feb 03 '21 at 21:59

1 Answer

First things first. Computers handle data in words. A typical word is 4 or 8 bytes long (except on some 8-bit micros). If you copy a word at a time instead of a byte at a time, the loop needs far fewer loads and stores, so things will usually be much faster.

There are complications though. Many processors don't like misaligned access, so each word copy should be on a word boundary.
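
As a rough illustration of the two points above, here is a minimal sketch of a word-at-a-time copy. The name copy_words is made up for this example, and it assumes that size_t is the same width as a machine word, which is common but not guaranteed. It copies leading bytes until the pointers are word-aligned, then whole words, then the leftover bytes. If the source and destination are misaligned by different amounts, it simply falls back to the byte loop. Note also that reading and writing char buffers through size_t pointers is not strictly conforming C (strict aliasing), so a real implementation would typically be built with something like -fno-strict-aliasing or use a compiler-specific attribute; treat this as a sketch rather than a drop-in replacement.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for illustration; not taken from any CRT or from newlib. */
void *copy_words(void *destination, const void *source, size_t num)
{
    unsigned char *d = (unsigned char *)destination;
    const unsigned char *s = (const unsigned char *)source;
    const uintptr_t mask = sizeof(size_t) - 1;   /* assumes size_t is word-sized */

    /* Word copies are only safe here if both pointers are misaligned by the
       same amount, so that aligning one also aligns the other. */
    if ((((uintptr_t)d ^ (uintptr_t)s) & mask) == 0) {
        /* Head: byte copies until d (and therefore s) is word-aligned. */
        while (num && ((uintptr_t)d & mask)) {
            *d++ = *s++;
            num--;
        }

        /* Body: one aligned word per iteration. */
        size_t *dw = (size_t *)d;
        const size_t *sw = (const size_t *)s;
        while (num >= sizeof(size_t)) {
            *dw++ = *sw++;
            num -= sizeof(size_t);
        }
        d = (unsigned char *)dw;
        s = (const unsigned char *)sw;
    }

    /* Tail: remaining bytes (or the whole buffer if alignments differ). */
    while (num--)
        *d++ = *s++;

    return destination;
}
```

Like memcpy, this sketch assumes the two ranges do not overlap. Whether it actually beats the byte loop depends on the compiler and optimization level, so it is worth comparing the generated code in an optimized build.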

Other optimizations might include prefetching data, but these start to become more complicated.

Take a look at newlib-nano's implementation for inspiration: https://github.com/eblot/newlib/blob/master/newlib/libc/string/memcpy.c

doron