
As we know, on a multi-byte-word machine such as x86/x86_64 it is more efficient to copy or move a large block of memory word by word (4 or 8 bytes per step) than to do so byte by byte.

I'm curious which way strncpy/memcpy/memmove actually do it, and how they deal with word alignment.

char buf_A[8], buf_B[8];

// I often want to code like this
*(double*)buf_A = *(double*)buf_B;

// instead of this
strcpy(buf_A, buf_B);
// but it worsens the readability of my code.
Leon
    The way `strcpy` and friends work is totally implementation dependent. Usually they just do the job in an efficient manner. Oh yes and `strcpy` can only be used for copying NUL terminated strings (read the chapter dealing with strings in your C text book). In your case you should use `memcpy`. – Jabberwocky Jan 22 '19 at 11:02
  • On most platforms the source code of `memcpy` etc. is available; have a look at it, but the code can be pretty surprising. These functions are also often hand written in assembly language. – Jabberwocky Jan 22 '19 at 11:05
  • Thanks, Jabberwocky! I'm using gcc 8.2.0 on x86_64. Would you please tell me where I can find the source of memcpy/strcpy? I have the entire source of gcc, but I don't know how to swim in it. It is an ocean indeed. – Leon Jan 22 '19 at 11:17
  • Unless you have a good reason to think otherwise, you should assume that the standard library implementor knows the target hardware and the library requirements better than you do. That's not a putdown, just a comment on expertise. – Pete Becker Jan 22 '19 at 14:19

6 Answers


In general, you don't have to think too much about how memcpy or other similar functions are implemented. You should assume they are efficient unless your profiling proves you wrong.

In practice it is indeed optimized nicely. See, for example, the following test code:

#include <cstring>

void test(char (&a)[8], char (&b)[8])
{
    std::memcpy(&a,&b,sizeof a);
}

Compiling it with g++ 7.3.0 using the command g++ test.cpp -O3 -S -masm=intel, we get the following assembly code:

test(char (&) [8], char (&) [8]):

    mov     rax, QWORD PTR [rsi]
    mov     QWORD PTR [rdi], rax
    ret

As you can see, the copy is not only inlined, but also collapsed into a single 8-byte read and write.

Ruslan

In this case you may prefer to use memcpy, as it is the equivalent of *(double*)buf_A = *(double*)buf_B; without the undefined behavior.
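
A minimal sketch of what that looks like with the asker's buffers (the function name copy_buf is just for illustration); with optimization on, this typically compiles to the same single 8-byte load and store shown in the answer above:

#include <cstring>

char buf_A[8], buf_B[8];

void copy_buf()
{
    // Well-defined for any trivially copyable bytes, unlike the double cast,
    // which breaks strict aliasing (and possibly alignment) rules.
    std::memcpy(buf_A, buf_B, sizeof buf_A);
}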

You should not worry about calling memcpy: by default the compiler assumes that a call to memcpy has the meaning defined in the C library. So, depending on the types of the arguments and/or knowledge of the copy size at compile time, the compiler may choose not to call the C library function and instead inline a more suitable memory-copy strategy. On gcc you can disable this behavior with the -fno-builtin compiler option.

This replacement of the memcpy call by the compiler is desirable because the library memcpy must check the size and alignment of the pointers at run time in order to pick the most efficient copy strategy (anything from copying small blocks char by char up to copying very large blocks with AVX-512 instructions, for example). Those checks, and the call itself, have a cost.

Also, if you are looking for efficiency, you should be concerned about memory alignment, so you may want to declare the alignment of your buffer:

alignas(8) char buf_A[8];
Oliv

From cppreference:

Copies count bytes from the object pointed to by src to the object pointed to by dest. Both objects are reinterpreted as arrays of unsigned char.

NOTES

std::memcpy is meant to be the fastest library routine for memory-to-memory copy. It is usually more efficient than std::strcpy, which must scan the data it copies, or std::memmove, which must take precautions to handle overlapping inputs.

Several C++ compilers transform suitable memory-copying loops to std::memcpy calls.

Where strict aliasing prohibits examining the same memory as values of two different types, std::memcpy may be used to convert the values.

So it should be the quickest way to copy data. Be aware, however, that there are several cases where the behavior is undefined:

If the objects overlap, the behavior is undefined.

If either dest or src is a null pointer, the behavior is undefined, even if count is zero.

If the objects are potentially-overlapping or not TriviallyCopyable, the behavior of memcpy is not specified and may be undefined.
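
To illustrate the strict-aliasing note in the quote, here is a minimal sketch (my own example, not taken from cppreference) of using std::memcpy to read a double's bit pattern:

#include <cstdint>
#include <cstring>

// Copies the object representation of a double into a same-sized integer.
// Well-defined, unlike *(std::uint64_t*)&d, which violates strict aliasing.
std::uint64_t bits_of(double d)
{
    static_assert(sizeof d == sizeof(std::uint64_t), "assumes a 64-bit double");
    std::uint64_t u;
    std::memcpy(&u, &d, sizeof u);
    return u;
}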

  • Thank you! But I still want to know which way std::memcpy do things in. Byte by byte, or word by word? – Leon Jan 22 '19 at 11:09
  • You copy a number of bytes. There is no guarantee of implementation. –  Jan 22 '19 at 11:11

Does strcpy/strncpy copy the data byte by byte, or in another, more efficient way?

Neither the C++ standard nor the C standard specifies exactly how strcpy/strncpy are implemented. They only describe the behaviour.

There are multiple standard library implementations, and each chooses how to implement its functions. It is possible to implement both of these using memcpy. The standards don't exactly describe the implementation of memcpy either, and the existence of multiple implementations applies to it just as well.

memcpy can be implemented to take advantage of full-word copies. A short pseudocode sketch of how memcpy could be implemented:

if len >= 2 * word size
    copy bytes until destination pointer is aligned to word boundary
    if len >= page size
        copy entire pages using virtual address manipulation
    copy entire words
copy the trailing bytes that are not aligned to word boundary
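
A minimal C++ sketch of the word-copy idea above (my own simplification, not any library's actual code: it skips the page-level trick and, like real implementations, relies on non-portable word-sized accesses that are fine on x86 but not guaranteed by the language):

#include <cstddef>
#include <cstdint>

// Simplified word-at-a-time copy: align the destination, move whole words,
// then finish with the trailing bytes. Real implementations also handle
// source misalignment specially and may switch to vector instructions.
void* word_copy(void* dst, const void* src, std::size_t len)
{
    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);
    constexpr std::size_t word = sizeof(std::uintptr_t);

    // Copy bytes until the destination pointer is word-aligned.
    while (len > 0 && reinterpret_cast<std::uintptr_t>(d) % word != 0) {
        *d++ = *s++;
        --len;
    }

    // Copy whole words. Note: if src has a different alignment than dst,
    // these are unaligned loads (acceptable on x86, not portable C++).
    while (len >= word) {
        *reinterpret_cast<std::uintptr_t*>(d) =
            *reinterpret_cast<const std::uintptr_t*>(s);
        d += word;
        s += word;
        len -= word;
    }

    // Copy the trailing bytes that don't fill a whole word.
    while (len > 0) {
        *d++ = *s++;
        --len;
    }
    return dst;
}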

To find out how a particular standard library implementation implements strcpy/strncpy/memcpy, you can read the source code of the standard library - if you have access to it.

Even further, when the length is known at compile time, the compiler might choose not to use the library memcpy at all, but instead do the copy inline. Whether your compiler has built-in definitions for standard library functions can be found in the documentation of the respective compiler.

eerorika

It depends on the compiler and the C run-time library you are using. In most cases, string.h functions like memcmp, memcpy, strcpy, memset etc. are implemented in assembly, optimized for the CPU.

You can find the GNU libc implementations of those functions for the AMD64 architecture. As you can see, they may use SSE or AVX instructions to copy 128 or 512 bits per iteration. Microsoft also bundles the source code of their CRT with Visual Studio (mostly the same approaches: MMX, SSE and AVX loops are supported).

The compiler also applies special optimizations to such functions; GCC calls them builtins, other compilers call them intrinsics. That is, the compiler may choose either to call the library function or to generate CPU-specific code that is optimal in the current context. For example, when the N argument of memcpy is a constant, i.e. memcpy(dst, src, 128), the compiler may generate inline code (something like mov rcx, 16 followed by rep movsq), and when it is a variable, i.e. memcpy(dst, src, bytes), it may emit a call to the library function (something like call _memcpy).
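
A small sketch to illustrate that distinction (the function names are mine; what actually gets emitted depends on the compiler, options and target):

#include <cstddef>
#include <cstring>

// Size known at compile time: the compiler is free to expand the copy
// inline, e.g. as a handful of mov instructions or a rep movs sequence.
void copy_fixed(char* dst, const char* src)
{
    std::memcpy(dst, src, 128);
}

// Size known only at run time: the compiler will typically emit a real
// call to the library memcpy, which dispatches on size and alignment.
void copy_variable(char* dst, const char* src, std::size_t n)
{
    std::memcpy(dst, src, n);
}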

Victor Gubin

I think all of the opinions and advice on this page are reasonable, but I decided to try a little experiment.

To my surprise, the fastest method isn't the one we expected theoretically.

I tried the following code.

#include <cstdlib>   // for EXIT_SUCCESS
#include <cstring>
#include <iostream>
#include <string>
#include <chrono>

using std::string;
using std::chrono::system_clock;

inline void mycopy( double* a, double* b, size_t s ) {
   while ( s > 0 ) {
      *a++ = *b++;
      --s;
   }
};

// to make sure that every bits have been changed
bool assertAllTrue( unsigned char* a, size_t s ) {
   unsigned char v = 0xFF;
   while ( s > 0 ) {
      v &= *a++;
      --s;
   }
   return v == 0xFF;
};

int main( int argc, char** argv ) {
   alignas( 16 ) char bufA[512], bufB[512];
   memset( bufB, 0xFF, 512 );  // to prevent strncpy from stopping prematurely
   system_clock::time_point startT;

   memset( bufA, 0, sizeof( bufA ) );
   startT = system_clock::now();
   for ( int i = 0; i < 1024 * 1024; ++i )
      strncpy( bufA, bufB, sizeof( bufA ) );
   std::cout << "strncpy:" << ( system_clock::now() - startT ).count()
             << ", AllTrue:" << std::boolalpha
             << assertAllTrue( ( unsigned char* )bufA, sizeof( bufA ) )
             << std::endl;

   memset( bufA, 0, sizeof( bufA ) );
   startT = system_clock::now();
   for ( int i = 0; i < 1024 * 1024; ++i )
      memcpy( bufA, bufB, sizeof( bufA ) );
   std::cout << "memcpy:" << ( system_clock::now() - startT ).count()
             << ", AllTrue:" << std::boolalpha
             << assertAllTrue( ( unsigned char* )bufA, sizeof( bufA ) )
             << std::endl;

   memset( bufA, 0, sizeof( bufA ) );
   startT = system_clock::now();
   for ( int i = 0; i < 1024 * 1024; ++i )
      memmove( bufA, bufB, sizeof( bufA ) );
   std::cout << "memmove:" << ( system_clock::now() - startT ).count()
             << ", AllTrue:" << std::boolalpha
             << assertAllTrue( ( unsigned char* )bufA, sizeof( bufA ) )
             << std::endl;

   memset( bufA, 0, sizeof( bufA ) );
   startT = system_clock::now();
   for ( int i = 0; i < 1024 * 1024; ++i )
      mycopy( ( double* )bufA, ( double* )bufB, sizeof( bufA ) / sizeof( double ) );
   std::cout << "mycopy:" << ( system_clock::now() - startT ).count()
             << ", AllTrue:" << std::boolalpha
             << assertAllTrue( ( unsigned char* )bufA, sizeof( bufA ) )
             << std::endl;

   return EXIT_SUCCESS;
}

The result (one of many similar results):

strncpy:52840919, AllTrue:true
memcpy:57630499, AllTrue:true
memmove:57536472, AllTrue:true
mycopy:57577863, AllTrue:true

It looks like:

  1. memcpy, memmove, and my own method give similar results;
  2. strncpy is the fastest, even faster than memcpy. What magic does it do?

Isn't that funny?

Leon
  • Now I can use strncpy without any worry! – Leon Jan 22 '19 at 15:25
  • "memcpy, memmove, and my own method give similar results" - that's because the compiler replaced your method with a call to the library memcpy during optimization. If you turn off optimization you'll see the difference. – Victor Gubin Sep 06 '19 at 11:27
  • Maybe the compiler just moved what you benched out of the timed region! Benchmarking is not easy. – Oliv Nov 07 '22 at 21:25
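
Following up on the last two comments, one common way to keep the copies inside the timed region is an empty inline-asm barrier. This is a GCC/Clang-specific sketch (the helper name escape is mine):

// Pretends to "use" the pointed-to buffer in a way the optimizer cannot see,
// so the preceding memcpy/strncpy cannot be removed or hoisted out of the loop.
inline void escape( void* p )
{
    asm volatile( "" : : "g"( p ) : "memory" );
}

// Usage inside each benchmark loop:
// for ( int i = 0; i < 1024 * 1024; ++i ) {
//    memcpy( bufA, bufB, sizeof( bufA ) );
//    escape( bufA );
// }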