Recently I was profiling an application and noticed that the assembly implementation of memcpy_s behaves strangely. I'm talking about the implementation residing in Microsoft Visual Studio 14.0\VC\crt\src\i386\memcpy.asm. Execution reaches CopyUpLargeMov:, where I expect it to choose the SSE2 path or some other available optimized implementation. The code is as follows:

    CopyUpLargeMov:
        bt      __favor, __FAVOR_ENFSTRG        ; check if Enhanced Fast Strings is supported
        jnc     CopyUpSSE2Check                 ; if not, check for SSE2 support
        rep     movsb
        mov     eax,[esp + 0Ch]                 ; return original destination pointer
        pop     esi
        pop     edi
        M_EXIT

No matter how I tweak the optimization settings, execution never reaches CopyUpSSE2Check.
Tested with Release|Win32, VS2015 Upd3, Windows10 x64.

The actual C++ code:

    std::vector<uint8_t> src(1024*1024*20,0);
    std::vector<uint8_t> dst(1024*1024*20,0);
    for (auto i = 0ul; i < 1000; ++i)
    {
        memcpy_s(dst.data(), dst.size(), src.data(), src.size());
    }

Any ideas?

EDIT001:
It seems that x64 does not exhibit the strange behavior; it falls into the Enhanced Fast Strings optimization part of the code. Maybe the above is an x86 limitation?

kreuzerkrieg
  • [OT] Just a FYI: MSVS has done some pretty good work with optimizing `vector` and if the data is a POD type it should be using `memxxx` functions internally. I would think `dst = src` would be just as good here and maybe better. – NathanOliver Jan 16 '17 at 15:36
  • I took vector just for convenience, in real code it is `uint8_t*` to `uint8_t*`, but vector is good enough for memcpy_s to exhibit the same odd behaviour – kreuzerkrieg Jan 16 '17 at 15:43
  • Note that 64 bit is always guaranteed to have SSE2 as part of the architecture, so no checks needed. – Jester Jan 16 '17 at 15:45
  • Already noticed that, see EDIT001 – kreuzerkrieg Jan 16 '17 at 15:46
  • Does your cpu have Intel's fast string operations? If it does, the `rep movsb` may be *faster* than SSE2. – EOF Jan 16 '17 at 15:52
  • You noticed that it doesn't check, I just pointed out **why** it doesn't need to. It wasn't clear whether you knew that or not. – Jester Jan 16 '17 at 15:57
  • @EOF The latest and the greatest i7, so it does. Actually it does use EFS for x64; the question is why the x86 version is so unoptimized? – kreuzerkrieg Jan 16 '17 at 17:34
  • @Jester, got your point. However, it does check for something besides EFS, but I don't remember exactly what; I will check it tomorrow – kreuzerkrieg Jan 16 '17 at 17:44
  • @kreuzerkrieg: It's **not unoptimized**. The function tests for *fast hardware string copy*. If the hardware does *not* support fast `rep movsb`, it *falls back to SSE*. – EOF Jan 16 '17 at 18:18
  • @EOF, you pinpointed the gap in my knowledge of modern assembly: I didn't know what `rep movsb` means. It actually is the EFS path, and everything works as expected. So I just got it wrong. Would you like to convert your comment to an answer so I can accept it? – kreuzerkrieg Jan 16 '17 at 18:54
  • Nah, you can self-answer this one. – EOF Jan 16 '17 at 19:00
  • @Jester, for the record, x64 `memcpy` checks as follows: it first tries EFS, and if that is not available it goes for SSE without a check, as you said. – kreuzerkrieg Jan 17 '17 at 06:31

1 Answer

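As @EOF pointed out in his comment, the rep movsb is the optimization. It is the so-called "Enhanced Fast Strings" path, in which the CPU itself moves the data from the source string to the destination string. The SSE2 check is only reached when that feature is not available. So I simply overlooked it; memcpy is working as it is expected to work.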
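To make the order of the checks concrete, here is a minimal C++ sketch of the same decision. It is not the CRT's code. The names `CopyPath`, `CpuFeatures`, `query_cpu_features` and `choose_copy_path` are hypothetical. The `Generic` fallback is my assumption, since the snippet in the question does not show what happens when SSE2 is missing too. It also assumes that the `__FAVOR_ENFSTRG` bit corresponds to the "Enhanced REP MOVSB/STOSB" CPUID feature flag (leaf 7, sub-leaf 0, EBX bit 9); SSE2 is reported in leaf 1, EDX bit 26.

```cpp
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>   // __cpuid, __cpuidex (MSVC intrinsics)
#define HAVE_X86_CPUID 1
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>    // __get_cpuid_max, __cpuid, __cpuid_count (GCC/Clang)
#define HAVE_X86_CPUID 1
#else
#define HAVE_X86_CPUID 0
#endif

// Hypothetical names for illustration; these are not CRT symbols.
enum class CopyPath { RepMovsb, Sse2, Generic };

struct CpuFeatures {
    bool erms;  // Enhanced REP MOVSB/STOSB: CPUID.(EAX=7,ECX=0):EBX bit 9
    bool sse2;  // SSE2:                     CPUID.(EAX=1):EDX bit 26
};

// Reads the two feature bits with CPUID. On non-x86 builds both are
// reported as absent.
inline CpuFeatures query_cpu_features() {
    CpuFeatures f = {false, false};
#if HAVE_X86_CPUID
#if defined(_MSC_VER)
    int r[4] = {0, 0, 0, 0};
    __cpuid(r, 0);                       // r[0] = highest basic leaf
    const unsigned max_leaf = static_cast<unsigned>(r[0]);
    if (max_leaf >= 1) {
        __cpuid(r, 1);
        f.sse2 = ((static_cast<unsigned>(r[3]) >> 26) & 1u) != 0;
    }
    if (max_leaf >= 7) {
        __cpuidex(r, 7, 0);
        f.erms = ((static_cast<unsigned>(r[1]) >> 9) & 1u) != 0;
    }
#else
    unsigned a = 0, b = 0, c = 0, d = 0;
    const unsigned max_leaf = __get_cpuid_max(0, nullptr);
    if (max_leaf >= 1) {
        __cpuid(1, a, b, c, d);
        f.sse2 = ((d >> 26) & 1u) != 0;
    }
    if (max_leaf >= 7) {
        __cpuid_count(7, 0, a, b, c, d);
        f.erms = ((b >> 9) & 1u) != 0;
    }
#endif
#endif
    return f;
}

// Same order as the assembly in the question: fast strings first,
// then SSE2. The Generic fallback is an assumption.
inline CopyPath choose_copy_path(bool erms, bool sse2) {
    if (erms) return CopyPath::RepMovsb;  // the `rep movsb` branch
    if (sse2) return CopyPath::Sse2;      // the CopyUpSSE2Check branch
    return CopyPath::Generic;
}
```

Calling `choose_copy_path` with the result of `query_cpu_features()` on a recent Intel CPU should select `CopyPath::RepMovsb`, which matches what I saw in the debugger: the SSE2 branch is never reached because the first check already succeeds.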

kreuzerkrieg