Assume we want to copy `n` bytes of data from `void* src` to `void* dst`. It is well known that standard library implementations of `memcpy` are heavily optimized, using platform-dependent vectorized instructions and various other tricks to perform the copy as fast as possible.
Now assume that `p` bytes of data after `src + n` are readable and `p` bytes of data after `dst + n` are writable. Also assume that it is OK if arbitrary garbage is written to `[dst + n, dst + n + p)`.
Clearly, these assumptions widen the range of possible actions, potentially leading to an even faster `memcpy`. For example, we may copy a trailing portion of fewer than 16 bytes with a single pair of unaligned 128-bit instructions (one load plus one store) instead of a byte-by-byte tail loop. There may be other tricks that such extra assumptions allow.
```
      01234 .....           n
src:  abcdabcdabcdabcdabcdabcGARBAGEGA
      v                      v
dst:  ______(actual dst)_____(writbl)_
      |       block1        ||block2 |
```
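As a minimal sketch of the chunked-copy idea (the function name and structure are my own, not taken from any particular library): copy in fixed 16-byte chunks and let the final chunk overshoot into the writable slack. Compilers typically lower each fixed-size `memcpy` below to a single unaligned 128-bit load and store.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical memcpy variant: copies n bytes from src to dst, and is
// allowed to read up to 15 bytes past src + n and write up to 15 bytes
// of garbage past dst + n (the caller guarantees p >= 15 slack bytes).
void memcpy_with_slack(void* dst, const void* src, std::size_t n) {
    auto* d = static_cast<unsigned char*>(dst);
    auto* s = static_cast<const unsigned char*>(src);
    // Each iteration copies a full 16-byte chunk; the last chunk may
    // extend past n, which the relaxed contract explicitly permits.
    for (std::size_t i = 0; i < n; i += 16)
        std::memcpy(d + i, s + i, 16);
}
```

Note that there is no scalar tail at all: the branch-heavy handling of the final `n % 16` bytes, which a standard-conforming `memcpy` cannot avoid, disappears entirely.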
Note that these assumptions are actually rather practical when you need to append a sequence of strings into an allocated buffer whose capacity is enough to hold `p` + total string size bytes. For example, the following routine may occur somewhere in database internals:
You are given a binary string `char* dictionary` and an integer array `int* offsets`, a monotonic sequence of offsets into `dictionary`; together these two variables represent a dictionary of strings read from disk. You also have an integer array `int* indices` indicating the order in which the dictionary strings must be written to an output buffer `char* buffer`.
Using the technique described above, you may safely write each new string without caring about the garbage to its right, as it is going to be overwritten by the next string appended.
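A sketch of that routine under the stated assumptions (the helper name, the `count` parameter, and the convention that `offsets` has one extra end-of-dictionary entry are mine for illustration): each string is copied in 16-byte chunks, so up to 15 garbage bytes spill past its end, where they are either overwritten by the next string or absorbed by the buffer's trailing slack.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical sketch. The caller guarantees >= 15 readable bytes past
// the end of the dictionary and >= 15 writable bytes past the end of
// the meaningful output in buffer.
char* append_strings(char* buffer, const char* dictionary,
                     const int* offsets, const int* indices,
                     std::size_t count) {
    char* out = buffer;
    for (std::size_t k = 0; k < count; ++k) {
        const int i = indices[k];
        const char* s = dictionary + offsets[i];
        const std::size_t len =
            static_cast<std::size_t>(offsets[i + 1] - offsets[i]);
        // Chunked copy: may write up to 15 garbage bytes past out + len;
        // the next string (or the trailing slack) overwrites them.
        for (std::size_t pos = 0; pos < len; pos += 16)
            std::memcpy(out + pos, s + pos, 16);
        out += len;
    }
    return out;  // end of the meaningful output (garbage may follow)
}
```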
The questions are:
- Are there open-source implementations of this technique? Achieving an optimal implementation would clearly require spending a lot of time on (platform-dependent) tuning, so writing such code without studying existing implementations does not seem like a good idea.
- Why is readability of 15 bytes past an allocation not a feature of modern allocators? If a memory allocator simply allocated one extra uninitialized page in each `mmap` it performs internally, it would provide the desired readability at effectively zero cost, with no changes to program code.
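To illustrate the allocator point, here is a minimal sketch of what such zero-cost slack could look like (the function is hypothetical; real allocators such as jemalloc or tcmalloc do not expose this guarantee): round the request up to whole pages and map one extra page, so reads a few bytes past the nominal end never fault.

```cpp
#include <cstddef>
#include <sys/mman.h>  // POSIX mmap; Linux/macOS only
#include <unistd.h>

// Hypothetical allocator primitive: the returned block has `size` usable
// bytes plus at least one full page of readable (here also writable,
// zero-initialized) slack after them, at no extra syscall cost.
void* alloc_with_slack(std::size_t size) {
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    const std::size_t total = ((size + page - 1) / page + 1) * page;
    void* p = mmap(nullptr, total, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? nullptr : p;
}
```

An allocator that wanted to guarantee only readability could instead map the slack page `PROT_READ`, so accidental writes past the end would still fault.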
Final remark: this idea is not new; for example, it appears in the source code of ClickHouse. Still, they implemented their own custom templated POD array to handle such allocations.