Assume we want to copy `n` bytes of data from `void* src` to `void* dst`. It is well known that standard library implementations of `memcpy` are heavily optimized, using platform-dependent vectorized instructions and various other tricks to perform the copy as fast as possible.
Now assume that `p` bytes of data after `src + n` are readable and `p` bytes of data after `dst + n` are writable. Also assume that it is OK if arbitrary garbage is written to `[dst + n, dst + n + p)`.
Clearly, these assumptions widen the range of possible actions, potentially leading to an even faster `memcpy`. For example, we may copy a trailing portion of fewer than 16 bytes with a single pair of unaligned 128-bit instructions (one load plus one store) instead of a byte-by-byte tail loop. There may be other tricks that such extra assumptions allow.
```
      01234 .....           n
src:  abcdabcdabcdabcdabcdabcGARBAGEGA
      v                      v
dst:  ______(actual dst)_____(writbl)_
      |       block1        ||block2 |
```
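As a minimal sketch of the chunked-copy idea (the function name and structure are my own, not taken from any particular library): copy in fixed 16-byte chunks and let the final chunk overshoot into the writable slack. Compilers typically lower each fixed-size `memcpy` below to a single unaligned 128-bit load and store.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical memcpy variant: copies n bytes from src to dst, and is
// allowed to read up to 15 bytes past src + n and write up to 15 bytes
// of garbage past dst + n (the caller guarantees p >= 15 slack bytes).
void memcpy_with_slack(void* dst, const void* src, std::size_t n) {
    auto* d = static_cast<unsigned char*>(dst);
    auto* s = static_cast<const unsigned char*>(src);
    // Each iteration copies a full 16-byte chunk; the last chunk may
    // extend past n, which the relaxed contract explicitly permits.
    for (std::size_t i = 0; i < n; i += 16)
        std::memcpy(d + i, s + i, 16);
}
```

Note that there is no scalar tail at all: the branch-heavy handling of the final `n % 16` bytes, which a standard-conforming `memcpy` cannot avoid, disappears entirely.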
Note that these assumptions are actually rather practical when you need to append a sequence of strings into an allocated buffer whose capacity is enough to hold `p` + total string size bytes. For example, the following routine may occur somewhere in database internals:
You are given a binary string `char* dictionary` and an integer array `int* offsets`, a monotonic sequence of offsets into `dictionary`; together these two variables represent a dictionary of strings read from disk. You also have an integer array `int* indices` indicating the order in which the dictionary strings must be written to an output buffer `char* buffer`.
Using the technique described above, you may safely write each new string without caring about the garbage to its right, as it is going to be overwritten by the next string appended.
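A sketch of that routine under the stated assumptions (the helper name, the `count` parameter, and the convention that `offsets` has one extra end-of-dictionary entry are mine for illustration): each string is copied in 16-byte chunks, so up to 15 garbage bytes spill past its end, where they are either overwritten by the next string or absorbed by the buffer's trailing slack.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical sketch. The caller guarantees >= 15 readable bytes past
// the end of the dictionary and >= 15 writable bytes past the end of
// the meaningful output in buffer.
char* append_strings(char* buffer, const char* dictionary,
                     const int* offsets, const int* indices,
                     std::size_t count) {
    char* out = buffer;
    for (std::size_t k = 0; k < count; ++k) {
        const int i = indices[k];
        const char* s = dictionary + offsets[i];
        const std::size_t len =
            static_cast<std::size_t>(offsets[i + 1] - offsets[i]);
        // Chunked copy: may write up to 15 garbage bytes past out + len;
        // the next string (or the trailing slack) overwrites them.
        for (std::size_t pos = 0; pos < len; pos += 16)
            std::memcpy(out + pos, s + pos, 16);
        out += len;
    }
    return out;  // end of the meaningful output (garbage may follow)
}
```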
The questions are:
- Are there open-source implementations of this technique? Achieving an optimal implementation would clearly require spending a lot of time on (platform-dependent) tuning, so writing such code without studying existing implementations does not seem like a good idea.
- Why is readability of 15 bytes past an allocation not a feature of modern allocators? If a memory allocator simply allocated one extra uninitialized page in each `mmap` it performs internally, it would provide the desired readability at effectively zero cost, with no changes to program code.
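To illustrate the allocator point, here is a minimal sketch of what such zero-cost slack could look like (the function is hypothetical; real allocators such as jemalloc or tcmalloc do not expose this guarantee): round the request up to whole pages and map one extra page, so reads a few bytes past the nominal end never fault.

```cpp
#include <cstddef>
#include <sys/mman.h>  // POSIX mmap; Linux/macOS only
#include <unistd.h>

// Hypothetical allocator primitive: the returned block has `size` usable
// bytes plus at least one full page of readable (here also writable,
// zero-initialized) slack after them, at no extra syscall cost.
void* alloc_with_slack(std::size_t size) {
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    const std::size_t total = ((size + page - 1) / page + 1) * page;
    void* p = mmap(nullptr, total, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? nullptr : p;
}
```

An allocator that wanted to guarantee only readability could instead map the slack page `PROT_READ`, so accidental writes past the end would still fault.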
Final remark: this idea is not new; for example, it appears in the source code of ClickHouse. Still, they implemented their own custom templated POD array to handle such allocations.