Imagine that I am implementing insertion and deletion for a small vector. (If this is C++, assume further that the vector elements are trivially copyable.)
When inserting into the middle of this vector (assuming that I have ascertained that no reallocation is necessary), I know that the copy to make space for a new element must move bytes to higher addresses. Similarly, when implementing erase in the middle of this vector, I know that the copy to eliminate the erased object must move bytes to lower addresses.
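To make the two directions concrete, here is a minimal sketch of what I have in mind. The names insert_at and erase_at, and the raw pointer plus size interface, are made up for illustration and are not from any real library. Insertion walks from the back so that no element is overwritten before it has been copied; erasure walks from the front for the same reason.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: insert `value` at index `pos`.
// Assumes the buffer has room for at least size + 1 elements
// (i.e. no reallocation is needed) and that pos <= size.
template <class T>
void insert_at(T* data, std::size_t size, std::size_t pos, T value) {
    assert(pos <= size);
    // Elements move to HIGHER addresses, so copy from the back
    // towards pos; copying front-to-back would clobber the source.
    for (std::size_t i = size; i > pos; --i) {
        data[i] = data[i - 1];
    }
    data[pos] = value;
}

// Hypothetical helper: erase the element at index `pos` (pos < size).
template <class T>
void erase_at(T* data, std::size_t size, std::size_t pos) {
    assert(pos < size);
    // Elements move to LOWER addresses, so copy front-to-back.
    for (std::size_t i = pos; i + 1 < size; ++i) {
        data[i] = data[i + 1];
    }
}
```

In both helpers the caller is responsible for updating the stored size afterwards. In neither case does the direction have to be discovered at run time; it is fixed by the operation.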
memmove will sort this out, but it will spend time comparing the supplied addresses so as to choose a 'move up' or 'move down' loop. I expect my vectors to be quite small. (In reality they are the buckets in an open-addressing, linear-probing, Robin Hood hash table.) Thus I am interested in optimizing the entire data move operation. My question is: can I eliminate that initial memmove start-up overhead? Ideally, I would like to achieve such an optimization across the big three platforms (Windows, Mac and Linux).
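For reference, the baseline I would like to improve on looks roughly like the sketch below. The function names insert_memmove and erase_memmove are hypothetical, and the static_assert simply encodes the trivially-copyable assumption stated above. The same std::memmove call serves both operations, which is exactly why it has to work out the direction for itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical baseline: shift the tail up by one slot with memmove.
// Assumes room for size + 1 elements and pos <= size.
template <class T>
void insert_memmove(T* data, std::size_t size, std::size_t pos, T value) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "memmove is only valid for trivially copyable types");
    assert(pos <= size);
    // Destination is above the source; memmove detects this itself.
    std::memmove(data + pos + 1, data + pos, (size - pos) * sizeof(T));
    data[pos] = value;
}

// Hypothetical baseline: shift the tail down by one slot with memmove.
template <class T>
void erase_memmove(T* data, std::size_t size, std::size_t pos) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "memmove is only valid for trivially copyable types");
    assert(pos < size);
    // Destination is below the source; again memmove works this out.
    std::memmove(data + pos, data + pos + 1, (size - pos - 1) * sizeof(T));
}
```

I have not measured how much of the cost at these sizes is the direction check as opposed to the call itself. I am also aware that an optimizing compiler may turn a hand-written copy loop back into a library call, so any replacement would need to be checked in the generated code on each platform.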