It's hard to give a complete answer, but a few observations:
- (Commentary) I assume the rand() is only a placeholder for an external 50/50 decision, not intended for productive use? Otherwise, be aware that rand() sucks: it's good for making numbers look random to a moron in a hurry, not much more. Avoid the floating-point division. rand() % 2 is generally a bit worse than rand() > RAND_MAX / 2, since the low bit of some generators is poorly distributed, but that difference rarely matters.
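A minimal sketch of the suggested comparison form; the name coin_flip is mine, not from the original code:

```c
#include <stdlib.h>
#include <assert.h>

/* placeholder 50/50 decision without floating-point division:
 * rand() > RAND_MAX / 2 compares against the top half of the range,
 * avoiding the poorly distributed low bit that rand() % 2 reads
 * on some implementations */
static int coin_flip(void)
{
    return rand() > RAND_MAX / 2;
}
```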
- (Commentary) You assume that sizeof(int) == 4. Not great; prefer sizeof or a fixed-width type like int32_t.
- Is there a reason not to just copy the entire buffer? A single large copy might be faster than many small ones, even if it touches double the data. That is, if the uncopied elements aren't going to be used, it doesn't matter whether the original data is still in there. OTOH, if the uncopied elements must not be overwritten, this does not apply.
- Replace the memcpy with 3 integer assignments. Good compilers should be able to do that for you in most scenarios like yours, but memcpy can get a little complex: it needs to handle odd lengths, might need to check for unaligned reads, etc. Explicit assignments let the three stores use the multiple execution units per core in parallel.
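What the three-assignment replacement might look like; copy3 is an illustrative name, not from the original code:

```c
#include <assert.h>

/* explicit 3-element copy: the size and type are fixed and visible to
 * the compiler, so the three stores can be scheduled independently
 * instead of going through memcpy's general-purpose length and
 * alignment handling */
static void copy3(int *dst, const int *src)
{
    dst[0] = src[0];
    dst[1] = src[1];
    dst[2] = src[2];
}
```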
- Big optimization potential in parallelizing (watch the cache, though). If you can make the random number generation non-sequential, e.g. by using 4 independent generators, you could distribute the load over multiple threads, each processing one chunk of the data.
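A sketch of that chunking idea using POSIX threads and rand_r, which keeps its generator state in a caller-owned variable, so each thread gets an independent stream. The seeding scheme and chunk split here are purely illustrative:

```c
#include <pthread.h>
#include <stdlib.h>
#include <assert.h>

#define NTHREADS 4

struct chunk {
    int *out;
    size_t lo, hi;
    unsigned seed;   /* private generator state for this thread */
};

/* each thread owns its generator state, so the 50/50 decisions are
 * independent streams and chunks can be filled in parallel */
static void *fill_chunk(void *arg)
{
    struct chunk *c = arg;
    for (size_t i = c->lo; i < c->hi; ++i)
        c->out[i] = rand_r(&c->seed) > RAND_MAX / 2;
    return NULL;
}

static void fill_parallel(int *out, size_t n)
{
    pthread_t tid[NTHREADS];
    struct chunk c[NTHREADS];
    size_t step = n / NTHREADS;
    for (int t = 0; t < NTHREADS; ++t) {
        c[t].out = out;
        c[t].lo = (size_t)t * step;
        c[t].hi = (t == NTHREADS - 1) ? n : ((size_t)t + 1) * step;
        c[t].seed = 12345u + (unsigned)t; /* illustrative seeds */
        pthread_create(&tid[t], NULL, fill_chunk, &c[t]);
    }
    for (int t = 0; t < NTHREADS; ++t)
        pthread_join(tid[t], NULL);
}
```

Note that neighboring chunks should not share cache lines at their boundaries, or the threads will fight over them (false sharing).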
- The branch could be avoided by copying to a dummy buffer instead. It's an interesting idea; I'm not sure it buys you too much, though:
int dummyBuffer[3];
for (...)
{
    int *target = (rand() % 2) ? dummyBuffer : cp + n;
    // <-- replace with arithmetic trickery to avoid the branch
    target[0] = p[n][0];
    target[1] = p[n][1];
    target[2] = p[n][2];
}
(As written, the branch has merely moved to the assignment of target, which is not much of a win. However, you probably know / can construct some trickery to make this assignment branch-free.)
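One way to do that trickery, sketched as a helper (select_target is my name, not from the original; the round-trip through uintptr_t is implementation-defined but works on common platforms):

```c
#include <stdint.h>
#include <assert.h>

/* branch-free selection between two candidate addresses: build an
 * all-ones / all-zeros mask from the 0/1 decision and blend the two
 * pointers; keep == 1 selects real, keep == 0 selects dummy */
static int *select_target(int *real, int *dummy, int keep)
{
    uintptr_t r = (uintptr_t)real;
    uintptr_t d = (uintptr_t)dummy;
    uintptr_t mask = (uintptr_t)0 - (uintptr_t)(keep & 1); /* 0 or ~0 */
    return (int *)((r & mask) | (d & ~mask));
}
```

In the loop above, target would become select_target(cp + n, dummyBuffer, rand() % 2); whether this beats a conditional move the compiler emits anyway is something to measure, not assume.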