how would you optimize this function?

Question

#include <stdlib.h> 
#include <cstring.h>
#include <time.h>

int cp[1000000][3];
int p[1000000][3];//assume this array to be populated

void  main(){

srand(time(NULL));

for(n; n < 1000000; n++){
    if (rand()%2)
        memcpy(cp[n], p[n], 12);
    }
}

}

This is a somewhat simplified version of the actual code I am using. This code takes up a significant part of my process I was wondering if I can optimize this with some clever tricks. I have used pointers before to avoid branching, but I can't figure out how to apply that here.

`if ((float)rand() > 0.5)` doesn't make any sense to me. `rand()` returns 0 to RAND_MAX so the only time the loop is false is when it returns 0. Is that really what you want? — NathanOliver, Sep 29 '16 at 13:16
Now you have the problem where `(rand()/RAND_MAX)` is integer division so it will always be 0 unless `rand()` returns `RAND_MAX`. Then it will be 1. — NathanOliver, Sep 29 '16 at 13:25
There is no reason to use float numbers in the first place anyway. — Lundin, Sep 29 '16 at 13:25
`for(n,n<1000000,n++){` most people would not use commas here, but semicolons : `for(int n; n<1000000; n++){` — joop, Sep 29 '16 at 13:31
It may sound basic, but... did you compile your program with debug options? If so, turn debug options off. Copying 12 bytes 1 million times (even calling that `rand()`) doesn't sound a really heavy job to me. Maybe use a profiler and make tests with or without the `rand()` or the `memcpy()`, to see where exactly most of the time is spent. Also, declaring the triplet as a `struct` and performing an assignment instead may be faster than calling `memcpy()`. — Constantine Georgiou, Sep 29 '16 at 14:18
In which language? And what's in that `` you haven't shown us? — Toby Speight, Sep 29 '16 at 20:04

score 4 · Accepted Answer · answered Sep 29 '16 at 13:20

4

Getting rid of the floating point is one obvious improvement you should do. That part looks fishy, I assume you want a 50% probability that the code will copy the data?

The branch itself could be removed with some silly trick like:

int do_copy = rand() % 2;
memcpy(cp[n], p[n], 12*do_copy);

However, I wouldn't write such code without looking at the disassembly of the optimized code first.

answered Sep 29 '16 at 13:20

Lundin

195,001
40
254
396

Would `rand() & 1` be a further improvement or is `% 2` just as efficient? – Support Ukraine Sep 29 '16 at 13:33
`&1` vs. `%2`: No difference on a modern compiler. – peterchen Sep 29 '16 at 13:34
@4386427 Any decent compiler should be able to optimize that. If your compiler does not optimize it, then indeed a bit check is much faster than a division on most CPUs. – Lundin Sep 29 '16 at 13:34
1

Calculating the length instead of a compile-time constant may actually make things worse, depending on how memcpy is unrolled. – peterchen Sep 29 '16 at 13:35
@peterchen That's true. It generally doesn't make sense to speak about optimizations like these without a specific system and compiler in mind. – Lundin Sep 29 '16 at 13:37
This shaved 35% off the execution time of my loop. Nicely done. – Andreas Sep 29 '16 at 21:26
@Andreas The most important part when doing these kind of optimizations is perhaps too look at the disassembly from the optimized code. You don't even need to know much about assembler, counting the number of instructions is a very good way to measure performance. Now of course, that's not the whole story - you won't be able to see how well the branch prediction or instruction cache work etc. But you would see how much overhead code the floating point added. – Lundin Sep 30 '16 at 07:01
@Lundin Already did this to check the answer. This has the least move instructions and removes a conditional jump which was causing the trouble. – Andreas Sep 30 '16 at 11:32

Peter · Answer 2 · 2016-09-30T13:03:19.750

1

rand() is very likely the bottleneck of this code. Since you only need a binary decision, consider using all the bits of a single random number to amortize the cost of the random number generation.

for(int n=0; n<1000000; n+=NUM_BITS){
    uint32_t rand_val = static_cast<uint32_t>(rand()); // Edited based on comments
    for(int j=0; j<NUM_BITS; j++) {
        if((rand_val >> j) % 2) {
            memcpy(cp[n+j], p[n+j], 12);
        }
    }
}

The only trick is figuring out NUM_BITS from RAND_MAX, and deciding how high-quality and portable you want this. Choose NUM_BITS so that 1<<NUM_BITS is less than RAND_MAX. Note that this version assumes an even division of NUM_BITS into the total number of samples. Checking this limit or writing a loop prolog to accommodate the partial is left as an exercise for the OP.

My Linux docs warn me that older versions of rand() did not have high quality randomness for all bits of the number, but that it is now fixed. If you care about high-quality randomness, pay attention to this.

You might also look for a faster random generator (they exist) if quality of randomness isn't particularly important.

edited Sep 30 '16 at 13:03

answered Sep 29 '16 at 13:43

Peter

14,559
35
55

`rand()` is usually an integer multiply and an addition with an implicit modulo. But good idea! – peterchen Sep 29 '16 at 13:44
This adds a whole lot more branching however. You'd have to compare the overhead caused by `rand()` with the all the lost branch prediction to see which is more efficient. – Lundin Sep 29 '16 at 13:57
I suppose that `cp[n+j], p[n+j]` might also cause worse data caching. If so, then that would likely be a major bottleneck more significant than anything else in that loop. – Lundin Sep 29 '16 at 13:58
Profiling shows that rand accounts for ~70% of the runtime of the original question – Peter Sep 29 '16 at 16:37
Profiling shows this version drops the runtime to 33% of the original. – Peter Sep 29 '16 at 16:41
It is certainly a nice trick to use for the case where you just want a true/false randomized. However, keep in mind that `int` is signed and bit-shifting a negative int is dangerous. Ideally, cast the result of `rand()` to uint32_t or equivalent. – Lundin Sep 30 '16 at 07:04

score 1 · Answer 3 · answered Sep 29 '16 at 13:57

Hard to provide a complete answer.

(Commentary) I assume the rand is only a placeholder for an external 50/50 decision, nor for productive use?

Otherwise, be aware that rand() sucks. It's good for making numbers look random to a moron in a hurry. Avoid the floating point division. rand()%2 is generally a bit worse than rand()>RAND_MAX/2, but that difference rarely matters.

(Commentary) you assume that sizeof(int)==4. not great.
Is there a reason to not just copy the entire buffer?

A single large copy might be faster than many small ones, even if it touches double the data.

i.e. if the uncopied elements aren't going to be used, it doesn't matter if the original data is in there. OTOH, if the uncopied elements must not be overwritten, this does not apply.

replace the memcpy with 3 integer assignments.

Good compilers should be able to do that in most scenarios like yours right now, but memcpy can get a little complex. (It needs to check for odd lengths, might need to check for unaligned reads, etc.)

This allows the three assignments to use the multiple units per core in parallel.

big optimization potential for parallelizing (but cache)

If you can make the random number generation non-sequential - e.g. by using 4 independent generators - could distribute the load over multiple threads, each processing one chunk of the data.

The branch could be avoided by copying to a dummy-buffer instead

It's an interesting idea, I'm not sure if it buys you too much, though:

int dummyBuffer[3];
for(...)
{
  int * target = (rand() % 2) ? dummyBuffer : cp+n;
  //  <-- replace with arithmetic trickery to avoid the branch
  target[0] = p[n][0];
  target[1] = p[n][1];
  target[2] = p[n][2];
}

(As written, the branch will be moved to the assignment of "target", not much of a win. However, you probably know / can construct some trickery to make this assignment branch-free)

how would you optimize this function?

3 Answers3