
I recently saw a nice solution to a programming problem: given two lists, find the number that is missing from one of them.

My initial solution was like this:

long long missing_2(long long a[], long long b[], long long bsize) {
    long long asum = a[bsize], bsum = 0;
    for (long long i = 0; i < bsize; i++) {
        asum += a[i];
        bsum += b[i];
    }
    return asum-bsum;
}

but someone suggested something like this:

long long missing_3(long long a[], long long b[], long long bsize) {
    long long sum = 0 ^ a[bsize];
    for (long long i = 0; i < bsize; i++) {
        sum ^= a[i];
        sum ^= b[i];
    }
    return sum;
}

Out of curiosity, I timed the two solutions, expecting the second one (missing_3) to be faster. I got these results:

missing_2: Time taken: 16.21s

missing_3: Time taken: 23.39s

The lists are generated using a for loop. List b is filled with ints 0-1000000000, and list a is filled with 1-1000000000, with a random number appended at the end (so that it contains one different, extra value).

Question: The bitwise version takes 23.39s, while the summation version takes 16.21s. Any idea why the bitwise version would be noticeably slower than the summation version? I would have assumed bitwise operations would be faster than addition, or at least similar.

Edit:

Compiled using g++ with no extra flags/options.

Edit2:

Tested with flags -O1 and -O2, no noticeable difference.

Edit3:

Here is the driver code:

    long long smaller_size = 1000000000;
    
    long long* a = new long long[smaller_size+1];
    long long* b = new long long[smaller_size];

    for(long long i = 0; i < smaller_size; i++){
        a[i] = b[i] = i;
    }
    a[smaller_size] = 1434;
    // std::cout << missing_1(a, b, smaller_size) << std::endl;
    clock_t tStart = clock();
    std::cout << "Start List Test\n";

    std::cout << missing_2(a, b, smaller_size) << std::endl;
    printf("Time taken: %.2fs\n", (double)(clock() - tStart)/CLOCKS_PER_SEC);
    tStart = clock();
    std::cout << missing_3(a, b, smaller_size) << std::endl;
    printf("Time taken: %.2fs\n", (double)(clock() - tStart)/CLOCKS_PER_SEC);

Edit 5: Timing section updated (no change in the timings):

    clock_t tStart = clock();
    std::cout << "Start List Test\n";
    double time; 

    missing_2(a, b, smaller_size);
    time = (double)(clock() - tStart)/CLOCKS_PER_SEC;
    printf("Time taken: %.2fs\n", time);
    tStart = clock();
    missing_3(a, b, smaller_size);
    time = (double)(clock() - tStart)/CLOCKS_PER_SEC;
    printf("Time taken: %.2fs\n", time);
  • It's not much slower. Did you compile with optimisations turned on? –  Apr 14 '19 at 22:22
  • @NeilButterworth This is with a billion but it's clear that with higher values the summation version would start pulling away. With 100 Million it's 26s vs 33s. I didn't change anything with the compiler. That's a good point. – huddie96 Apr 14 '19 at 22:25
  • Someone with rep of over 500 should know by now that anything to do with timing C++ applications should be accompanied by the compiler options they used to build the application. – PaulMcKenzie Apr 14 '19 at 22:26
  • *compiled using g++ with no extra flag/options.* -- Then rebuild your program using `-O2` or `-O3` and rerun your tests. – PaulMcKenzie Apr 14 '19 at 22:28
  • I don't keep up to date with current ALU implementations, but the xor version looks like it should lose a tick due to pipelining. I.e. xor b must wait for the result of xor a. – Peter Apr 14 '19 at 22:29
  • @PaulMcKenzie no noticeable difference with `-O2` or `-O3`. I will add that as an edit to my post. @Peter, neither do I but if that's true that would make sense, cool thought! – huddie96 Apr 14 '19 at 22:36
  • My understanding is an XOR operation is faster than an ADD or subtract operation. However, we are talking in units of nanoseconds (or faster). IMHO, unless you have 1e+09 operations (or more), the difference between using XOR or add/subtract is insignificant. The speed you gain will be wasted in other parts of the program or by the operating system in swapping out your program. – Thomas Matthews Apr 14 '19 at 22:39
  • @huddie96 Another thing you're missing -- *how* are you timing these tests? I see no (C++) code that starts and stops a timer. So what are you timing exactly, or more succinctly, what are you throwing into the mix that has nothing to do with the computations? – PaulMcKenzie Apr 14 '19 at 22:39
  • @PaulMcKenzie not sure what you mean in #1. I'm creating a variable, its name is `asum`, its type is `long long`, and its value is `a[bsize]`. I will add the rest of my timing code to the question. – huddie96 Apr 14 '19 at 22:41
  • You should also run the tests unrolling the loop and with instructions to parallelize the operation. Declare the arrays as constant, which should help the compiler optimize better. – Thomas Matthews Apr 14 '19 at 22:42
  • ok, so is `bsize` within bounds of the `a` array? We really need a [mcve]. Anything that can go wrong that you may not have noticed could impact your results. – PaulMcKenzie Apr 14 '19 at 22:42
  • @huddie96: arrays must have a fixed size capacity defined by a compile-time constant. Otherwise use a dynamic array (e.g.`new`), `std::vector` or `std::array`. – Thomas Matthews Apr 14 '19 at 22:43
  • @ThomasMatthews I think I do use `new`, can you show me where you're speaking of? – huddie96 Apr 14 '19 at 22:44
  • @huddie96 -- So you're timing `cout` statements also? You should be timing just the "raw" computation code, not output statements. – PaulMcKenzie Apr 14 '19 at 22:45
  • @PaulMcKenzie that's a great point. I didn't think of that at all. I'll fix that. – huddie96 Apr 14 '19 at 22:47
  • A variable length array example: `int b = 25; int a[b];`. Using `new`: `int* c = new int[36];` Vector: `std::vector<int> v;` – Thomas Matthews Apr 14 '19 at 22:47
  • @ThomasMatthews don't I use that? I'm a bit confused as to where I am not doing that. – huddie96 Apr 14 '19 at 22:50
  • BTW, if `bsize` is the capacity of your array, then `a[bsize]` is undefined access (you're accessing past the end of the array). – Thomas Matthews Apr 14 '19 at 22:51
  • @PaulMcKenzie I updated the driver to not time the `cout`, no time change. @ThomasMatthews bsize is the size of array `b`. Array `a` is size `bsize+1`. – huddie96 Apr 14 '19 at 22:52
  • As has been said, it's clearly a pipelining issue - you have a data dependency in your loop body, that is also carried forward through the next iteration. In the first example instead the loop body can go fully in parallel. – Matteo Italia Apr 14 '19 at 23:13
  • @Peter or Matteo, do one of you want to submit an answer so I can accept it. Makes sense that it's pipelining. Probably never would have thought of that myself tbh, smart answer. – huddie96 Apr 14 '19 at 23:18
  • @huddie96 I'll pass, as I don't know enough about recent architectures to write a good quality answer, and am currently on my phone, sorry. I won't take any offense if anyone wants to write an answer inspired by a comment of mine. – Peter Apr 14 '19 at 23:24
  • exactly! https://onlinegdb.com/hzta8aQfl – Rohit gupta May 03 '23 at 06:53
  • we have not used any parallel processing thing here, Full Adder needs 2 XOR gates, 2 AND gates, and 1 OR gate. – Rohit gupta May 03 '23 at 06:55

1 Answer


As Peter and Matteo mentioned in the comments on the question:

Although a single XOR is as fast as (or faster than) a single addition, the XOR version is slower than the summation version because of pipelining.

In missing_3 both XORs in an iteration update the same accumulator, so each XOR must wait for the previous XOR to finish before it can start, and that dependency is also carried from one iteration to the next. In missing_2 the two additions use independent accumulators (`asum` and `bsum`), so they can execute in parallel and make use of the pipeline. That is what lets the summation version end up faster than the bitwise XOR version.

This answer is inspired by Peter's comment; I just wanted to make sure the question had an available answer, since Peter opted not to write one.

Edit:

Thanks to phuclv: altering the code to XOR arrays a and b into separate accumulators (see the comment below) breaks the dependency, and the XOR version then runs as fast as or faster than the summation version.
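
For reference, a minimal sketch of that dependency-broken version, following phuclv's comment below (the function name missing_4 is mine, not from the original post):

    long long missing_4(long long a[], long long b[], long long bsize) {
        // Each array gets its own accumulator, so the two XORs in an
        // iteration no longer wait on each other's result.
        long long sumA = a[bsize], sumB = 0;
        for (long long i = 0; i < bsize; i++) {
            sumA ^= a[i];
            sumB ^= b[i];
        }
        // Combine the two independent XOR chains at the end; matching
        // values cancel out, leaving the extra element of a.
        return sumA ^ sumB;
    }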

  • if it's really because of the dependency then break the dependency by `{ sumA ^= a[i]; sumB ^= b[i]; } return sumA ^ sumB;` – phuclv Apr 15 '19 at 00:31