
I have a very large array (e.g. 10 million elements) that consists of only 1's and 0's. I also have a number of parallel threads (e.g. 10), and I would like to chunk this large array across the threads and have each of them sum the portion it is responsible for.

I've coded the problem in C & pthreads using the "+" operator. However, since the array only consists of 1's and 0's, I wonder whether there is a faster way to implement this summation (via bitwise operators, shifting, etc.). Since I am dealing with very large arrays, the naive summation is killing the performance.
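For context, a minimal sketch of the chunked pthreads summation being described (the names, globals, and thread count here are illustrative, not the asker's actual code):

```c
#include <pthread.h>
#include <stdlib.h>

#define NTHREADS 10

static int    *big;    /* the large 0/1 array */
static size_t  big_n;  /* its element count   */

struct chunk { size_t lo, hi; long long sum; };

/* Each thread naively "+"-sums its half-open range [lo, hi). */
static void *sum_chunk(void *arg)
{
    struct chunk *c = arg;
    long long s = 0;
    for (size_t i = c->lo; i < c->hi; ++i)
        s += big[i];
    c->sum = s;
    return NULL;
}

long long parallel_sum(void)
{
    pthread_t    tid[NTHREADS];
    struct chunk ck[NTHREADS];
    size_t per = big_n / NTHREADS;

    for (int t = 0; t < NTHREADS; ++t) {
        ck[t].lo = (size_t)t * per;
        ck[t].hi = (t == NTHREADS - 1) ? big_n : (size_t)(t + 1) * per;
        pthread_create(&tid[t], NULL, sum_chunk, &ck[t]);
    }

    long long total = 0;
    for (int t = 0; t < NTHREADS; ++t) {
        pthread_join(tid[t], NULL);
        total += ck[t].sum;   /* combine per-thread partial sums */
    }
    return total;
}
```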

  • So this is an array of bits? Chars? Ints? Floats? Character strings? There's one 1/0 in each array element, or 32 of them, or what? – Hot Licks Mar 31 '12 at 21:12
  • With an array that large, you may find that the bottleneck is memory access...in which case splitting the work into a bunch of threads (even on a multi-core CPU) just gets you a bunch of threads that run in roughly the same amount of time. – cHao Mar 31 '12 at 21:12
  • @cHao, actually that's utterly false. A single core of a multi-core chip *cannot* saturate the memory bandwidth available to the whole chip. For instance, Nehalem processors have a limit of 10 outstanding loads per core, but the memory controller can service a few times that. When a core hits that limit, it simply stalls until some operations complete. http://dx.doi.org/10.1109/ISPASS.2010.5452064 – Phil Miller Mar 31 '12 at 21:37
  • @Hot Licks An array of ints: [1,0,0,1,0,1,0] – IronButterfly Mar 31 '12 at 21:46
  • @Novelocrat: If all the cores are stalling, your threads aren't going to run any faster, are they? – cHao Mar 31 '12 at 21:46
  • The point is that you actually need to parallelize to the point that all of the cores are stalling, because just one or two running at their limit doesn't actually saturate the limiting resource. – Phil Miller Mar 31 '12 at 21:52
  • How about all of them? 10 threads all running at once (and not having to block for anything but memory access) could easily max out the CPU, no? And BTW, when did memory access get faster than CPU speed? Last I checked, it's been the other way around for like 20 years. – cHao Mar 31 '12 at 21:55
  • I think we're speaking past each other. I agree with you that memory access will be the limiting factor for this problem, on one core or several. What I'm trying to convey, and what the linked paper supports, is that multiple cores of a single processor chip actually offer a higher memory bandwidth than the single core can attain. – Phil Miller Mar 31 '12 at 22:13
  • Following up, the bottom line is that multiple threads should actually provide a speedup over a single thread, because the total memory bandwidth available to them is greater than in the single-thread case. – Phil Miller Mar 31 '12 at 22:24
  • @Novelocrat -- On a multicore... The memory bandwidth is available to the core, not the thread. – Hot Licks Mar 31 '12 at 22:38
  • Do I understand you right that you have an array of ints (presumably at least 32-bits), but each one is only 1-or-0? And you're worried about efficiency? You realize you could collapse the size of the problem at least 32x just by using bits instead of ints. – abelenky Mar 31 '12 at 22:47
  • @abelenky: That could easily resolve the memory bandwidth issue...at that point, the biggest slowdown would be the popcount, and that could be parallelized easier (and that's if the CPU doesn't already have a popcount instruction, which apparently Nehalem-based and later CPUs do). – cHao Apr 01 '12 at 01:40

7 Answers


You're summing an array of 10 million elements... on a modern CPU that can execute around 3 billion instructions per second (3GHz).

Even if each single element had to be added individually, you could sum the entire array in about 0.003 seconds. (And that is really a worst-case scenario: on a 64-bit machine, packing the values as bits would let you process 64 elements at a time.)

Unless this is happening inside an inner loop, this should NOT be killing performance.

Consider describing your problem more fully, and showing your current implementation.

abelenky
  • To reinforce your point, that this sort of thing is _very_ quick, I just did a simple version where each bit was stored in a char, on a 4 year old 2.2GHz Core 2 Duo. Timing the entire program (using time from the command line), the results are: real 0m0.026s, user 0m0.010s, sys 0m0.015s. System time is bigger than user time! I used a bit of I/O to stop the compiler getting too sneaky. – gbulmer Mar 31 '12 at 21:30
  • C is often used for embedded programming on special devices, so the processor could be much slower than that. IMHO, the questioner should clarify this point for us. – Gangnus Mar 31 '12 at 21:36
  • 3 ms is *not* a short time, when your computer is doing serious work. Among other applications, my lab works on a molecular dynamics simulation program that can calculate the interactions of 100 million atoms using supercomputers of hundreds of thousands of cores, and the iteration timesteps are on the order of single-digit milliseconds. If we were doing a summation poorly and it were costing us even a small fraction of that, we wouldn't be able to achieve these sorts of results. – Phil Miller Mar 31 '12 at 21:41
  • @abelenky Sorry for not describing the problem fully. This summation is just a small piece of the code I am implementing: parallel quicksort, in particular. I can explain its implementation details, but I guess they are not related to the question (I am following page 18 of this paper: www.cs.cmu.edu/~blelloch/papers/Ble93.pdf). But even 0.00x seconds matters for the speedup of the program, so I am trying to squeeze the juice out of every operation in the code. So my question is simply: is there a way to do the summation faster than just using "+"? – IronButterfly Mar 31 '12 at 21:43
  • @Novelocrat: The problem you describe involves huge numbers of iterations, so of course it is important that the inner-most operation run quickly. The poster in this question mentioned NOTHING about an outer loop performing this operation repeatedly. – abelenky Mar 31 '12 at 23:21

First, convert to doing a SIMD vector sum, and reduce the elements of the vector register to a single sum at the end, outside your loop. That should get you the same result in 1/4 the operations. Then unroll that vectorized loop, with each unrolled iteration summing in a separate vector, to expose greater instruction-level parallelism, and combine the partial sums at the end. With that, you should pretty easily max out memory bandwidth.
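A sketch of what this might look like with SSE2 intrinsics, assuming an x86 target and values stored one per unsigned int (the function name `simd_sum` is illustrative, not code from the answer): two independent vector accumulators expose instruction-level parallelism, and the lanes are reduced to a single sum only after the loop.

```c
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Sum an array of 0/1 ints, four 32-bit lanes at a time, using two
   independent accumulators so consecutive adds don't depend on each other. */
static unsigned int simd_sum(const unsigned int *a, size_t n)
{
    __m128i acc0 = _mm_setzero_si128();
    __m128i acc1 = _mm_setzero_si128();
    size_t i = 0;

    /* Unrolled vector loop: 8 elements per iteration. */
    for (; i + 8 <= n; i += 8) {
        acc0 = _mm_add_epi32(acc0, _mm_loadu_si128((const __m128i *)(a + i)));
        acc1 = _mm_add_epi32(acc1, _mm_loadu_si128((const __m128i *)(a + i + 4)));
    }

    /* Combine partial sums and reduce the four lanes outside the loop. */
    acc0 = _mm_add_epi32(acc0, acc1);
    unsigned int lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc0);
    unsigned int sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];

    for (; i < n; ++i)   /* scalar tail for leftover elements */
        sum += a[i];
    return sum;
}
```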

Phil Miller

If you can switch to using all the bits of each int instead of storing 1 value per int, performance can at least be increased ;)

I also tested with SSE (__m128i, _mm_add_epi32, registers, etc.) but didn't manage to get any notable boost. (It's very probable I didn't do some of that correctly.)

Everything depends a lot on the environment: how the array is created, how it is used elsewhere, and so on. One could, e.g., look into GPU processing, but that again becomes specialized, and is probably better utilized on heavier calculations than +.

Anyhow, here is a rough sample result I got on a P4 2.8GHz with 2G of slow SDRAM, using a normal 1-increment loop, unrolls of 2 and 8 (one digit per int), and the bit twiddle from CountBitsSetParallel combined with unrolling. Both threaded and not. Be careful with bit twiddling if you decide to combine it with threads.

./bcn -z330000000 -s3 -i1
sz_i      : 330000000 * 4 = 1320000000 (bytes int array)
sz_bi     :  10312500 * 4 =   41250000 (bytes bit array)
set every :         3 (+ 1 controll-bit)
iterations:         1

Allocated 1320000000 bytes for ari    (0x68cff008 - 0xb77d8a08)
            1289062 KiB
               1258 MiB
                  1 GiB
Allocated  41250000 bytes for arbi   (0x665a8008 - 0x68cfecd8)
              40283 KiB
                 39 MiB
Setting values ...
--START--
    1 iteration over 330,000,000 values
Running TEST_00 Int Normal    ; sum = 110000001 ... time: 0.618463440
Running TEST_01 Int Unroll 2  ; sum = 110000001 ... time: 0.443277919
Running TEST_02 Int Unroll 8  ; sum = 110000001 ... time: 0.425574923
Running TEST_03 Int Bit Calc  ; sum = 110000001 ... time: 0.068396207
Running TEST_04 Int Bit Table ; sum = 110000001 ... time: 0.056727713

...

    1 iteration over 200,000,000
Running TEST_00 Int Normal    ; sum = 66666668 ... time: 0.339017852
Running TEST_01 Int Unroll 2  ; sum = 66666668 ... time: 0.273805886
Running TEST_02 Int Unroll 8  ; sum = 66666668 ... time: 0.264436688
Running TEST_03 Int Bit Calc  ; sum = 66666668 ... time: 0.032404574
Running TEST_04 Int Bit Table ; sum = 66666668 ... time: 0.034900498

...

  100 iterations over 2,000,000 values
Running TEST_00 Int Normal    ; sum = 666668 ... time: 0.373892700
Running TEST_01 Int Unroll 2  ; sum = 666668 ... time: 0.270294678
Running TEST_02 Int Unroll 8  ; sum = 666668 ... time: 0.260143237
Running TEST_03 Int Bit Calc  ; sum = 666668 ... time: 0.031871318
Running TEST_04 Int Bit Table ; sum = 666668 ... time: 0.035358995

...

    1 iteration over 10,000,000 values
Running TEST_00 Int Normal    ; sum = 3333335 ... time: 0.023332354
Running TEST_01 Int Unroll 2  ; sum = 3333335 ... time: 0.011932137
Running TEST_02 Int Unroll 8  ; sum = 3333335 ... time: 0.013220130
Running TEST_03 Int Bit Calc  ; sum = 3333335 ... time: 0.002068979
Running TEST_04 Int Bit Table ; sum = 3333335 ... time: 0.001758484

Threads ...

 4 threads, 1 iteration pr. thread over 200,000,000 values
Running TEST_00 Int Normal    ; sum = 66666668 ... time: 0.285753177
Running TEST_01 Int Unroll 2  ; sum = 66666668 ... time: 0.263798773
Running TEST_02 Int Unroll 8  ; sum = 66666668 ... time: 0.254483912
Running TEST_03 Int Bit Calc  ; sum = 66666668 ... time: 0.031457365
Running TEST_04 Int Bit Table ; sum = 66666668 ... time: 0.036319760

Snip (Sorry for short naming):

/* I used an array named "ari" for integer 1 value based array, and
   "arbi" for integer array with bits set to 0 or 1.

   #define SZ_I : number of elements (int based)
   #define SZ_BI: number of elements (bit based) on number of SZ_I, or
      as I did also by user input (argv)
 */

#define INT_BIT     (CHAR_BIT * sizeof(int))

#define SZ_I    (100000000U)
#define SZ_BI   ((SZ_I / INT_BIT ) + (SZ_I / INT_BIT  * INT_BIT  != SZ_I))

static unsigned int sz_i  = SZ_I;
static unsigned int sz_bi = SZ_BI;

static unsigned int   *ari;
static unsigned int   *arbi;

/* (if value (sz_i) from argv ) */
sz_bi = sz_i  / INT_BIT + (sz_i / INT_BIT  * INT_BIT  != sz_i);

...
#define UNROLL  8


/* SWAR population count: returns the number of set bits in v
   (the CountBitsSetParallel method from Bit Twiddling Hacks). */
static __inline__ unsigned int bitcnt(unsigned int v)
{
    v = v - ((v >> 1) & 0x55555555);
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333);
    return (((v + (v >> 4)) & 0xF0F0F0F) * 0x1010101) >> 24;
}

unsigned int test_03(void)
{
    unsigned int i   = 0;
    unsigned int sum = 0;
    unsigned int rep = (sz_bi / UNROLL);
    unsigned int rst = (sz_bi % UNROLL);

    while (rep-- > 0) {
        sum += bitcnt(arbi[i]);
        sum += bitcnt(arbi[i+1]);
        sum += bitcnt(arbi[i+2]);
        sum += bitcnt(arbi[i+3]);
        sum += bitcnt(arbi[i+4]);
        sum += bitcnt(arbi[i+5]);
        sum += bitcnt(arbi[i+6]);
        sum += bitcnt(arbi[i+7]);
        i += UNROLL;
    }

    switch (rst) {
    case 7: sum += bitcnt(arbi[i+6]);
    case 6: sum += bitcnt(arbi[i+5]);
    case 5: sum += bitcnt(arbi[i+4]);
    case 4: sum += bitcnt(arbi[i+3]);
    case 3: sum += bitcnt(arbi[i+2]);
    case 2: sum += bitcnt(arbi[i+1]);
    case 1: sum += bitcnt(arbi[i]);
    case 0:;
    }

    return sum;
}
Morpfh

You mentioned an array of ints, so something like this: int array[...];

If you're on a 64-bit OS+CPU, you may want to typecast it to unsigned long long (or __int64, depending on your platform) - basically 8-byte integers, so each addition handles two elements at once. So you do this:

unsigned int array[N];                      /* N elements, each 0 or 1 */
...
const unsigned long long *longArray = (const unsigned long long *)array;
unsigned long long sum = 0;                 /* must be initialized */
size_t i;

for (i = 0; i < N / 2; i++)
    sum += longArray[i];                    /* adds two 32-bit elements per op */

if (N % 2 == 1)                             /* odd count: one leftover element */
    sum += array[N - 1];

sum = (sum >> 32) + (sum & 0xFFFFFFFF);     /* fold the two 32-bit halves together */
return (unsigned int)sum;

But try it, I'm not entirely sure it will be faster.

Alex
  • This is a pretty rough way to do it, because each iteration carries two dependences on the previous iteration, that the compiler would have to prove that it can break to optimize this loop. My answer explains that you can access more elements of the array per loop iteration (4-wide vectors), and you need to sum those elements to separate accumulators. Incidentally, all that unrolling also makes loop overhead a much smaller fraction of the total effort, which your pointer arithmetic (which can mess with some compiler optimizations) only starts to touch. – Phil Miller Mar 31 '12 at 22:19

If by summing each value you mean, basically, counting how many 1s there are, then I think the only way is to add up each value from the array/chunk... Reimplementing the add instruction with bitwise operators would, I think, just be slower than using the CPU's add; but maybe it depends on the CPU.

Also, skipping 0s is not faster than adding them (jumps are slow)...

The only thing that comes to my mind that could speed it up is to pack the data (from the beginning) in a way that lets you exploit special instructions of your target CPU. Some CPUs have an instruction that makes it easy (and fast, I suppose) to get the population count of a register. A 32-bit register can hold 32 bits (32 elements of your array), and you can "sum" them with a single instruction (on specific CPUs...). Then you of course have to add the result to the thread's "global partial" result; anyway, this way you reduce the number of add instructions (32 adds become a single instruction). This should work together with Novelocrat's answer (e.g. the elements of the vector register are the results of the population count).

Recent "x86" CPUs have a population count instruction (POPCNT); see Wikipedia.
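As a sketch of this idea: if the values are packed 32 per word from the start, GCC/Clang's `__builtin_popcount` (which compiles down to POPCNT where the target supports it) counts each word in one step. The function name `packed_sum` is illustrative:

```c
#include <stddef.h>

/* Sum a 0/1 array stored as packed bits, 32 per unsigned int.
   Each word's contribution is its population count, so 32 scalar
   adds collapse into one popcount plus one add. */
unsigned int packed_sum(const unsigned int *packed, size_t nwords)
{
    unsigned int sum = 0;
    for (size_t i = 0; i < nwords; ++i)
        sum += (unsigned int)__builtin_popcount(packed[i]);
    return sum;
}
```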

ShinTakezou

On most processors the add instruction is one of the fastest. The surrounding logic (calculating the element address, fetching the element, widening it as needed, and so on) will swamp the actual add by a factor of 4-10 (and even more if the compiler inserts array bounds checks, interrupt points, etc.).

The first thing to do, of course, is to convert the array index into a pointer increment. Next, one would unroll the loop maybe 20 times. A good optimizing compiler might possibly do something equivalent, but in this case you may (or may not) be able to do it a hair better.

Another trick, especially if you have an array of bytes, is to do something similar to what Novelocrat suggests (and apparently what Alex suggests): coerce the array pointer to a pointer to long and fetch more than one array element at a time, then add multiple elements (4 in the case of bytes and a 32-bit long) at the same time with one operation. Of course, with bytes you'd have to stop at least every 255 iterations and split things up, to keep one byte lane from overflowing into the next.
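A hedged sketch of that byte-lane trick (names are illustrative): it uses `memcpy` for the wide fetch to sidestep alignment and aliasing trouble, fetches 8 bytes per operation via a 64-bit word, and spills the lanes into a wide total at least every 255 iterations so no byte lane overflows into its neighbor.

```c
#include <stddef.h>
#include <string.h>

/* Sum an array of 0/1 bytes, 8 lanes at a time. Each block runs at most
   255 word-iterations, so a byte lane can reach at most 255 before the
   lanes are spilled into the wide total - no lane ever overflows. */
unsigned long long byte_swar_sum(const unsigned char *a, size_t n)
{
    size_t nwords = n / 8;
    unsigned long long total = 0;
    size_t i = 0;

    while (i < nwords) {
        size_t block_end = (i + 255 < nwords) ? i + 255 : nwords;
        unsigned long long acc = 0;
        for (; i < block_end; ++i) {
            unsigned long long w;
            memcpy(&w, a + i * 8, 8);  /* fetch 8 byte-elements at once */
            acc += w;                  /* 8 lane-wise adds in one operation */
        }
        for (int s = 0; s < 64; s += 8)   /* spill each byte lane */
            total += (acc >> s) & 0xFF;
    }

    for (i = nwords * 8; i < n; ++i)   /* leftover elements */
        total += a[i];
    return total;
}
```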

You do need to beware of keeping too many values "in the air" in a single thread. There are only so many processor registers, and you want to keep all your values (element pointer, iteration counter, accumulator, scratch reg for the fetched element, et al) in registers.

But very quickly storage access will become the bottleneck.

Hot Licks
  • Unrolling by 20 will absolutely blow out available registers - x86_64 only offer 16 GPRs, and a similar number of vector registers. Converting array indexing to pointer arithmetic is a *bad* idea, because compilers know how to optimize array indexing (their developers have >30 years experience doing that well), while pointer optimizations require much deeper analysis. – Phil Miller Mar 31 '12 at 22:47
  • @Novelocrat -- But unrolling doesn't take any more registers, if done correctly. And, as I said with regard to the pointer conversion, it may OR MAY NOT improve things. You'd really have to examine the generated code to know. Sometimes compilers are exceedingly smart, and sometimes they are exceedingly stupid (speaking as someone with only maybe 15 years experience with compiler optimization). – Hot Licks Mar 31 '12 at 23:10

You might find that array indexing generates better code than pointer indexing. Take a look at the assembler generated by the compiler to be sure; with gcc this is the -S option. On my iMac using gcc v4.2.1, I'm seeing indexing generate shorter code, although as I don't know x86 assembler I can't say whether it is actually quicker.

BTW, is the int array mandated by hardware or external constraints?

William Morris