If you can go over to use all the bits instead of 1 per int, the performance could at least be increased ;)
Also tested with SSE, __m128i _mm_add_epi32
, registers, etc, etc, (eka) but didn't manage to get any notable boost. ( Very probable I didn't do some of that that correctly. ).
Everything depends a lot on environment how the array is created, how it is used elsewhere, etc, etc. One could i.e. look into GPU processing, but that again becomes specialized, and probably better utilized on heavier calculations then +
.
Anyhow here is rough sample result I did on an P4 2.8GHz with 2G slow SDRAM; using normal 1 increment loop, unroll of 2 and 8 (on one digit pr. int) , and second bit twiddle from CountBitsSetParallel combined with unroll. Both threaded and not. Be careful with bit-twiddling if you decide to combine it with threads.
./bcn -z330000000 -s3 -i1
sz_i : 330000000 * 4 = 1320000000 (bytes int array)
sz_bi : 10312500 * 4 = 41250000 (bytes bit array)
set every : 3 (+ 1 controll-bit)
iterations: 1
Allocated 1320000000 bytes for ari (0x68cff008 - 0xb77d8a08)
1289062 KiB
1258 MiB
1 GiB
Allocated 41250000 bytes for arbi (0x665a8008 - 0x68cfecd8)
40283 KiB
39 MiB
Setting values ...
--START--
1 iteration over 330,000,000 values
Running TEST_00 Int Normal ; sum = 110000001 ... time: 0.618463440
Running TEST_01 Int Unroll 2 ; sum = 110000001 ... time: 0.443277919
Running TEST_02 Int Unroll 8 ; sum = 110000001 ... time: 0.425574923
Running TEST_03 Int Bit Calc ; sum = 110000001 ... time: 0.068396207
Running TEST_04 Int Bit Table ; sum = 110000001 ... time: 0.056727713
...
1 iteration over 200,000,000
Running TEST_00 Int Normal ; sum = 66666668 ... time: 0.339017852
Running TEST_01 Int Unroll 2 ; sum = 66666668 ... time: 0.273805886
Running TEST_02 Int Unroll 8 ; sum = 66666668 ... time: 0.264436688
Running TEST_03 Int Bit Calc ; sum = 66666668 ... time: 0.032404574
Running TEST_04 Int Bit Table ; sum = 66666668 ... time: 0.034900498
...
100 iterations over 2,000,000 values
Running TEST_00 Int Normal ; sum = 666668 ... time: 0.373892700
Running TEST_01 Int Unroll 2 ; sum = 666668 ... time: 0.270294678
Running TEST_02 Int Unroll 8 ; sum = 666668 ... time: 0.260143237
Running TEST_03 Int Bit Calc ; sum = 666668 ... time: 0.031871318
Running TEST_04 Int Bit Table ; sum = 666668 ... time: 0.035358995
...
1 iteration over 10,000,000 values
Running TEST_00 Int Normal ; sum = 3333335 ... time: 0.023332354
Running TEST_01 Int Unroll 2 ; sum = 3333335 ... time: 0.011932137
Running TEST_02 Int Unroll 8 ; sum = 3333335 ... time: 0.013220130
Running TEST_03 Int Bit Calc ; sum = 3333335 ... time: 0.002068979
Running TEST_04 Int Bit Table ; sum = 3333335 ... time: 0.001758484
Threads ...
4 threads, 1 iteration pr. thread over 200,000,000 values
Running TEST_00 Int Normal ; sum = 66666668 ... time: 0.285753177
Running TEST_01 Int Unroll 2 ; sum = 66666668 ... time: 0.263798773
Running TEST_02 Int Unroll 8 ; sum = 66666668 ... time: 0.254483912
Running TEST_03 Int Bit Calc ; sum = 66666668 ... time: 0.031457365
Running TEST_04 Int Bit Table ; sum = 66666668 ... time: 0.036319760
Snip (Sorry for short naming):
/* I used an array named "ari" for integer 1 value based array, and
"arbi" for integer array with bits set to 0 or 1.
#define SZ_I : number of elements (int based)
#define SZ_BI: number of elements (bit based) on number of SZ_I, or
as I did also by user input (argv)
*/
#define INT_BIT (CHAR_BIT * sizeof(int))
#define SZ_I (100000000U)
#define SZ_BI ((SZ_I / INT_BIT ) + (SZ_I / INT_BIT * INT_BIT != SZ_I))
static unsigned int sz_i = SZ_I;
static unsigned int sz_bi = SZ_BI;
static unsigned int *ari;
static unsigned int *arbi;
/* (if value (sz_i) from argv ) */
sz_bi = sz_i / INT_BIT + (sz_i / INT_BIT * INT_BIT != sz_i);
...
#define UNROLL 8
static __inline__ unsigned int bitcnt(unsigned int v)
{
v = v - ((v >> 1) & 0x55555555);
v = (v & 0x33333333) + ((v >> 2) & 0x33333333);
return (((v + (v >> 4)) & 0xF0F0F0F) * 0x1010101) >> 24;
}
unsigned int test_03(void)
{
unsigned int i = 0;
unsigned int sum = 0;
unsigned int rep = (sz_bi / UNROLL);
unsigned int rst = (sz_bi % UNROLL);
while (rep-- > 0) {
sum += bitcnt(arbi[i]);
sum += bitcnt(arbi[i+1]);
sum += bitcnt(arbi[i+2]);
sum += bitcnt(arbi[i+3]);
sum += bitcnt(arbi[i+4]);
sum += bitcnt(arbi[i+5]);
sum += bitcnt(arbi[i+6]);
sum += bitcnt(arbi[i+7]);
i += UNROLL;
}
switch (rst) {
case 7: sum += bitcnt(arbi[i+6]);
case 6: sum += bitcnt(arbi[i+5]);
case 5: sum += bitcnt(arbi[i+4]);
case 4: sum += bitcnt(arbi[i+3]);
case 3: sum += bitcnt(arbi[i+2]);
case 2: sum += bitcnt(arbi[i+1]);
case 1: sum += bitcnt(arbi[i]);
case 0:;
}
return sum;
}