TL;DR: Neither of the two functions actually uses SIMD instructions, at least not on Windows with Numpy 1.24.3. As a result, they are sub-optimal. Besides, np.mean
only reads data from memory, while the bitwise OR produces a newly allocated output array, which is quite expensive.
Under the hood
First of all, np.mean
needs to convert the integer values to floating-point ones so as to then perform the actual sum and finally compute the mean. The assembly code executed (hot loop) for the conversion is the following:
Block 3:
mov rcx, qword ptr [r9] ; Load the integer item from memory
dec r10
xorps xmm0, xmm0
test rcx, rcx ; Strange sanity check
js 0x18011b0a7 <Block 5>
Block 4:
cvtsi2sd xmm0, rcx ; Convert the uint64 value to a double
jmp 0x18011b0bc <Block 6>
Block 5:
[NEVER ACTUALLY EXECUTED]
Block 6:
movsd qword ptr [rdx+r9*1], xmm0 ; Store the result back in memory
add r9, 0x8
test r10, r10
jnz 0x18011b092 <Block 3>
This loop is inefficient because it is fully scalar, it contains a conditional jump (which seems unnecessary at first glance) as well as useless instructions like xorps, and it is not unrolled. This code is certainly not automatically vectorized by the compiler due to the strange condition.
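This implicit conversion step can be made explicit in Python. The sketch below (using a placeholder array in place of the bitmap from the question) relies on the documented behaviour that np.mean uses a float64 intermediate for integer inputs, so it behaves as if the array were converted to double first:

```python
import numpy as np

# Placeholder array standing in for the bitmap from the question
bitmap = np.arange(1_000_000, dtype=np.uint64)

# np.mean on integer input uses a float64 intermediate, so it behaves
# as if every item were converted to double first and then averaged.
m1 = np.mean(bitmap)
m2 = bitmap.astype(np.float64).mean()
assert m1 == m2
```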
The assembly code executed (hot loop) for the mean is the following:
Block 12:
prefetcht0 zmmword ptr [r15+rcx*1]
vaddsd xmm5, xmm5, qword ptr [rcx] ; Block of scalar additions
vaddsd xmm2, xmm2, qword ptr [rcx+rdi*1] ; which also load data from memory
vaddsd xmm6, xmm6, qword ptr [rcx+rdi*2]
vaddsd xmm0, xmm0, qword ptr [r9+rcx*1]
vaddsd xmm3, xmm3, qword ptr [r10+rcx*1]
vaddsd xmm1, xmm1, qword ptr [rbx+rcx*1]
vaddsd xmm7, xmm7, qword ptr [rcx+rbp*2]
mov rax, rcx
add rcx, r14
sub rax, rdi
vaddsd xmm4, xmm4, qword ptr [rax]
sub r11, 0x1
jnz 0x18003a0b0 <Block 12>
This loop computes the scalar sum of 8 double-precision numbers. It is not optimal, but it is relatively good in practice on most machines. This function performs a pairwise sum instead of a naive sum so as to be more accurate. This code is not automatically vectorized by the compiler because of the access pattern. One would certainly need to write SIMD intrinsics manually to speed it up. However, this certainly has a significant maintainability overhead (should we keep the scalar version, and do we need to support several different floating-point types?).
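For reference, the pairwise summation scheme mentioned above can be sketched in pure Python. This is a simplified model, not Numpy's actual implementation (which uses a larger, blocked base case), but it shows why rounding errors grow like O(log n) rather than O(n):

```python
import numpy as np

def pairwise_sum(a):
    # Recursively split the array and sum the halves, so rounding
    # errors accumulate over O(log n) additions instead of O(n).
    if a.size <= 8:  # small base case: plain left-to-right sum
        return float(np.add.reduce(a.astype(np.float64)))
    mid = a.size // 2
    return pairwise_sum(a[:mid]) + pairwise_sum(a[mid:])

a = np.arange(10_000, dtype=np.uint64)
assert pairwise_sum(a) == float(np.sum(a.astype(np.float64)))
```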
The assembly code executed (hot loop) for the bitwise OR is the following:
Block 39:
mov rcx, r10 ; r10 is your scalar integer
lea r9, ptr [r9+0x8]
or rcx, qword ptr [rax] ; Bitwise OR reading data from memory
lea rax, ptr [rax+0x8]
mov qword ptr [r9-0x8], rcx ; Store the result back in memory
sub r8, 0x1
jnz 0x180187891 <Block 39>
This loop is not efficient: it is a scalar loop which is not unrolled. I guess this loop is not automatically vectorized by the compiler because one of the input values is a scalar. This is sad because it is relatively easy for compilers to use SIMD instructions in this case.
I submitted a PR that might improve the speed of these functions, but I do not expect a huge performance increase in this specific case.
On top of that, the bitwise OR produces a newly allocated output array, which is quite expensive. Indeed, filling large newly-allocated arrays causes page faults. On Windows, a kernel thread is (mainly) responsible for doing that and works for about half the time of the computation.
Moreover, scalar writes are also inefficient on x86-64 platforms because cache-line write allocations cause cache lines to be fetched from RAM before being written back. This means half the RAM bandwidth is wasted. There are non-temporal store SIMD instructions to avoid this problem, but they require the code to be vectorized and require knowing that the output array will not be reused from a cache later (which is hard to know since it depends on the actual processor used, the input array size and even the user code).
A similar behaviour can be seen on Linux with Numpy 1.24.2 (though results are significantly better).
Workarounds
Regarding the mean, you can use np.sum(bitmap)/bitmap.size if you know that the sum does not overflow. This is about 2.5 times faster on my machine (i5-9600KF).
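A minimal sketch of this workaround (the array contents are placeholders). Note that np.sum accumulates in uint64 here, so on overflow the result silently wraps around rather than raising an error:

```python
import numpy as np

bitmap = np.arange(1_000_000, dtype=np.uint64)

# Fast path: integer sum (no per-item float conversion), then one division.
fast_mean = np.sum(bitmap) / bitmap.size

# Reference: np.mean converts every item to float64 first.
assert fast_mean == np.mean(bitmap)
```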
Regarding the bitwise OR, you can preallocate the output array and reuse it so as to avoid page faults: np.bitwise_or(bitmap, np.uint64(6540943), out) where out is the output array. This is 2.7 times faster on my machine. If you do not care about keeping the bitmap values, then perform an in-place operation: np.bitwise_or(bitmap, np.uint64(6540943), bitmap). This avoids the overhead of cache-line write allocations. If you need to write the result into another array, then non-temporal store SIMD instructions are AFAIK the only way to speed the computation up on x86-64 processors. They are available in native languages like C/C++ (using non-portable SIMD intrinsics or compiler annotations). If you cannot preallocate the output array, then filling it in parallel is faster on some systems (not much on Windows, but significantly on Linux).
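The three variants discussed above can be sketched as follows (the array contents are placeholders; the constant is the one from the question):

```python
import numpy as np

bitmap = np.arange(1_000_000, dtype=np.uint64)
query = np.uint64(6540943)

# 1. Naive: allocates a fresh output array on every call (page faults).
res1 = bitmap | query

# 2. Preallocated output, reused across calls: no page faults.
out = np.empty_like(bitmap)
np.bitwise_or(bitmap, query, out)

# 3. In-place: also avoids cache-line write allocations,
#    but overwrites the original bitmap values.
tmp = bitmap.copy()
np.bitwise_or(tmp, query, tmp)

assert np.array_equal(res1, out) and np.array_equal(res1, tmp)
```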
In the end, it turns out that page faults and cache-line write allocations are the biggest overheads on my machine. While Numpy can be improved so as to use SIMD instructions in this case, using scalar instructions is not so inefficient in practice (due to the RAM being slow). Still, a processor running at a significantly lower frequency (e.g. 3 GHz) can significantly benefit from a SIMD implementation.
Efficient set intersection computation
Based on the title, the question and the comments, it looks like you want to perform a set intersection equivalent to np.any((bitmap & query) != 0)
. To do that efficiently, the best solution is not to create/fill big temporary arrays in the first place, as mentioned by @PeterCordes in the comments. This removes the overheads coming from page faults and cache-line write allocations, while reducing the RAM pressure (so arrays can be read faster). Computing arrays chunk by chunk is possible (and certainly faster), but not optimal because the overhead of Numpy is significant on small arrays. Numba and Cython can do it quite efficiently. Here is a naive Numba implementation:
import numba as nb
import numpy as np

@nb.njit('(uint64[::1], uint64[::1])')
def compute(a, b):
    assert a.size == b.size
    x = np.uint64(0)
    for i in range(a.size):
        # Intersection of the two bit sets: OR-accumulate the AND of each pair
        x |= a[i] & b[i]
    return x != 0
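If Numba or Cython is not an option, a chunked pure-NumPy variant with an early exit also avoids the big temporary arrays, at the cost of some per-chunk Numpy overhead. This is a sketch: the function name and the chunk size are arbitrary choices, not anything from the question.

```python
import numpy as np

def any_intersection(a, b, chunk=65536):
    # Process the arrays chunk by chunk so temporaries stay small
    # (cache-resident), and stop as soon as any common bit is found.
    for i in range(0, a.size, chunk):
        if np.any(a[i:i+chunk] & b[i:i+chunk]):
            return True
    return False

a = np.zeros(1_000_000, dtype=np.uint64)
b = np.zeros(1_000_000, dtype=np.uint64)
assert not any_intersection(a, b)
a[500_000] = b[500_000] = np.uint64(1)
assert any_intersection(a, b)
```

The early exit helps most when matches are common; in the worst case (no match) it reads both arrays entirely, like the Numba version.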