
My use case is to use numpy for bitmaps (that is, set operations using bit encoding). I use numpy arrays with uint64. If I have a query with 3 entries, I can then do bitmap | query != 0 to check whether any element of the query is in the set. Amazing!

Now comes the rub: the performance seems inferior to other vectorized operations.

Code:

import numpy as np
N = 7_000_000
def generate_bitmap():
    return np.random.uniform(size=(N)).astype(np.uint64)

bitmap = generate_bitmap()

# A mean operation (a well-known vectorized operation) runs fast.
%timeit -o np.mean(bitmap)
# <TimeitResult : 7.91 ms ± 66.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)>

# Compare against the bitwise OR on the bitmap, >twice as slow
%timeit -o bitmap | np.uint64(6540943)
# <TimeitResult : 14.6 ms ± 136 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)>

Intel supports the AVX-512 instruction _mm512_and_epi64, so I was expecting the same speed.

Guillaume
  • Note that `bitmap` is full of 0 because `uniform` generates 0-1 values that are truncated to 0 once cast to an integer (see: `np.all(bitmap == 0)`). Are the actual integer values big? – Jérôme Richard Aug 07 '23 at 17:57
  • 1
    `bitmap | query` will be non-zero for any non-zero query, regardless of `bitmap`. Did you mean `bitmap & query` to check only some bits in each u64? You mention `_mm512_and_epi64` which does bitwise AND, not OR. – Peter Cordes Aug 07 '23 at 21:19
  • I used `uniform` when I should have used `randint` (the intent was to make a dummy dataset). Thx for noticing. My original dataset is correctly encoded – Guillaume Aug 07 '23 at 22:05

2 Answers

4

TL;DR: Neither of the two functions actually uses SIMD instructions, at least not on Windows with Numpy 1.24.3. As a result, they are sub-optimal. Besides, np.mean only reads data from memory, while the bitwise OR produces a newly allocated output array, which is quite expensive.


Under the hood

First of all, np.mean needs to convert the integer values to floating-point ones, then perform the actual sum, and finally compute the mean. The assembly code executed (hot loop) for the conversion is the following:

Block 3:
    mov rcx, qword ptr [r9]             ; Load the integer item from memory
    dec r10
    xorps xmm0, xmm0
    test rcx, rcx                       ; Strange sanity check
    js 0x18011b0a7 <Block 5>
Block 4:
    cvtsi2sd xmm0, rcx                  ; Convert the uint64 value to a double
    jmp 0x18011b0bc <Block 6>
Block 5:
    [NEVER ACTUALLY EXECUTED]
Block 6:
    movsd qword ptr [rdx+r9*1], xmm0    ; Store the result back in memory
    add r9, 0x8
    test r10, r10
    jnz 0x18011b092 <Block 3>

This loop is inefficient because it is fully scalar, it contains a conditional jump (which seems unneeded at first glance) as well as useless instructions like xorps, and it is not unrolled. This code is certainly not automatically vectorized by the compiler due to the strange condition.

The assembly code executed (hot loop) for the mean is the following:

Block 12:                   
    prefetcht0 zmmword ptr [r15+rcx*1]
    vaddsd xmm5, xmm5, qword ptr [rcx]        ; Block of scalar additions
    vaddsd xmm2, xmm2, qword ptr [rcx+rdi*1]  ; which also load data from memory
    vaddsd xmm6, xmm6, qword ptr [rcx+rdi*2]
    vaddsd xmm0, xmm0, qword ptr [r9+rcx*1]
    vaddsd xmm3, xmm3, qword ptr [r10+rcx*1]
    vaddsd xmm1, xmm1, qword ptr [rbx+rcx*1]
    vaddsd xmm7, xmm7, qword ptr [rcx+rbp*2]
    mov rax, rcx
    add rcx, r14
    sub rax, rdi
    vaddsd xmm4, xmm4, qword ptr [rax]
    sub r11, 0x1
    jnz 0x18003a0b0 <Block 12>

This loop computes the scalar sum of 8 double-precision numbers per iteration. It is not optimal, but it is relatively good in practice on most machines. This function performs a pair-wise sum instead of a naive sum so as to be more accurate. This code is not automatically vectorized by the compiler because of the access pattern. One would certainly need to write SIMD intrinsics manually to speed it up. However, that carries a significant maintainability overhead (should we keep the scalar version, and do we need to support several different floating-point types?).
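For reference, pair-wise summation conceptually works like the short Python sketch below (Numpy's actual implementation is in C and uses larger blocks; this is only meant to illustrate why it is more accurate than a naive running sum):

import numpy as np

def pairwise_sum(x):
    # Split the array in two halves, sum each half recursively, then add the
    # two partial sums. The rounding error grows roughly as O(log n) instead
    # of O(n) for a naive left-to-right accumulation.
    if x.size <= 8:              # small base case: plain accumulation
        s = 0.0
        for v in x:
            s += float(v)
        return s
    mid = x.size // 2
    return pairwise_sum(x[:mid]) + pairwise_sum(x[mid:])

# pairwise_sum(bitmap) / bitmap.size approximates np.mean(bitmap)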

The assembly code executed (hot loop) for the bitwise OR is the following:

Block 39:
    mov rcx, r10                   ; r10 is your scalar integer
    lea r9, ptr [r9+0x8]
    or rcx, qword ptr [rax]        ; Bitwise OR reading data from memory
    lea rax, ptr [rax+0x8]
    mov qword ptr [r9-0x8], rcx    ; Store the result back in memory
    sub r8, 0x1
    jnz 0x180187891 <Block 39>

This loop is not efficient: it is a scalar loop which is not unrolled. My guess is that it is not automatically vectorized by the compiler because one of the input values is a scalar. This is a pity, because it is relatively easy for compilers to use SIMD instructions in this case.

I submitted a PR that might improve the speed of these functions, but I do not expect a huge performance increase in this specific case.

On top of that, the bitwise OR produces a newly allocated output array, which is quite expensive. Indeed, filling large newly-allocated arrays causes page-faults. On Windows, a kernel thread is (mainly) responsible for doing that and is busy for about half of the computation time.

Moreover, scalar writes are also inefficient on x86-64 platforms because cache-line write allocation causes cache lines to be filled from RAM before they are written back. This means half the RAM bandwidth is wasted. There are non-temporal store SIMD instructions to avoid this problem, but they require the code to be vectorized and require knowing that the output array will not be reused from a cache later (which is hard to know, since it depends on the actual processor used, the input array size and even the user code).

A similar behaviour can be seen on Linux with Numpy 1.24.2 (though results are significantly better).


Workarounds

Regarding the mean, you can use np.sum(bitmap)/bitmap.size if you know that the sum does not overflow. This is about 2.5 times faster on my machine (i5-9600KF).

Regarding the bitwise OR, you can preallocate the output array and reuse it so as to avoid page-faults: np.bitwise_or(bitmap, np.uint64(6540943), out) where out is the output array. This is 2.7 times faster on my machine. If you do not care about keeping the bitmap values, then perform an in-place operation: np.bitwise_or(bitmap, np.uint64(6540943), bitmap). This avoids the overhead of cache-line write allocation. If you need to write the result into another array, then non-temporal store SIMD instructions are AFAIK the only way to speed the computation up on x86-64 processors. They are available in native languages like C/C++ (using non-portable SIMD intrinsics or compiler annotations). If you cannot preallocate the output array, then filling it in parallel is faster on some systems (not much on Windows, but significantly on Linux).
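As a minimal sketch of the two preallocation variants above (the query constant 6540943 comes from the question; the rng.integers call is only there to build a non-trivial dummy bitmap):

import numpy as np

N = 7_000_000
rng = np.random.default_rng()
bitmap = rng.integers(0, 2**63, size=N, dtype=np.uint64)  # dummy data
query = np.uint64(6540943)

# Preallocate the output once and reuse it across calls: the page-faults of
# filling a freshly allocated array are then paid only on the first use.
out = np.empty_like(bitmap)
np.bitwise_or(bitmap, query, out)

# In-place variant: also avoids the cache-line write-allocate overhead,
# at the cost of overwriting the original bitmap.
np.bitwise_or(bitmap, query, bitmap)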

In the end, it turns out that page-faults and cache-line write allocations are the biggest overheads on my machine. While Numpy can be improved so as to use SIMD instructions in this case, using scalar instructions is not so inefficient in practice (due to the RAM being slow). Processors running at a significantly lower frequency (e.g. 3 GHz) can still benefit significantly from a SIMD implementation.


Efficient set intersection computation

Based on the title, the question and the comments, it looks like you want to perform a set-intersection check equivalent to np.any((bitmap & query) != 0). To do it efficiently, the best solution is not to create/fill big temporary arrays in the first place, as mentioned by @PeterCordes in the comments. This removes the overheads coming from page-faults and cache-line write allocations, while reducing the RAM pressure (so arrays can be read faster). Computing arrays chunk by chunk is possible (and certainly faster), but not optimal because the overhead of Numpy is significant on small arrays. Numba and Cython can do it quite efficiently. Here is a naive Numba implementation:

import numpy as np
import numba as nb

@nb.njit('(uint64[::1], uint64[::1])')
def compute(a, b):
    # OR-reduce both arrays into a single scalar and test it for non-zero.
    assert a.size == b.size
    x = np.uint64(0)
    for i in range(a.size):
        x |= a[i] | b[i]
    return x != 0
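For comparison, here is a sketch of the chunk-by-chunk Numpy approach mentioned above, with the early-out suggested by @PeterCordes in the comments. It assumes the query is encoded as a single uint64 mask (as in the question); the chunk size is an arbitrary, cache-friendly guess:

import numpy as np

def any_intersection(bitmap, query_mask, chunk=256 * 1024):
    # Scan the bitmap in cache-sized chunks: the small temporary produced by
    # `block & query_mask` stays in cache, and the loop stops as soon as a
    # non-empty intersection is found instead of touching the whole array.
    for start in range(0, bitmap.size, chunk):
        block = bitmap[start:start + chunk]
        if np.any(block & query_mask):
            return True
    return False

On inputs where a match occurs early, this reads only a fraction of the array; on all-miss inputs it still reads everything once, but it never writes a large output array.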
Jérôme Richard
  • Thanks Jerome! This is a great feedback. 2 questions if you can clarify: 1- If it's not vectorized, then wouldn't the vectorized version be 8X faster (instead of the 2.5 you're reporting using your np.sum / count). 2- I'm curious as to why the bitwise_or is not vectorized. Isn't it a super basic 1:1 translation? Thanks again! – Guillaume Aug 07 '23 at 21:00
  • `test rcx, rcx` / `js 0x18011b0a7 ` is implementing **unsigned** u64 -> f64 conversion in terms of `cvtsi2sd` signed (i64) -> f64 conversion without AVX-512 `cvtusi2sd`. For numbers with the sign bit clear, we can just use `cvtsi2sd` directly, treating it as a non-negative integer. Otherwise we'd convert as a negative integer and add `(double)(2**64)` or something. (Packed-SIMD conversion for unsigned and/or 64-bit integers was new in AVX-512.) – Peter Cordes Aug 07 '23 at 21:08
  • 1
    Can you slice the numpy array to cache-block the operation? Like check that the first 4 MiB are all zero, then the second 4 MiB, etc. Or the first 128 KiB (half L2) or 16K (half L1). Then you only need a scratch buffer that size. Repeatedly overwriting the same small buffer means you only pay for that amount of writing once, when it's eventually evicted. This is still hilariously dumb vs. a read-only OR reduction over the array to get one scalar to check for non-zero. (Ideally with early-out check every so often so you don't read the whole array if there's a set bit in the first 1 KiB.) – Peter Cordes Aug 07 '23 at 21:15
  • I have updated the conclusion to make it a bit more clear. 1) Not here: a SIMD code will be faster only if it is not *bound by the RAM or the mentioned overheads* (i.e. page faults and write-back allocation). This is very dependent on the target machine. On my machine, the scalar implementation is pretty good and the overheads were the main issue. SIMD instructions are only worth it if you can reduce the overhead. NT-Stores+AVX can give a speed-up of 1.5-2x on my machine, but more on a low-frequency CPU, and even more for arrays fitting in cache (e.g. L2/L3). – Jérôme Richard Aug 07 '23 at 21:15
  • 2) As indicated in the answer, my guess is that *"this loop is not automatically vectorized by the compiler certainly because one of the input value is a scalar"*. However, I do not have strong evidence for this so far. – Jérôme Richard Aug 07 '23 at 21:20
  • `x |= a[i] | b[i]` is union, not intersection, assuming `|` is bitwise OR like in C. Presumably you want `x |= a[i] & b[i]` to check for a non-zero intersection. Ideally with an early-out check maybe every 1 to 4 KiB, if non-empty intersections are common in large arrays. – Peter Cordes Aug 08 '23 at 00:40
  • (In x86 assembly or C with intrinsics, I'd probably check every iteration of an unrolled loop, so maybe every 128 to 256 bytes, but compilers, especially JITs, might not use `vpcmpeqb` / `vpmovmskb` to do this efficiently on x86. On ARM or AArch64 I might use an outer loop that checked less often, like 1K or 4K, or for AArch64 without SVE, might use scalar `ldp` to load 16 bytes at once into 2 scalar integer regs, if early-outs are common and I'm tuning for a CPU without really efficient vector -> integer data transfer for branching.) – Peter Cordes Aug 08 '23 at 00:46
1

Your array is 53 MiB large (7,000,000 × 8-byte integers). Most operations on an array that size are limited by memory bandwidth. np.mean has to read every value once. The bitwise OR has to read each value and store a new value to memory. So you can expect half the throughput, since you do double the work.

Both operations are vectorized, so this has no influence.

Homer512
  • Ah. I see. That would explain it. Thx for this. – Guillaume Aug 07 '23 at 16:16
  • 1
    This is certainly wrong. On my machine I get respectively 6.63 ms and 15.6 ms. This means respectively 7.9 GiB/s and 3.3 GiB/s for the memory throughput. This is far smaller than my memory bandwidth, which is ~40 GiB/s. VTune confirms that the RAM is not saturated in both cases, especially during the computation of the `np.mean` (8.5 GiB/s). One core can typically reach >16 GiB/s with a SIMD code. Thus, the code certainly does not use SIMD instructions, nor does it saturate the RAM (at least not on my machine on Windows, with results close to the ones of the OP). – Jérôme Richard Aug 07 '23 at 17:03
  • 1
    The above one was on Windows with Numpy 1.24.3. On Linux I get 4.81 ms and 6.73 ms with Numpy 1.24.2. This is better but still far from optimal. Intel hardware performance counters (mainly `fp_arith_inst_retired.scalar_double`) clearly show that the code mostly uses scalar (SSE) instructions. Neither of the two seems vectorized nor optimized on either Linux or Windows. – Jérôme Richard Aug 07 '23 at 17:31
  • @JérômeRichard Scalar SSE? That sounds weird for int64, doesn't it? You would think that if it isn't vectorized, int64 is computed in GP registers. I'll see if I can figure something out later. But even if it isn't vectorized, I guess it's either front-end pressure or page-faults because the 53 MiB output array is freshly allocated, don't you think? – Homer512 Aug 07 '23 at 18:01
  • 2
  • @Homer512 Scalar SSE is for the `np.mean`. For the integers and SSE/AVX, AFAIK there is no Intel performance counter for that (sadly), but the executed assembly instructions reported by VTune are not SSE/AVX for the `or` operating on `uint64`. Numpy uses the `or` x86 scalar instruction and operates on GPRs (like `rax` and `rcx`). Page faults indeed significantly impact results, but it looks like this is not the bottleneck based on VTune data: Windows uses a separate kernel thread for that and it is not saturated (it takes <50% of the time). – Jérôme Richard Aug 07 '23 at 18:15
  • @Homer512 Regarding the front end, I expect the LSD to do the job for small loops so the front-end can be freed, but it is disabled on Skylake-like CPUs including mine (i5-9600KF). The uops will be decoded (mostly) once, then stored in the DSB and replayed, so 6 uops/cycle can be sent to the backend, which will be saturated (assuming the loops are correctly aligned). More specifically, I expect the load/store ports to be saturated since my CPU can only store 1 scalar at a time (newer CPUs can do up to 2). – Jérôme Richard Aug 07 '23 at 18:31
  • 1
    @JérômeRichard: The issue/rename bottleneck in Skylake is 4-wide. The 6-wide fetch from uop cache only feeds a buffer between it and the IDQ, apparently, since even the IDQ can't accept more than 4 uops per cycle from DSB, or at least there's a counter for `idq.all_dsb_cycles_4_uops`. But anyway, if 4/clock uops isn't a bottleneck, then the ROB/RS will fill and so will the IDQ, as long as a tiny loop isn't split across two DSB lines or something which reduces throughput to worse than the back-end bottleneck. – Peter Cordes Aug 07 '23 at 21:04
  • 1
    In the end, page-faults were the biggest overhead on my machine on Windows. Not so much on Linux. VTune results were underestimating the overhead on Windows (60-65% in practice rather than 45-50%)... – Jérôme Richard Aug 07 '23 at 21:30