
I have a simple Rust function that parses varint-encoded integers.

struct Reader {
    pub pos: usize,
    pub corrupt: bool,
}
impl Reader {
    // Decode a LEB128-style varint: 7 payload bits per byte, with the high
    // bit signalling that another byte follows. Sets `corrupt` and returns 0
    // on overlong input or when `read_u8` runs past the end of the buffer.
    fn read_var(&mut self, b: &[u8]) -> u64 {
        let mut i = 0u64;
        let mut j = 0;
        loop {
            if j > 9 {
                // more than 10 bytes cannot encode a u64
                self.corrupt = true;
                return 0;
            }
            let v = self.read_u8(b);
            i |= (u64::from(v & 0x7F)) << (j * 7);
            if (v >> 7) == 0 {
                // high bit clear: this was the last byte
                return i;
            } else {
                j += 1;
            }
        }
    }

    fn read_u8(&mut self, b: &[u8]) -> u8 {
        if self.pos < b.len() {
            let v = b[self.pos];
            self.pos += 1;
            v
        } else {
            self.corrupt = true;
            0
        }
    }
}
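
For reference, a quick sanity check of the encoding this handles (my own example, not part of the original code):

fn main() {
    // 300 = 0b10_0101100: the low 7 bits (0x2C) go in the first byte with the
    // continuation bit set (0x80 | 0x2C = 0xAC), the remaining bits (0x02) in
    // the second byte.
    let mut r = Reader { pos: 0, corrupt: false };
    assert_eq!(r.read_var(&[0xAC, 0x02]), 300);
    assert!(!r.corrupt);
    assert_eq!(r.pos, 2);
}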

I have two versions of generated code produced by different compilers: a non-SIMD version and a SIMD version.

The non-SIMD version is relatively easy to understand: it inlines read_u8 and unrolls the loop. I am not familiar with SIMD instructions, but the SIMD version seems to have a similar structure.

One weird thing: when I run the SIMD version in multiple threads concurrently on a multi-core machine (each thread with its own Reader object), processing throughput drops significantly, yet CPU utilization is higher than in the single-threaded version. The non-SIMD version's throughput scales linearly with the concurrency level.

Does anyone know how this could happen?

Some related questions:

  • The code does not look like it could benefit from SIMD. Why is SIMD code generated?
  • Is it possible to disable SIMD generation for a single function?
user607722
  • Was your SIMD build in Debug mode without optimizations applied, or was it a Release build (i.e. ostensibly optimized)? (Can you post a working example to Godbolt and share the link?) – Dai May 08 '23 at 07:03
  • This is decoding data with 7 bits per byte, terminated by a byte with the high bit clear? (A little bit similar to UTF-8.) SIMD is tricky for this, but BMI2 `pext` (parallel bitfield extract and pack) works in combination with `u64 | 0x7F7F7F7F...` / find first 0. (Or actually `~u64 & 0x8080808080...` with BMI1 `andn`, and find the lowest `1` bit with `tzcnt`/8 to advance the ptr (or SIMD pcmpeqb / pmovmskb). And x86 `bzhi` using the index, or `blsmsk` on the mask + `and` on the u64 data, to zero out garbage from past the terminating byte.) With 9-byte inputs being a special case. (A rough sketch of this appears after these comments.) – Peter Cordes May 08 '23 at 15:19
  • Still, there might be something to gain at least with AVX-512; see Daniel Lemire's and Robert Clausecker's paper Transcoding Unicode Characters with AVX-512 Instructions (UTF-8 to UTF-16) https://d197for5662m48.cloudfront.net/documents/publicationstatus/121638/preprint_pdf/f1cba80f1709ab20707bf8d97b19886c.pdf . Their techniques might or might not extend to wider sequences of bytes, like UTF-32, or your variable-length encoding. Dan Lemire has also worked on UTF-8 *validation*, which is easier: https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/ – Peter Cordes May 08 '23 at 15:20
  • But yes, I don't expect a compiler would auto-vectorize this. And as you later discovered, it didn't; the SIMD instructions are doing something else. – Peter Cordes May 08 '23 at 15:35
  • Related: [QWORD shuffle sequential 7-bits to byte-alignment with SIMD SSE...AVX](https://stackoverflow.com/q/10850054) for SIMD bit-manipulation for this variable-length encoding. – Peter Cordes May 09 '23 at 01:19
  • For doing this with SIMD instead of scalar `pext`, on Ice Lake with AVX-512 you can use `vpmultishiftqb` to do parallel bitfield extracts within qwords... except that grabs 8-bit groups, not 7. Lining up variable length inputs into SIMD qwords might be a matter of using `pmovmskb` as a lookup table for `pshufb` control vectors, as in http://0x80.pl/notesen/2023-04-09-faster-parse-ipv4.html / [Fastest way to get IPv4 address from string](//stackoverflow.com/a/31683632). Maybe with some limits to keep the table size small, since we can only handle the first 2 qwords with a 128-bit shuffle – Peter Cordes May 09 '23 at 01:22
  • Maybe AVX-512 variable-count rotates would be useful for moving bit-groups around. Anyway, if you want to ask a different question about how to optimize decoding this byte-stream into u64 elements, it would be interesting. – Peter Cordes May 09 '23 at 01:23
  • Just for fun, I had a look at using `pext` to speed this up for one u64 at a time on Haswell / Zen 3. According to uiCA, the inner loop (that only handles u64 values up to 56-bit, i.e. 8 bytes of src data) might run as fast as 2.5 cycles per u64 on Skylake, with no dependence on the length of the u64. (And an outer loop of 64-byte chunks, to hide load / bit-scan pointer increment latency). https://godbolt.org/z/qYjTMc7zG in case anyone ever sees this. – Peter Cordes May 11 '23 at 02:52
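
A minimal sketch of the scalar `pext` approach described in the comments above, as I read it (my own code, not the linked Godbolt version; it assumes x86-64 with BMI1/BMI2, at least 8 readable bytes at the cursor, and values whose encoding fits in 8 bytes; longer encodings and the buffer tail would need a fallback such as the original loop):

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "bmi1,bmi2")]
unsafe fn read_var_pext(b: &[u8], pos: &mut usize) -> Option<u64> {
    use core::arch::x86_64::{_blsmsk_u64, _pext_u64, _tzcnt_u64};
    // Load 8 bytes little-endian; bail out near the end of the buffer.
    if *pos + 8 > b.len() {
        return None;
    }
    let chunk = u64::from_le_bytes(b[*pos..*pos + 8].try_into().unwrap());
    // ~chunk & 0x8080...: a set bit marks a byte whose high bit is clear,
    // i.e. a candidate terminating byte.
    let term = !chunk & 0x8080_8080_8080_8080;
    if term == 0 {
        return None; // no terminator within 8 bytes -> take the slow path
    }
    // blsmsk keeps every bit up to and including the lowest set bit, which
    // zeroes out garbage bytes past the terminating byte.
    let kept = chunk & _blsmsk_u64(term);
    // Pack the low 7 bits of each byte into one integer.
    let value = _pext_u64(kept, 0x7F7F_7F7F_7F7F_7F7F);
    // tzcnt/8 is the index of the terminating byte; advance past it.
    *pos += (_tzcnt_u64(term) / 8 + 1) as usize;
    Some(value)
}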

1 Answer


It turned out the SIMD version was built with "RUSTFLAGS=-Cinstrument-coverage", which injects additional instructions to track code execution. I'm not quite sure how these injected instructions work or how they affect performance.
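
My best guess at how this could explain the scaling behavior: `-C instrument-coverage` uses LLVM source-based coverage, which inserts counter increments into the compiled code, and those counters live in process-global memory shared by every thread. When many threads run the same hot function, they all keep writing to the same counter cache lines, burning CPU on cache-line transfers without doing useful work. A rough illustration of the shape of the injected work (not the real mechanism: the real counters are compiler-generated globals, and I believe the increments are plain non-atomic stores by default; the atomics here are only to keep the sketch free of data races):

use std::sync::atomic::{AtomicU64, Ordering};

// Stand-ins for compiler-generated coverage counters: one slot per code
// region, shared by every thread in the process.
static ENTRY_COUNT: AtomicU64 = AtomicU64::new(0);
static LOOP_COUNT: AtomicU64 = AtomicU64::new(0);

fn read_var_instrumented(r: &mut Reader, b: &[u8]) -> u64 {
    ENTRY_COUNT.fetch_add(1, Ordering::Relaxed); // "counter" on function entry
    let mut i = 0u64;
    let mut j = 0;
    loop {
        LOOP_COUNT.fetch_add(1, Ordering::Relaxed); // "counter" on every iteration
        if j > 9 {
            r.corrupt = true;
            return 0;
        }
        let v = r.read_u8(b);
        i |= u64::from(v & 0x7F) << (j * 7);
        if v >> 7 == 0 {
            return i;
        }
        j += 1;
    }
}

With several threads hammering the same few counters, the cache lines holding them bounce between cores, so each thread stalls on memory traffic even though every thread has its own Reader.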

user607722