
I have a function that receives a character, checks it, and returns another character that depends on the one received.

I used a `switch` to check the provided character and return what we want, but I need more speed, so I also wrote an SSE2 version.

My SSE2 function is 1.5x slower than the `switch` function. Why? What's slow about my SSE2 function, and what is gcc -O3 doing to implement the `switch` that makes it so fast?

char
switch_func(char c) {
    switch (c) {
        case '0':
            return 0x40;
        case '1':
            return 0x41;
        case '2':
            return 0x42;
        case '3':
            return 0x43;
        case '4':
            return 0x44;
        case '5':
            return 0x45;
        case '6':
            return 0x46;
        case '7':
            return 0x47;
        case '8':
            return 0x48;
        case '9':
            return 0x49;
        case 'a':
            return 0x4a;
        case 'b':
            return 0x4b;
        case 'c':
            return 0x4c;
        case 'd':
            return 0x4d;
        case 'e':
            return 0x4e;
        case 'f':
            return 0x4f;
        default:
            return 0x00;
    }
}

And the SSE2 function:

char
SSE2_func(char c) {

    __m128i vec0 = _mm_set_epi8('f','e','d','c','b','a','9',
            '8','7','6','5','4','3','2','1','0');
    __m128i vec1 = _mm_set1_epi8(c);

    static char list[] = {
            0x40,0x41,0x42,0x43,0x44,0x45,0x46,0x47,0x48,0x49,0x4a,0x4b,0x4c,0x4d,0x4e,0x4f
    };

    vec1 = _mm_cmpeq_epi8(vec0, vec1); // Compare to find (c) in (vec0) list

    int x;
    if((x = _mm_movemask_epi8(vec1)) != 0) {
        if((x = __builtin_ctz(x)) < 16) { // x is the position of (c) character in (list[])
            return list[x];
        }
    }
    return 0x00;
}

GCC compiler flags: -O3 -msse2

Jason
  • honestly, I think it's hard to outperform `switch()` with a decent optimizer. – Ingo Leonhardt Aug 06 '19 at 17:11
  • How are you benchmarking this (what kind of loop)? And what CPU are you testing on? Ryzen? Skylake? Core 2? Also which gcc version? (I think the question is mostly answerable without that, but a better implementation than both may be possible for CPUs with fast `cmov` because you only have 2 contiguous ranges). Like `tmp = c - 'a'` and if that wraps (unsigned) then assume it was a `0..9` digit. Can you assume your input is a hex digit, or does this function *need* to return `0` otherwise? What about chars between `'9'` and `'a'`: are those possible? – Peter Cordes Aug 06 '19 at 21:27
  • An SSE2 (or generalized vector extensions) solution _could_ do 16 at a time (or more) instead of 1. The switch solution is fast for single items because the compiler generates a lookup table that you can easily create using designated initializers. Your SSE2 solution can't be vectorized by normal compilers because you perform non-vectorizable ops on individual components. Using gcc vector extensions only to mathematically compute the value will allow you to do 8, 16, 32, 64 or more at a time ... but you didn't state your use case. Need more info to answer. – technosaurus Aug 07 '19 at 00:19
  • @technosaurus: we have mostly enough info to answer the question asked (why is switch faster), like I did in comments on the answer with asm output. But yeah we don't have enough info to know if an answer that processes 4, 8, or 16 bytes at a time with `_mm_cmpgt_epi8` and a blend would be useful, and how much checking for out-of-bounds chars is needed. But nice idea, that's the SIMD version of what I was thinking with `cmov`. – Peter Cordes Aug 07 '19 at 02:04
  • @PeterCordes yes, we have enough to answer the question asked, but not the question intended. It goes back to your question about how it's benchmarked. The answers are different if it's just called often in hot code (LUT is fast) or if it's sequentially computed in large batches (SIMD may be faster). The SIMD code may be different depending on whether the input and output data can be aligned or if it's in random sized chunks in random locations. Your `cmov` solution can be done with vector extensions using compare+& wizardry. – technosaurus Aug 07 '19 at 03:22
  • @technosaurus: yes exactly. Hopefully that helps the OP figure out what they need. In large batches SIMD will pretty *definitely* be faster if the input data is contiguous in memory. (Especially if we can use SSE4.1 for `pblendvb` and run on a CPU where that doesn't create a port5 bottleneck. AVX512BW would be the best case, giving us unsigned byte compares so range checks can be sub/unsigned-cmp without having to range-shift for `pcmpgtb` by flipping the top bit (or use saturating sub / pcmpeqb instead), and AVX512 makes blending nearly free as part of other operations.) – Peter Cordes Aug 07 '19 at 03:30
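
As a concrete illustration of the "two contiguous ranges" idea raised in those comments, here is a minimal scalar sketch (the name `two_range_func` is mine and this is not Peter Cordes' exact suggestion; with -O3 a compiler may well turn the two range checks into branchless code):

char
two_range_func(char c) {
    unsigned char u = (unsigned char)c;
    if (u - (unsigned)'0' <= 9u)   // '0'..'9': '0' is 0x30, so adding 0x10 gives 0x40..0x49
        return (char)(u + 0x10);
    if (u - (unsigned)'a' <= 5u)   // 'a'..'f': 'a' is 0x61, so subtracting 0x17 gives 0x4a..0x4f
        return (char)(u - 0x17);
    return 0x00;                   // everything else, as in the original switch
}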

2 Answers


Most compilers will convert your switch into a lookup table or jump table as if it were similar to the following code:

char lut_func(char c){
    static const char lut[256] = {
        ['0']=0x40, ['1']=0x41, ['2']=0x42, ['3']=0x43,
        ['4']=0x44, ['5']=0x45, ['6']=0x46, ['7']=0x47,
        ['8']=0x48, ['9']=0x49, ['a']=0x4a, ['b']=0x4b,
        ['c']=0x4c, ['d']=0x4d, ['e']=0x4e, ['f']=0x4f,
        /* everything else is set to 0 automatically */
    };
    return  lut[(unsigned char)c];
}

The only problems with this:

  • it cannot be vectorized
  • the commonly used data ('0'-'9', 'a'-'f') spans two 64-byte data cache lines

You can remedy the cache-line misses by properly aligning and offsetting the data (your compiler may be able to do this for you if you profile your code), something like this:

char lut_func(char c){
    static const char __attribute__((aligned(64))) lut_data[256+16] = {
        ['0'+16]=0x40, ['1'+16]=0x41, ['2'+16]=0x42, ['3'+16]=0x43,
        ['4'+16]=0x44, ['5'+16]=0x45, ['6'+16]=0x46, ['7'+16]=0x47,
        ['8'+16]=0x48, ['9'+16]=0x49, ['a'+16]=0x4a, ['b'+16]=0x4b,
        ['c'+16]=0x4c, ['d'+16]=0x4d, ['e'+16]=0x4e, ['f'+16]=0x4f,
        /* everything else is set to 0 automatically */
    };
    const char *lut = lut_data + 16;
    return  lut[(unsigned char)c];
}

It's hard to say whether this will help much, since neither the makeup of the data nor the benchmark was included.
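
As an aside, since the benchmark wasn't shown, here is a minimal harness sketch (my own; `checksum_switch` is a made-up name) that pushes every byte of a buffer through the function under test and keeps the result live so the calls cannot be optimized away. Timing the loop is left to whatever clock you prefer:

#include <stddef.h>

unsigned
checksum_switch(const char *buf, size_t n) {
    unsigned sink = 0;                          // accumulate results so the calls stay observable
    for (size_t i = 0; i < n; i++)
        sink += (unsigned char)switch_func(buf[i]);
    return sink;
}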

The hand-written SSE2 code (though clever) unfortunately mixes in scalar, non-SSE2 operations (`__builtin_ctz`, the `if` branches, and the `char` array access) that slow it down and make it difficult to auto-vectorize, especially if you are limited to SSE2. This is simply less efficient than a single table access when the data is already "hot" in cache. It might still be worth using the SSE2 version if it is infrequently called, but in that case you wouldn't need to optimize it at all.

If you can access the data sequentially, you can use vector extensions to get SIMD code, something like this:

//this vector extension syntax requires gcc or clang versions 5+
typedef __INT8_TYPE__ i8x16 __attribute__ ((__vector_size__ (16), aligned(16), __may_alias__));
i8x16 vec_func(i8x16 c){
    i8x16 is09 = (c>='0') & (c<='9');   // per lane: -1 where c is '0'..'9', else 0
    i8x16 isaf = (c>='a') & (c<='f');   // per lane: -1 where c is 'a'..'f', else 0
    return (c & (is09 | isaf)) + (16 & is09) - (23 & isaf); // '0'..'9' -> c+0x10, 'a'..'f' -> c-0x17, else 0
}
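
A hypothetical usage sketch (not part of the answer; `translate_buffer` is a made-up name) showing how `vec_func` could be driven over a contiguous buffer 16 bytes at a time, with `memcpy` for the loads and stores so alignment doesn't matter, and the question's `switch_func` handling the tail:

#include <stddef.h>
#include <string.h>

void
translate_buffer(char *dst, const char *src, size_t n) {
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        i8x16 v;
        memcpy(&v, src + i, 16);   // load 16 input characters
        v = vec_func(v);           // translate all 16 at once
        memcpy(dst + i, &v, 16);   // store 16 results
    }
    for (; i < n; i++)             // scalar tail for the last n % 16 bytes
        dst[i] = switch_func(src[i]);
}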

Compiled for architectures with SIMD instructions (x86_64, ARM+NEON, PPC+AltiVec, etc.), this compiles to ~20 instructions and accesses around 80 bytes of data to compute 16 sequential characters (with AVX2 you can do 32 with minimal modification).

For example, compiling for generic x86_64 yields:

vec_func:                                   # @lu16
    movdqa  xmm1, xmm0
    pcmpgtb xmm1, xmmword ptr [rip + .LCPI0_0]
    movdqa  xmm2, xmmword ptr [rip + .LCPI0_1] # xmm2 = [58,58,58,58,58,58,58,58,58,58,58,58,58,58,58,58]
    pcmpgtb xmm2, xmm0
    movdqa  xmm3, xmm0
    pcmpgtb xmm3, xmmword ptr [rip + .LCPI0_2]
    pand    xmm2, xmm1
    movdqa  xmm1, xmmword ptr [rip + .LCPI0_3] # xmm1 = [103,103,103,103,103,103,103,103,103,103,103,103,103,103,103,103]
    pcmpgtb xmm1, xmm0
    pand    xmm1, xmm3
    movdqa  xmm3, xmm2
    por     xmm3, xmm1
    pand    xmm3, xmm0
    pand    xmm2, xmmword ptr [rip + .LCPI0_4]
    pand    xmm1, xmmword ptr [rip + .LCPI0_5]
    por     xmm1, xmm2
    paddb   xmm1, xmm3
    movdqa  xmm0, xmm1
    ret

or with AVX2 enabled:

vec_func:
    vpcmpgtb        xmm1, xmm0, xmmword ptr [rip + .LCPI0_0]
    vmovdqa xmm2, xmmword ptr [rip + .LCPI0_1] # xmm2 = [58,58,58,58,58,58,58,58,58,58,58,58,58,58,58,58]
    vpcmpgtb        xmm2, xmm2, xmm0
    vpcmpgtb        xmm3, xmm0, xmmword ptr [rip + .LCPI0_2]
    vpand   xmm1, xmm1, xmm2
    vmovdqa xmm2, xmmword ptr [rip + .LCPI0_3] # xmm2 = [103,103,103,103,103,103,103,103,103,103,103,103,103,103,103,103]
    vpcmpgtb        xmm2, xmm2, xmm0
    vpand   xmm2, xmm3, xmm2
    vpor    xmm3, xmm1, xmm2
    vpand   xmm0, xmm3, xmm0
    vpand   xmm1, xmm1, xmmword ptr [rip + .LCPI0_4]
    vpand   xmm2, xmm2, xmmword ptr [rip + .LCPI0_5]
    vpor    xmm1, xmm2, xmm1
    vpaddb  xmm0, xmm1, xmm0
    ret

and aarch64:

vec_func:
    movi    v2.16b, 0x61
    movi    v4.16b, 0x66
    movi    v1.16b, 0x30
    movi    v5.16b, 0x39
    cmge    v3.16b, v0.16b, v2.16b
    cmge    v2.16b, v4.16b, v0.16b
    cmge    v1.16b, v0.16b, v1.16b
    cmge    v5.16b, v5.16b, v0.16b
    movi    v4.16b, 0x10
    and     v2.16b, v3.16b, v2.16b
    and     v1.16b, v1.16b, v5.16b
    movi    v5.16b, 0x17
    and     v3.16b, v1.16b, v4.16b
    orr     v1.16b, v1.16b, v2.16b
    and     v2.16b, v2.16b, v5.16b
    and     v1.16b, v1.16b, v0.16b
    add     v1.16b, v1.16b, v3.16b
    sub     v0.16b, v1.16b, v2.16b
    ret

or POWER9:

vec_func:
    xxspltib 35, 47
    xxspltib 36, 58
    vcmpgtsb 3, 2, 3
    vcmpgtsb 4, 4, 2
    xxland 0, 35, 36
    xxspltib 35, 96
    xxspltib 36, 103
    vcmpgtsb 3, 2, 3
    vcmpgtsb 4, 4, 2
    xxland 1, 35, 36
    xxlor 2, 0, 1
    xxlxor 3, 3, 3
    xxsel 34, 3, 34, 2
    xxspltib 2, 16
    xxsel 35, 3, 2, 0
    xxspltib 0, 233
    xxsel 36, 3, 0, 1
    xxlor 35, 36, 35
    vaddubm 2, 3, 2
    blr
technosaurus
  • Also important to mention that `_mm_set1_epi8` is not particularly fast without SSSE3 for a single-uop broadcast with `pshufb` instead of a chain of 3 shuffles. On Intel Haswell and later, 1 shuf / clock + `movd xmm, eax` could be a port5 throughput bottleneck. There enough total work in the stand-alone version of the function in another answer to fill the front-end for 3 cycles, but not 4. But anyway, `movd` + 3x shuffles is not cheap, compared to AVX512BW `vpbroadcastb xmm0, eax` which is [1 port5 uop total on SKX](https://www.uops.info/html-instr/VPBROADCASTB_XMM_K_R32.html) – Peter Cordes Aug 09 '19 at 01:10
  • I was going to say you can save the `set1(16)` constant by using a left shift to get `-16` from `0xFF`, but that doesn't work because x86 doesn't have SIMD shifts with byte granularity, only word and wider. Also, clang with AVX512BW does a very good job with your function (https://godbolt.org/z/3lJkqu), using masked compare-into-mask to AND two compare results together for free. – Peter Cordes Aug 09 '19 at 01:19
  • gcc and clang pick interesting different strategies for a scalar version of your function: https://godbolt.org/z/eIXM0L. gcc using `sbb edx,edx` to get a `-1` in a register to actually implement the code as written, clang using `cmov`. – Peter Cordes Aug 09 '19 at 01:25
  • The GCC switch code is in Bwebb's answer: it subtracts 48 and uses the result as an offset into a 54-item table. – Antti Haapala -- Слава Україні Aug 09 '19 at 02:17
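
For reference, a hedged C rendering of the lowering that last comment describes (the name `switch_as_table` is illustrative only, and the exact table size gcc emits may differ): subtract '0', bounds-check the result, then index a small data table covering '0' through 'f'.

char
switch_as_table(char c) {
    static const char table['f' - '0' + 1] = {
        ['0' - '0'] = 0x40, ['1' - '0'] = 0x41, ['2' - '0'] = 0x42, ['3' - '0'] = 0x43,
        ['4' - '0'] = 0x44, ['5' - '0'] = 0x45, ['6' - '0'] = 0x46, ['7' - '0'] = 0x47,
        ['8' - '0'] = 0x48, ['9' - '0'] = 0x49, ['a' - '0'] = 0x4a, ['b' - '0'] = 0x4b,
        ['c' - '0'] = 0x4c, ['d' - '0'] = 0x4d, ['e' - '0'] = 0x4e, ['f' - '0'] = 0x4f,
    };
    unsigned idx = (unsigned char)c - (unsigned)'0'; // wraps to a huge value for c below '0'
    if (idx > (unsigned)('f' - '0'))
        return 0x00;                                 // default case
    return table[idx];
}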

Compilers are bad at optimizing intrinsics, and this is definitely a case of premature optimization. Why is this function slow? Any mainstream compiler at these optimization levels is going to turn the switch statement into a jump table and, where possible, resolve the answer at compile time. You should stick with the switch statement for readability, portability, and performance for such a small operation.

Bryan
  • It's not about compile time ... I test it by reading the contents of a file, so both functions are tested at run time ... but I'm sure SSE2 can be faster, I just don't know what the problem with my SSE2 function is. – Jason Aug 06 '19 at 17:15
  • And still the jump table will be faster. I was just stating that the result can be, and often is, determined at compile time. – Bryan Aug 06 '19 at 17:17
  • SIMD code is likely to be faster when there are a large number of cases, but there aren't nearly enough here. The switch statement will also benefit performance-wise due to the branch predictor. – Bryan Aug 06 '19 at 17:21
  • Compilers aren't "bad at optimizing intrinsics"; the compiler makes pretty good asm given that C source. It's just not the most efficient way to implement the desired output. What makes this fast isn't a jump table but a *data* lookup table. A jump table would suffer mispredicts and be slower than SSE2. – Peter Cordes Aug 06 '19 at 21:15
  • Compilers are generally bad at optimizing intrinsics, as I said in my answer, because there is not much they can optimize. You are right that a data lookup table is an even faster optimization a compiler might make, although a jump table of this size should still be faster on average than SSE2. – Bryan Aug 06 '19 at 21:28
  • If you want to see compilers be bad at optimizing intrinsics, look at gcc or older clang with ARM NEON intrinsics. gcc and especially clang are quite good with x86 intrinsics. e.g. clang's shuffle optimizer can totally transform your shuffles into different instructions. I'd say that `switch` is kind of a special case where compilers have *many* clever tricks to avoid just compiling it into a jump table. e.g. if many cases do the same thing, it might just check a bitmap. Including using the bitmap as an immediate. A jump table would be bad for unpredictable patterns. – Peter Cordes Aug 07 '19 at 01:59
  • I still wouldn't call it "bad" to compile the SSE2 version mostly as written. That's the programmer's fault for using SSE2 to process only 1 letter at a time instead of 16, given the simple mapping they want to implement. – Peter Cordes Aug 07 '19 at 02:00