Missed optimization with string_view::find_first_of

Question

Update: relevant GCC bug report: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103798

I tested the following code:

#include <string_view>

size_t findFirstE_slow(std::string_view sv) {
  return sv.find_first_of("eE");
}

size_t findFirstE_fast(std::string_view sv) {
  auto it{sv.begin()};
  for (; it != sv.end() && *it != 'e' && *it != 'E'; ++it)
    ;
  return it == sv.end() ? std::string_view::npos : size_t(it - sv.begin());
}

quick-bench test: https://quick-bench.com/q/dSU3EBzI8MtGOFn_WLpK3ErT3ok

Compiler Explorer output: https://godbolt.org/z/eW3sx61vz

Both findFirstE_slow() and firstFirstE_fast() functions are meant to do the same thing, but findFirstE_slow() runs significantly slower (at least 5x on the quick-bench test).

Here's the assembly output for x86-64 gcc (trunk) -std=c++20 -O3.

findFirstE_slow():

.LC0:
        .string "eE"
findFirstE_slow(std::basic_string_view<char, std::char_traits<char> >):
        push    r12
        push    rbp
        push    rbx
        test    rdi, rdi
        je      .L4
        mov     rbx, rdi
        mov     rbp, rsi
        xor     r12d, r12d
        jmp     .L3
.L8:
        add     r12, 1
        cmp     rbx, r12
        je      .L4
.L3:
        movsx   esi, BYTE PTR [rbp+0+r12]
        mov     edx, 2
        mov     edi, OFFSET FLAT:.LC0
        call    memchr
        test    rax, rax
        je      .L8
        mov     rax, r12
        pop     rbx
        pop     rbp
        pop     r12
        ret
.L4:
        mov     r12, -1
        pop     rbx
        pop     rbp
        mov     rax, r12
        pop     r12
        ret

findFirstE_fast():

findFirstE_fast(std::basic_string_view<char, std::char_traits<char> >):
        add     rdi, rsi
        cmp     rdi, rsi
        je      .L13
        mov     rax, rsi
        jmp     .L12
.L15:
        add     rax, 1
        cmp     rdi, rax
        je      .L13
.L12:
        movzx   edx, BYTE PTR [rax]
        and     edx, -33
        cmp     dl, 69
        jne     .L15
        sub     rax, rsi
        ret
.L13:
        mov     rax, -1
        ret

Interestingly, findFirstE_slow() calls memchr("eE", *current_char, 2) for every character in sv. On the other hand, findFirstE_fast() does what we would reasonably expect, by comparing each character in sv with 'e' and 'E'.

Clang generates similar output.

Question: Is there a missed optimization here for short strings like the one in my test? Am I missing something to get GCC to generate faster code?

Update: this has been addressed for small constant strings in this patch: https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=c6cf555a88f8ffb8ec85c2b94fe656ab217260ea — zwliew, Sep 28 '22 at 07:57

score 5 · Accepted Answer · answered Dec 21 '21 at 11:08

libstdc++'s std::string_view::find_first_of looks something like:

size_type find_first_of(std::string_view v, std::size_t pos = 0) {
    if (v.empty()) return npos;
    for (; pos < size(); ++pos) {
        const char_type* p = traits_type::find(v.data(), v.size(), this->data()[pos]);
        if (p) return pos;
    }
    return npos;
}

You can see how traits_type::find gets transformed into memchr.

The crux of the issue is that memchr("eE", this->data()[pos], 2) != nullptr isn't compiled the same way as this->data()[pos] == 'e' || this->data()[pos] == 'E', even though the latter is much more efficient.

You can check this by trying to compile this:

constexpr unsigned char characters[] = "eE";

bool a(unsigned char* p) {
    return __builtin_memchr(characters, *p, 2);
}

bool b(unsigned char* p) {
    return *p == characters[0] || *p == characters[1];
}

This is a missed optimisation, but you can hint to the compiler to not use memchr with a custom traits type:

struct char_traits : std::char_traits<char> {
    static constexpr const char_type* find(const char_type* p, std::size_t count, const char_type& ch) {
        if (__builtin_constant_p(count) && count < 5) {
            switch (count) {
                case 0: return nullptr;
                case 1: return ch == *p ? p : nullptr;
                case 2: return ch == *p ? p : ch == *++p ? p : nullptr;
                case 3: return ch == *p ? p : ch == *++p ? p : ch == *++p ? p : nullptr;
                case 4: return ch == *p ? p : ch == *++p ? p : ch == *++p ? p : ch == *++p ? p : nullptr;
            }
        }
        return std::char_traits<char>::find(p, count, ch);
    }
};

using string_view = std::basic_string_view<char, char_traits>;

size_t findFirstE_slow(string_view sv) {
  return sv.find_first_of(characters);
}

// Also your "fast" version needs to return
//    return it == sv.end() ? string_view::npos : size_t(it - sv.begin());
// to be equivalent

(https://godbolt.org/z/bhPPxjboE)

And https://quick-bench.com/q/QVxVTxGEagUUCPuhFi9T8wjI1qQ says the slow version is now only 1.3x slower. Using a larger string (https://quick-bench.com/q/el0ukDywBNMoGsEb33PM_g4WUaY; 8000 chars before an 'e'), the difference is mostly unnoticeable.

The main difference now is that one iterates over indices and the other over pointers (returning the difference at the end). The two differing instructions in the assembly are movzx edx, BYTE PTR [rsi+rax] and movzx edx, BYTE PTR [rax] sub rax, rsi, where you should find the second version is ever so slightly faster (especially asymptotically, since the subtraction happens outside the loop)

Kind of a shame that there is no ```mempbrk``` equivalent to ```strpbrk``` which is really what this function was meant to be but that's not usable with string views. — Homer512, Dec 21 '21 at 14:09
Thanks for the modification to the `findFirstE_fast()`; I forogt about that. — zwliew, Dec 21 '21 at 14:20
Would it be reasonable to expect GCC to optimize `findFirstE_fast_index()` and `findFirstE_fast()` to the same assembly code? — zwliew, Dec 21 '21 at 14:28
@zwliew Probably not. Even though they are equivalent semantically, I've done some reading and it's actually not straightforward which is more performant: `ptr[idx++]` has a constant base pointer, which might be faster on machines that can offset and dereference quickly, but might be slower in other cases (compared to `*ptr++`). So a compiler probably wouldn't change one to the other, and it's up to the code writer to choose if they want indices or pointers. And the performance difference will probably be in fractions of a cycle, so it probably won't become a real bottleneck. — Artyer, Dec 22 '21 at 08:34

Missed optimization with string_view::find_first_of

1 Answers1