using -march switch for gcc does not make a difference in terms of run-time speed

Question

I built a small program (~1000 LOC) using GCC 11.1 and ran it for many iterations both with and without enabling -march=native but overall there was no difference in terms of program execution time (measured in milliseconds). But why? Because it's single-threaded? Or is my stone age hardware (1st gen i5, Westmere microarchitecture with no AVX stuff) not capable enough?

A few lines from my Makefile:

CXX = g++
CXXFLAGS = -c -std=c++20 -Wall -Wextra -Wpedantic -Wconversion -Wshadow -O3 -march=native -flto
LDFLAGS = -O3 -flto

Here (Compiler Explorer) is a free function from the program for which GCC does not generate SSE instructions:

[[ nodiscard ]] size_t
tokenize_fast( const std::string_view inputStr, const std::span< std::string_view > foundTokens_OUT,
               const size_t expectedTokenCount ) noexcept
{
    size_t foundTokensCount { };

    if ( inputStr.empty( ) ) [[ unlikely ]]
    {
        return foundTokensCount = 0;
    }

    static constexpr std::string_view delimiter { " \t" };

    size_t start { inputStr.find_first_not_of( delimiter ) };
    size_t end { };

    for ( size_t idx { }; start != std::string_view::npos && foundTokensCount < expectedTokenCount; ++idx )
    {
        end = inputStr.find_first_of( delimiter, start );
        foundTokens_OUT[ idx ] = inputStr.substr( start, end - start );
        ++foundTokensCount;
        start = inputStr.find_first_not_of( delimiter, end );
    }

    if ( start != std::string_view::npos )
    {
        return std::numeric_limits<size_t>::max( );
    }

    return foundTokensCount;
}

I want to know why. Is it because such code can't be vectorized?

Another thing I want to mention is that the size of the final executable did not change at all. I even tried -march=westmere and -march=alderlake to see if it makes any difference in size, but GCC generated an executable of the same size.

Recommend adding a brief run-down of the nature of the program. Could be it just doesn't make use of anything that takes advantage a specific architecture. I'll leave that to the experts, but I'm pretty sure the experts will need more information. — user4581301, Apr 13 '22 at 21:17
The compiler is neither omnipotent nor magic, maybe your code simply isn't any more optimisable with the available instructions than without or possibly the compiler simply doesn't know how to do the optimisations. Without a [mre] it's difficult to help — Alan Birtles, Apr 13 '22 at 21:18
make sure that the flags are actually passed to gcc. use `make VERBOSE=1` — Aziz, Apr 13 '22 at 21:18
@Alan Birtles Yes but it's difficult to make an MRE for this. Maybe I can write a small MRE and edit the question. Let's see. — digito_evo, Apr 13 '22 at 21:22
Before you start optimising you must profile your code, then you'll know which bits are slow and you can concentrate your efforts there, that profiling would also show what is the critical part of your code to post in the question — Alan Birtles, Apr 13 '22 at 21:24
@Aziz I think it does see the flags cause when compiled with e.g. *alderlake*, my system does not execute the app and it says ***Illegal instruction (core dumped)***. — digito_evo, Apr 13 '22 at 21:25
@Alan Birtles Take a look at [this](https://godbolt.org/z/EWrbja8G1) function from my program. Simply put, it's a program that mostly deals with input/output and does lots of string processing (splitting, validating, converting to integer, printing). Not many arithmetic operations. And GCC doesn't generate any kind of SIMD instructions for the said function. Is it because of the nature of it? — digito_evo, Apr 13 '22 at 22:26
I think you should be specifying `-march=native` as part of LDFLAGS as well, so `-flto` is targeting the same machine. But yeah, it's quite possible that `-mtune=generic` makes the same tuning decisions as `-march=native`, and that there's nothing that benefits from anything more than SSE2. Your CPU supports SSE4.2 and popcnt, but baseline is already SSE2, same vector width just missing some instructions. — Peter Cordes, Apr 14 '22 at 01:13

score 1 · Accepted Answer · answered Apr 14 '22 at 01:25

I think you should be specifying -march=native as part of LDFLAGS as well, so -flto is targeting the same machine.

But it seems your code-gen is respecting your specified arch, since you say -march=alderlake makes code that crashes with SIGILL, probably on an AVX encoding of a vector instruction.

It's quite possible that -mtune=generic makes the same tuning decisions as -march=native, and that there's nothing that benefits from anything more than SSE2. Your CPU supports SSE4.2 and popcnt, but baseline for x86-64 is already SSE2, same vector width just missing some instructions, especially for dword and qword element sizes (like packed min/max).
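One way to check what a given -march setting actually enables is to look at the feature macros GCC and clang predefine. The following is a minimal sketch, not code from the question: `enabled_isa_extensions` is a hypothetical helper name, and the list of macros is a small selection rather than a complete one.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: lists the x86 extensions the compiler was allowed to
// use for this translation unit, based on GCC/clang predefined macros.
// The result depends on the -march / -m<feature> flags used at compile time.
std::vector<std::string> enabled_isa_extensions()
{
    std::vector<std::string> out;
#if defined(__SSE2__)
    out.emplace_back("SSE2");    // baseline for x86-64
#endif
#if defined(__SSE3__)
    out.emplace_back("SSE3");
#endif
#if defined(__SSSE3__)
    out.emplace_back("SSSE3");
#endif
#if defined(__SSE4_1__)
    out.emplace_back("SSE4.1");
#endif
#if defined(__SSE4_2__)
    out.emplace_back("SSE4.2");
#endif
#if defined(__POPCNT__)
    out.emplace_back("POPCNT");
#endif
#if defined(__AVX__)
    out.emplace_back("AVX");
#endif
#if defined(__AVX2__)
    out.emplace_back("AVX2");
#endif
    return out;
}
```

Printing the result from a build with and without -march=native shows which extensions the flag added on top of the baseline. On a Westmere-class CPU one would expect the AVX entries to be absent either way.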

GCC/clang can't auto-vectorize search loops (only loops where the trip-count is known at runtime before the first iteration), so inputStr.find_first_of either compiles to a one-byte-at-a-time search, or calls memchr, which only benefits from SSE2 anyway but can dispatch dynamically based on CPU features since it's in a shared library.
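To illustrate the distinction, here is a small sketch of the two loop shapes. Both function names are hypothetical and not taken from the question's code. The first loop always runs `s.size()` iterations, so its trip count is known before it starts. The second returns at the first match, so its trip count depends on the data, which is the "search loop" case described above.

```cpp
#include <cstddef>
#include <string_view>

// Countable loop: the trip count (s.size()) is known before the first
// iteration, so this shape is a candidate for auto-vectorization.
std::size_t count_delimiters(std::string_view s) noexcept
{
    std::size_t n = 0;
    for (const char c : s)
    {
        n += (c == ' ' || c == '\t');
    }
    return n;
}

// Search loop: exits at the first match, so the number of iterations is
// data-dependent. This is the same shape as find_first_of.
std::size_t find_first_delimiter(std::string_view s) noexcept
{
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        if (s[i] == ' ' || s[i] == '\t')
        {
            return i;
        }
    }
    return std::string_view::npos;
}
```

Compiling both with -O3 on Compiler Explorer and comparing the asm is a quick way to see whether the compiler version at hand vectorizes either of them.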

(Glibc overloads the dynamic linking process with a "resolver" function that decides which implementation of memchr is best on the current machine, either SSE2 or AVX2. Both versions are hand-written asm, for example the SSE2 version's source. A few functions like strstr have SSE4.2 versions that your CPU can take advantage of, but this choice doesn't depend on -march compile-time settings, purely on the run-time dynamic linker + glibc.)
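The same idea of choosing an implementation at run time, rather than via -march, can be sketched in user code with the GCC/clang builtins `__builtin_cpu_init` and `__builtin_cpu_supports`. This is not how glibc's resolvers are implemented; it is only an illustration of run-time feature detection, and `best_simd_level` is a hypothetical name.

```cpp
#include <string>

// Hypothetical helper: reports the best SIMD level the *running* CPU
// supports, independent of the -march used to compile this file.
std::string best_simd_level()
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();  // populate the CPU feature data before querying
    if (__builtin_cpu_supports("avx2"))
    {
        return "avx2";
    }
    if (__builtin_cpu_supports("sse4.2"))
    {
        return "sse4.2";
    }
    if (__builtin_cpu_supports("sse2"))
    {
        return "sse2";
    }
    return "scalar";
#else
    // Not an x86 target or not a GCC-compatible compiler: no detection here.
    return "scalar";
#endif
}
```

A dispatcher would call something like this once and store a function pointer to the matching implementation. On the Westmere CPU from the question, this sketch would be expected to report "sse4.2".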

If you want to see where your program is spending most of its time, use perf record ./a.out / perf report -Mintel (the default is AT&T syntax disassembly; I prefer Intel).

If it's in library functions, different tuning options and new instructions available probably aren't helping your main code. If it's in your program proper, not libs, then apparently the baseline instruction-set for x86-64 and the "generic" tuning options are fine, or GCC doesn't know how to get any use out of SSSE3 / SSE4.x for your code.

I didn't look much at what your code is doing to see what manual vectorization might be possible.
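For reference, a manually vectorized delimiter search could look roughly like the sketch below. It is an assumption about what might help, not something measured against the question's program: `find_space_or_tab` is a hypothetical replacement for `find_first_of( " \t" )`, it checks 16 bytes per step using SSE2 intrinsics, and it falls back to a scalar loop for the tail or on non-x86 targets. Since it needs only SSE2, it would not depend on -march=native either.

```cpp
#include <cstddef>
#include <string_view>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Hypothetical helper: index of the first ' ' or '\t' at or after pos,
// or std::string_view::npos if there is none.
inline std::size_t find_space_or_tab(std::string_view s, std::size_t pos = 0) noexcept
{
    if (pos >= s.size())
    {
        return std::string_view::npos;
    }
    std::size_t i = pos;
#if defined(__SSE2__)
    const __m128i space = _mm_set1_epi8(' ');
    const __m128i tab = _mm_set1_epi8('\t');
    // Process full 16-byte chunks; unaligned loads are fine here.
    for (; i + 16 <= s.size(); i += 16)
    {
        const __m128i chunk =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(s.data() + i));
        const __m128i hit = _mm_or_si128(_mm_cmpeq_epi8(chunk, space),
                                         _mm_cmpeq_epi8(chunk, tab));
        const int mask = _mm_movemask_epi8(hit);  // one bit per matching byte
        if (mask != 0)
        {
            // Lowest set bit is the first matching byte in this chunk.
            return i + static_cast<std::size_t>(
                           __builtin_ctz(static_cast<unsigned>(mask)));
        }
    }
#endif
    // Scalar tail (and the whole search when SSE2 is unavailable).
    for (; i < s.size(); ++i)
    {
        if (s[i] == ' ' || s[i] == '\t')
        {
            return i;
        }
    }
    return std::string_view::npos;
}
```

Whether this beats the library's find_first_of depends on token lengths. For short tokens the setup cost may outweigh the gain, so it is worth profiling before adopting anything like it.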

Thanks for the tips. I will consider `perf record` to see what info I can get out of it. Also, if you want to have a look at the code, you can check it [here](https://github.com/zencatalyst/peyknowruzi) and just navigate to the CharMatrix.cpp and Util.cpp files, since most of the important stuff is in those files. — digito_evo, Apr 17 '22 at 19:10
