I built a small program (~1000 LOC) with GCC 11.1 and ran it for many iterations, both with and without -march=native,
but there was no overall difference in execution time (measured in milliseconds). Why? Is it because the program is single-threaded? Or is my stone-age hardware (1st gen i5, Westmere microarchitecture, no AVX) simply not capable enough?
A few lines from my Makefile:
CXX = g++
CXXFLAGS = -c -std=c++20 -Wall -Wextra -Wpedantic -Wconversion -Wshadow -O3 -march=native -flto
LDFLAGS = -O3 -flto
Here (Compiler Explorer) is a free function from the program for which GCC does not generate SSE instructions:
[[ nodiscard ]] size_t
tokenize_fast( const std::string_view inputStr, const std::span< std::string_view > foundTokens_OUT,
               const size_t expectedTokenCount ) noexcept
{
    size_t foundTokensCount { };

    if ( inputStr.empty( ) ) [[ unlikely ]]
    {
        return foundTokensCount = 0;
    }

    static constexpr std::string_view delimiter { " \t" };

    size_t start { inputStr.find_first_not_of( delimiter ) };
    size_t end { };

    for ( size_t idx { }; start != std::string_view::npos && foundTokensCount < expectedTokenCount; ++idx )
    {
        end = inputStr.find_first_of( delimiter, start );
        foundTokens_OUT[ idx ] = inputStr.substr( start, end - start );
        ++foundTokensCount;
        start = inputStr.find_first_not_of( delimiter, end );
    }

    if ( start != std::string_view::npos )
    {
        return std::numeric_limits<size_t>::max( );
    }

    return foundTokensCount;
}
Why is that? Is it simply not possible to vectorize this kind of code?
Another thing worth mentioning: the size of the final executable did not change at all. I even tried -march=westmere and -march=alderlake to see if it makes any difference in size, but GCC produced an executable of the same size each time.