Is this g++ 'illegal instruction error' due to the architecture of the CPU that was used to build GCC?

Question

We have a centos-based docker image that uses gcc 5.4 to build a large C++ code base. The docker image builds and installs gcc from source. Due to some data loss in our private docker registry, we had to rebuild/push this docker image back up to our registry, and started seeing an issue with local builds that use this docker image.

The error we are seeing is:

/usr/include/c++/5.4.0/limits:1601:7: internal compiler error: Illegal instruction
       max() _GLIBCXX_USE_NOEXCEPT { return __FLT_MAX__; }
       ^
0xa4f0cf crash_signal
        ../../gcc-5.4.0/gcc/toplev.c:383

My theory is that this error is due to the architecture of the underlying CPU that is running the build, since we build GCC from source.

Previously, we had a CI infrastructure that was based around Xeon E5 v3 CPUs (Haswell architecture). The build of this docker image was originally done on one of these CI machines and as such worked fine on local Haswell development boxes. Our CI infrastructure has since migrated to using Xeon Platinum CPUs (Skylake architecture). When I rebuilt the image, I did so on one of our new Skylake boxes.

Since I have a newer dev box, I have a Broadwell-based CPU and am unable to reproduce the issue locally. Our CI builds are working perfectly fine. The user getting this error locally has a Haswell CPU.

Is my theory sound? I have asked the user to build our docker image locally on their CPU and test the result, but is there a way to work around this more generically?

I've come across this answer that pointed me at this documentation which states I can specify the processor architecture myself via -march=***. My ideas stemming from this are:

Set -march=haswell when building GCC to prevent newer instruction sets from being enabled
Set -mno-*** when building GCC for the instruction set extensions that aren't available on Haswell, but exist on Broadwell/Skylake.

For reference, the output of lscpu had these Broadwell flags that weren't present on the Haswell box (which have associated -mno-*** flags):

3dnowprefetch
hle
rtm
rdseed
adx
smap
arch_capabilities
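
A list like the one above can be produced by diffing the flag sets of the two machines. The following is only a minimal sketch: the function name `flags_only_in_first` and the sample flag strings are made up for illustration, and on a real box the inputs would come from the `flags` line of `/proc/cpuinfo` or from `lscpu`.

```shell
# Print every flag that appears in the first list but not in the second.
# Both arguments are space-separated flag lists on a single line.
flags_only_in_first() {
  for flag in $1; do
    case " $2 " in
      *" $flag "*) ;;                  # present on both machines, skip
      *) printf '%s\n' "$flag" ;;      # only on the first machine
    esac
  done
}

# Sample data (hypothetical, shortened). On a real machine one could use:
#   grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2
newer_box="fpu sse4_2 avx2 rdseed adx"
older_box="fpu sse4_2 avx2"

flags_only_in_first "$newer_box" "$older_box"
```

Only the flags that the older box lacks are printed, one per line, which makes it easier to see which instruction set extensions could be involved.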

Is it worth testing whether either of these ideas addresses the issue? I'm hoping to get some external input since the development loop for this docker build is pretty lengthy, and I honestly have no idea if these -m flags will resolve the problem.

Also for reference, here is how we are building gcc:

# build/install gcc
RUN tar xvf /tmp/archive/gcc-5.4.0.tar.gz && \
  mkdir gcc-build && \
  pushd gcc-build && \
  ../gcc-5.4.0/configure --prefix=/usr --enable-languages=c,c++,fortran --disable-multilib --with-gmp=/usr --with-mpfr=/usr --with-mpc=/usr && \
  make -j32 && \
  popd && \
  yum remove -y gcc gcc-c++ gcc-gfortran && \
  pushd gcc-build && \
  make install && \
  popd && \
  rm -rf gcc-build gcc-5.4.0
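
For the first idea, one possible way to pin a baseline is GCC's `--with-arch` configure option, which sets the default `-march` of the compiler being built. The sketch below only assembles the configure arguments from the build step above and adds that option. The helper name `configure_args` and the variable `GCC_BASELINE_ARCH` are hypothetical, and it is an assumption that GCC 5.4 accepts `haswell` for `--with-arch` the same way it does for `-march`. Whether this changes the crash at all is untested.

```shell
# Hypothetical baseline: the oldest CPU the image has to run on.
GCC_BASELINE_ARCH=haswell

# Emit the configure arguments used above, plus --with-arch=<baseline>.
configure_args() {
  printf '%s ' \
    --prefix=/usr \
    --enable-languages=c,c++,fortran \
    --disable-multilib \
    --with-gmp=/usr --with-mpfr=/usr --with-mpc=/usr \
    "--with-arch=$1"
}

# Show the resulting argument list; in the Dockerfile this would become:
#   ../gcc-5.4.0/configure $(configure_args "$GCC_BASELINE_ARCH")
configure_args "$GCC_BASELINE_ARCH"
```

Keeping the baseline in one variable means the image can be rebuilt for a different minimum CPU without touching the rest of the build step.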

@MatthieuBrucher: It's the expected direction. "When I rebuilt the image, I did so on one of our new Skylake boxes. The user getting this error locally has a Haswell CPU." — Ben Voigt, Nov 30 '18 at 19:38
The answer is in the question. The most likely reason is CPU architecture mismatch, and the rebuild to native target is the first thing which comes to mind. — SergeyA, Nov 30 '18 at 19:45
GCC does not perform native builds by default, for this very reason, and your Dockerfile does not change this. This is something else. Try getting a coredump (running gcc with -dH may help) and look at the faulting instruction. Also note that replacing the system compiler this way is not recommended at all and might actually be the root cause of your problems. — Florian Weimer, Nov 30 '18 at 21:45

Matthieu Brucher · Answer 1 · 2018-11-30T20:29:57.463

3

As indicated on Wikipedia (https://en.wikipedia.org/wiki/List_of_Intel_CPU_microarchitectures), Skylake is newer than Broadwell, which is itself newer than Haswell.

As such, a build done on Skylake may not be runnable on an older CPU, and you should always add -march=haswell to default builds that produce binaries that must run on Haswell and later.

Tweak the architecture for your minimal platform with -march, knowing that you could have numerical differences as you enable additional instruction sets.

You can also use -mtune to specify a target for which your code will be optimized (meaning that the code should be faster on that platform, while still running on anything that satisfies -march). You can mix both, as long as the -march target is no newer than the -mtune target.
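
As a rough sketch of how the two flags could be combined in a build script: the helper `arch_flags` is a hypothetical name, and `haswell`/`skylake` are just the CPUs discussed in this thread. The names accepted by `-march` and `-mtune` depend on the GCC version, so an older compiler may reject `skylake`.

```shell
# Compose the baseline (-march) and tuning (-mtune) flags.
#   $1: oldest CPU the binaries must run on
#   $2: CPU the binaries should be optimized for
arch_flags() {
  printf '%s\n' "-march=$1 -mtune=$2"
}

# Export for the build system to pick up; -O2 is an arbitrary example level.
CXXFLAGS="-O2 $(arch_flags haswell skylake)"
export CXXFLAGS
printf '%s\n' "$CXXFLAGS"
```

With these settings the compiler should only emit instructions available on the `-march` target, while making scheduling and tuning choices for the `-mtune` target.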

edited Nov 30 '18 at 20:29

answered Nov 30 '18 at 19:46

Matthieu Brucher


If most of the boxes are skylake, then maybe `-march=haswell -mtune=skylake` – Ben Voigt Nov 30 '18 at 19:48
Indeed, if the production boxes are Skylake, the `-mtune` makes sense. – Matthieu Brucher Nov 30 '18 at 19:49
Thanks! It's unclear to me if this error is due to the build of gcc or the build of our source code. Should this be added to the docker build process or our build scripts (or both)? – E. Moffat Nov 30 '18 at 20:18
I would advise both. I suppose the tests are done on the box where you build, but if the binaries are then used for something else, then you will have the same problem. – Matthieu Brucher Nov 30 '18 at 20:28
