I've got an MCVE which, on some of my machines, crashes when compiled with g++ version 4.4.7 but works with clang++ version 3.4.2 and g++ version 6.3.

I'd like some help to know whether this comes from undefined behavior in my code or from an actual bug in this ancient version of gcc.

Code

#include <cstdlib>

class BaseType
{
public:
    BaseType() : _present( false ) {}
    virtual ~BaseType() {}

    virtual void clear() {}

    virtual void setString(const char* value, const char* fieldName)
    {
        _present = (*value != '\0');
    }

protected:
    virtual void setStrNoCheck(const char* value) = 0;

protected:
    bool _present;
};

// ----------------------------------------------------------------------------------

class TypeTextFix : public BaseType
{
public:
    virtual void clear() {}

    virtual void setString(const char* value, const char* fieldName)
    {
        clear();
        BaseType::setString(value, fieldName);
        if( _present == false ) {
            return; // commenting this return fix the crash. Yes it does!
        }
        setStrNoCheck(value);
    }

protected:
    virtual void setStrNoCheck(const char* value) {}
};

// ----------------------------------------------------------------------------------

struct Wrapper
{
    TypeTextFix _text;
};

int main()
{
    {
        Wrapper wrapped;
        wrapped._text.setString("123456789012", NULL);
    }
    // if I add a write to stdout here, it does not crash oO
    {
        Wrapper wrapped;
        wrapped._text.setString("123456789012", NULL); // without this line (or any one), the program runs just fine!
    }
}

Compile & run

g++ -O1 -Wall -Werror thebug.cpp && ./a.out
pure virtual method called
terminate called without an active exception
Aborted (core dumped)

This is actually minimal: if one removes any feature of this code, it runs correctly.

Analysis

The code snippet works fine when compiled with -O0. Surprisingly, it still works fine when compiled with -O0 plus every flag that -O1 enables, as listed in the GnuCC documentation.

A core dump is generated from which one can extract the backtrace:

(gdb) bt
#0  0x0000003f93e32625 in raise () from /lib64/libc.so.6
#1  0x0000003f93e33e05 in abort () from /lib64/libc.so.6
#2  0x0000003f98ebea7d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6
#3  0x0000003f98ebcbd6 in ?? () from /usr/lib64/libstdc++.so.6
#4  0x0000003f98ebcc03 in std::terminate() () from /usr/lib64/libstdc++.so.6
#5  0x0000003f98ebd55f in __cxa_pure_virtual () from /usr/lib64/libstdc++.so.6
#6  0x00000000004007b6 in main ()

Feel free to ask for tests or details in the comments. Questions asked so far:

  • Is it the actual code? Yes! it is! byte for byte. I've checked and rechecked.

  • What exact version of GnuCC do you use?

    $ g++ --version
    g++ (GCC) 4.4.7 20120313 (Red Hat 4.4.7-16)
    Copyright (C) 2010 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions.  There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
    
  • Can we see the generated assembly? Yes, here it is on pastebin.com

YSC
  • Is this the _exact_ code that causes the issue? (this error _usually_ occurs when you call a virtual method inside a destructor, which doesn't seem to be the case here). (works fine for me, btw) – Kiril Kirov Jan 23 '17 at 15:52
  • @KirilKirov Yes, I'm positive. After I typed this code in the question, I copied it and pasted it back to my file and recompiled. I'm currently asking a colleague to try it on their machine (same environment). – YSC Jan 23 '17 at 15:53
  • I don't see anything wrong, or even particularly tricky, with this code. It should compile and run. If it fails, I would say with high confidence it's a compiler bug. – Igor Tandetnik Jan 23 '17 at 15:53
  • @YSC - I updated my comment. Please recheck the situation - try exactly this code again and make sure you're not compiling/executing the wrong files. – Kiril Kirov Jan 23 '17 at 15:55
  • I actually happen to have g++ 4.4.6 easily accessible to me and the program does not core dump with that g++ so it looks strongly like a 4.4.7 compiler bug. – Mark B Jan 23 '17 at 15:57
  • Your code works with gcc4.4.7 and gcc4.3.6 on wandbox, so I'd say the very compiler on your machine is broken. – Baum mit Augen Jan 23 '17 at 15:57
  • Nothing seems to be wrong with the code. On a side note, you don't need virtual functions in this case. Try compiling with normal functions and see if you get the error. – Daksh Gupta Jan 23 '17 at 16:01
  • @KirilKirov Done so, colleague on another VM did too. Same result. – YSC Jan 23 '17 at 16:01
  • Interestingly, GCC 6.2 compiles `main` to a no-op: https://godbolt.org/g/UByGsC - this could mask an otherwise latent bug in the compiler or code. FWIW I don't see anything wrong with the code. Of course it is also possible that the compiler bug, if any, has been fixed anyway. – davmac Jan 23 '17 at 16:08
  • Can you dump assembly output somewhere? – Slava Jan 23 '17 at 16:12
  • @Slava [here it is](http://pastebin.com/H6WHi3pK) – YSC Jan 23 '17 at 16:17
  • @davmac That comment confused me. The program doesn't do anything, so the compiler making `main` a no-op seems like a sensible optimisation to me, it doesn't appear to expose any bug in GCC 6. –  Jan 23 '17 at 18:49
  • I'd hardly call 4.4.7 "ancient" – Lightness Races in Orbit Jan 23 '17 at 19:53
  • @hvd I meant that the assembly generated is minimal and performs essentially no operation. Certainly, this is a valid and correct optimisation of the program; I didn't mean to imply that it wasn't. However, this optimisation is not performed by the 4.4.7 compiler (or at least not to the same degree), and the generated code may contain the bug. Hence the optimisation that recognizes that `main` is effectively a no-op may be masking a bug. – davmac Jan 23 '17 at 21:35

2 Answers

This is a Red Hat-specific bug not present in FSF GCC. It is not a problem in your code.

On a system with both CentOS 6's GCC and FSF GCC 4.4.7, I had both compilers generate an assembly listing and compared the two. One difference jumps out:

CentOS 6's GCC generates

movq $_ZTV8BaseType+16, (%rsp)

whereas FSF GCC 4.4.7 generates

movq $_ZTV11TypeTextFix+16, (%rsp)

In other words, one of Red Hat's GCC patches makes it set up the vtable incorrectly. This is part of your main function; you can see it in your own assembly listing shortly after .L48:.

Red Hat applies many patches to its version of GCC, and some of them are patches that affect code generation. Unfortunately, one of them appears to have an unintended side effect.
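To illustrate why the vtable pointer matters here, the following is a minimal sketch of how it is expected to change during construction. The names Base, Derived, seenInCtor and dispatchFollowsConstructionOrder are hypothetical and are not part of the question's code. While the base constructor runs, virtual calls resolve through the base class's vtable; once the derived constructor has run, they resolve through the derived class's vtable. An object left pointing at BaseType's vtable would presumably reach the pure virtual slot for setStrNoCheck, which matches the __cxa_pure_virtual frame in the backtrace.

```cpp
#include <string>

// Hypothetical classes for illustration only; not the types from the question.
struct Base {
    std::string seenInCtor;  // records what name() returned during Base::Base()

    // At this point only Base's vtable is installed, so name() resolves to Base::name().
    Base() { seenInCtor = name(); }
    virtual ~Base() {}

    virtual std::string name() const { return "Base"; }
};

struct Derived : Base {
    // Once Derived's constructor has run, the object's vtable pointer refers to
    // Derived's vtable and this override is the one that gets called.
    virtual std::string name() const { return "Derived"; }
};

// Returns true when dispatch behaves as the language requires:
// Base's version during Base's constructor, Derived's version afterwards.
bool dispatchFollowsConstructionOrder() {
    Derived d;
    const Base& asBase = d;
    return d.seenInCtor == "Base" && asBase.name() == "Derived";
}
```

Calling dispatchFollowsConstructionOrder() from main should return true with a correct compiler. In the miscompiled program, the store of the TypeTextFix vtable address appears to be the step that went missing.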

  • I really wish Red Hat would stop patching like this. – Lightness Races in Orbit Jan 23 '17 at 19:54
  • Checked this, can confirm. – n. m. could be an AI Jan 23 '17 at 20:01
  • @LightnessRacesinOrbit I've heard of Red Hat's gcc 2.96. I can call that one "ancient", can't I? @hvd thanks, I'll test to confirm that tomorrow morning. – YSC Jan 23 '17 at 20:05
  • Defining the constructor of `BaseType` in a separate translation unit does fix (in fact, hide) the bug: both the `BaseType` _and_ `TypeTextFix` virtual tables are written indeed. Thanks a lot! May I edit your answer to include that hack for future reference? – YSC Jan 24 '17 at 08:59
  • @YSC I'm glad you found something that works for you, but I do not understand what is making GCC do this and why your workaround works, and I'd actually prefer not to recommend that to others. It's possible that you're stuck in a position without any options, but generally, my recommended workaround/fix would be to use a compiler that does not suffer from this bug, rather than changing the code. –  Jan 25 '17 at 23:39
  • @hvd of course, you're right, but as I write in my answer, a change in the build process as big as a new compiler is a long process, so for now I'm stuck with this workaround. I won't suggest it to other devs, but if it can help one person I will not have wasted my time. – YSC Jan 26 '17 at 09:38

Though the true solution to this bug would be not to use RedHat GnuCC 4.4.7 (or any RedHat compiler...), we are temporarily stuck with this version.

We did find an alternative: hide the constructor of BaseType from the compiler, which prevents it from over-optimizing it. We did this simply by defining BaseType::BaseType() in a separate translation unit.

Doing so bypasses the g++ bug. We checked that both the BaseType and TypeTextFix virtual table pointers are indeed written to the constructed object before the corresponding constructors are called.
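The shape of that change can be sketched as follows. This is only an illustration under stated assumptions: the accessor present() and the helper presentAfterSet() are hypothetical additions for checking the result, and the constructor definition is shown in the same file only so that the sketch compiles on its own. In the actual workaround the definition sits in its own source file, so the compiler cannot see its body while compiling main.

```cpp
#include <cstddef>

class BaseType
{
public:
    BaseType();              // declaration only: no inline body in the class
    virtual ~BaseType() {}

    virtual void clear() {}

    virtual void setString(const char* value, const char* fieldName)
    {
        _present = (*value != '\0');
    }

    // Hypothetical accessor, added only so the sketch can be checked.
    bool present() const { return _present; }

protected:
    virtual void setStrNoCheck(const char* value) = 0;

protected:
    bool _present;
};

// In the workaround, this definition is moved to a separate translation unit
// (its own .cpp file). It is kept here only to make the sketch self-contained,
// which means this single-file version does NOT hide the body from the compiler.
BaseType::BaseType() : _present( false ) {}

class TypeTextFix : public BaseType
{
public:
    virtual void clear() {}

    virtual void setString(const char* value, const char* fieldName)
    {
        clear();
        BaseType::setString(value, fieldName);
        if( _present == false ) {
            return;
        }
        setStrNoCheck(value);
    }

protected:
    virtual void setStrNoCheck(const char* value) {}
};

// Hypothetical helper: constructs a TypeTextFix, sets a value and reports
// whether the field was marked as present.
bool presentAfterSet(const char* value)
{
    TypeTextFix text;
    text.setString(value, NULL);
    return text.present();
}
```

With a correct compiler, presentAfterSet("123456789012") returns true and presentAfterSet("") returns false. Only the location of the constructor definition differs from the code in the question.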

YSC