I am after some suggestions as to how to go about debugging a significant problem that I cannot reduce to a minimal example.

The problem: I compile my application which links to a number of different libraries. The flags include: -static-libstdc++ -static-libgcc -pipe -std=c++1z -fno-PIC -flto=10 -m64 -O3 -flto=10 -fuse-linker-plugin -fuse-ld=gold -UNDEBUG -lrt -ldl

The compiler is gcc-7.3.0, compiled against binutils-2.30. Boost is compiled with the same flags as the rest of the program, and linked statically.

When the program is linked, I get various warnings of the form "relocation refers to discarded section", both in my own code and in Boost. For instance:

/tmp/ccq2Ddku.ltrans13.ltrans.o:<artificial>:function boost::system::(anonymous namespace)::generic_error_category::message(int) const: warning: relocation refers to discarded section

Then when I run the program, it segfaults on destruction, with the backtrace:

Program received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x00007ffff7345a49 in __run_exit_handlers () from /lib64/libc.so.6
#2  0x00007ffff7345a95 in exit () from /lib64/libc.so.6
#3  0x00007ffff732eb3c in __libc_start_main () from /lib64/libc.so.6
#4  0x000000000049b3e3 in _start ()

The function pointer being called is 0x0, i.e. null.

If I remove -static-libstdc++, the linker warnings and runtime segfault go away.

If I change from c++1z to c++14, the linker warnings and runtime segfault go away.

If I remove -flto, the linker warnings and runtime segfault go away.

If I add "-g" to the compile flags, the linker warnings and runtime segfault go away.

I have tried asking gold for extra debugging, by specifying -Wl,--debug=all, but it tells me seemingly nothing relevant.

If I take a small section of the code that appears relevant, then compile and link it separately against the same Boost libraries (i.e. attempting to produce a minimal example), there are no linker warnings and the program runs to completion without issues.

Help! What can I do to narrow the problem down?

Andrew
  • is `static` and run-time relocation compatible? most likely not. `-fno-PIC` would not be in favor of relocation in `.text`. – Joseph D. Apr 03 '18 at 04:23
  • check which section this symbol is located `function boost::system::(anonymous namespace)::generic_error_category::message` by using `readelf` – Joseph D. Apr 03 '18 at 04:24

1 Answer

This warning is usually indicative of an inconsistency in the contents of a COMDAT group between two compilation units. If the compiler emits a COMDAT group G with symbol A defined in one compilation unit, but emits the same group G with symbols A and B defined in a second compilation unit, the linker will keep group G from the first compilation unit and discard group G from the second. Any references to symbol B from outside the group in the second compilation unit will produce this warning.
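To make the "first copy wins" rule concrete, here is a toy model in Python. It is only a sketch of the behaviour described above, not gold's actual algorithm; the function name `resolve_comdat` and the object names `first.o` and `second.o` are made up for illustration.

```python
def resolve_comdat(objects):
    """Toy model of COMDAT handling: the first group seen for each key is kept.

    `objects` is an ordered list of (object_name, groups, outside_refs), where
    `groups` maps a group key to the set of symbols defined inside that group
    and `outside_refs` is the set of symbols referenced from outside the group.
    """
    kept = {}       # group key -> (object name, symbols) of the copy that was kept
    warnings = []
    for name, groups, outside_refs in objects:
        for key, symbols in groups.items():
            if key not in kept:
                kept[key] = (name, set(symbols))
                continue
            # A later copy of the same group is discarded as a whole. Symbols
            # that exist only in the discarded copy no longer have a definition.
            lost = set(symbols) - kept[key][1]
            for sym in sorted(lost & set(outside_refs)):
                warnings.append(
                    f'{name}: relocation refers to symbol "{sym}" '
                    "defined in discarded section"
                )
    return kept, warnings


# Group G holds only A in the first unit, but A and B in the second unit.
first = ("first.o", {"G": {"A"}}, set())
second = ("second.o", {"G": {"A", "B"}}, {"B"})
kept, warnings = resolve_comdat([first, second])
for message in warnings:
    print(message)
```

In this model the reference to B from `second.o` is the only one left dangling, because A is still defined by the copy of G that was kept.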

The cause is usually a bug in the compiler, and using -flto makes it that much harder to diagnose. In this case, your second compilation unit is the result of link-time optimization (the *.ltrans.o file name). With LTO, it's quite believable that many of the changes you've mentioned will make the problem go away.

The very latest version of gold on the master branch of the binutils git repo has a new [-Wl,]--debug=plugin option, which will save a log and all the temporary .ltrans.o files. Having the log and those files, along with all the original input files (which you can get a list of by adding the [-Wl,]-t option), should help isolate the problem better.

The latest version of gold will also print the symbol referenced by the relocation. For a local symbol, it will show the symbol index; use readelf -s to get more info about the symbol. For a global symbol, it will show the name; you can add the --no-demangle option for the exact name.

If it's a local symbol, the problem is almost certainly the compiler. References from outside a comdat group to a local symbol in the group are strictly forbidden.

If it's a global symbol, it could be either a compiler problem or a one-definition rule (ODR) violation in your sources. You'll need to identify the comdat group in the named object file, find its key symbol, then find the object file that provided the definition kept by the linker (the -y option will help), and compare the symbols defined in those groups by the two objects. These steps should help:

(1) Starting from the error message:

b.o(.data+0x0): warning: relocation refers to symbol "two" defined in discarded section

(2) Look for symbol "two" in b.o:

$ readelf -sW b.o | grep two
     7: 0000000000000008     0 NOTYPE  WEAK   DEFAULT    6 two

The next-to-last field ("6") is the section number where "two" is defined.
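If there are many warnings to chase, this lookup can be scripted. The following is a minimal sketch that assumes the eight-column `readelf -sW` layout shown above; `symbol_section` is a hypothetical helper name, and the sample text is copied from the output in this answer.

```python
def symbol_section(readelf_s_output, symbol):
    """Return the Ndx column (as a string) for `symbol`, or None if not found.

    Assumes rows of the form: Num: Value Size Type Bind Vis Ndx Name.
    Ndx is returned as text because it can also be UND, ABS or COM.
    """
    for line in readelf_s_output.splitlines():
        fields = line.split()
        if len(fields) != 8:
            continue
        # Skip the header row ("Num:") and anything else that is not a symbol row.
        if not fields[0].rstrip(":").isdigit():
            continue
        if fields[7] == symbol:
            return fields[6]
    return None


sample_symbols = """\
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     7: 0000000000000008     0 NOTYPE  WEAK   DEFAULT    6 two
     8: 0000000000000000     0 NOTYPE  WEAK   DEFAULT    6 one
"""
print(symbol_section(sample_symbols, "two"))
```

In practice the text would come from running `readelf -sW` on the object named in the warning, for example via `subprocess.run`.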

(3) Verify that section 6 is in fact a comdat group:

$ readelf -SW b.o
  [Nr] Name              Type            Address          Off    Size   ES Flg Lk Inf Al
  [ 6] .one              PROGBITS        0000000000000000 000058 000018 00 WAG  0   0  1

The "G" in the sh_flags field ("Flg") indicates the section belongs to a comdat group.

(4) Find the comdat group containing the section:

$ readelf -g b.o
COMDAT group section [    1] `.group' [one] contains 1 sections:
   [Index]    Name
   [    6]   .one

This shows us that section 6 is a member of group section 1.
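The same mapping can be pulled out of the `readelf -g` text programmatically. This is a rough sketch that assumes the output looks like the sample above; the regular expressions and the helper name `group_of_section` are my own, so adjust them if your readelf prints the header differently.

```python
import re

# Header line, e.g.: COMDAT group section [    1] `.group' [one] contains 1 sections:
GROUP_HEADER = re.compile(r"COMDAT group section \[\s*(\d+)\]\s+\S+\s+\[(.*)\]\s+contains")
# Member line, e.g.:    [    6]   .one
GROUP_MEMBER = re.compile(r"\s*\[\s*(\d+)\]\s+(\S+)")


def group_of_section(readelf_g_output, section_index):
    """Return (group section index, key symbol name) for a member section."""
    current = None
    for line in readelf_g_output.splitlines():
        header = GROUP_HEADER.match(line)
        if header:
            current = (int(header.group(1)), header.group(2))
            continue
        member = GROUP_MEMBER.match(line)
        if member and current and int(member.group(1)) == section_index:
            return current
    return None


sample_groups = """\
COMDAT group section [    1] `.group' [one] contains 1 sections:
   [Index]    Name
   [    6]   .one
"""
print(group_of_section(sample_groups, 6))
```

The second element of the result is the name readelf prints in brackets, which should be the group's key symbol.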

(5) Find the key symbol for that group:

$ readelf -SW b.o
      [Nr] Name              Type            Address          Off    Size   ES Flg Lk Inf Al
      [ 1] .group            GROUP           0000000000000000 000040 000008 04      7   8  4

The sh_info field ("Inf") tells us the key symbol is symbol #8, which is "one". (That should match the name shown in brackets in step 4.)

$ readelf -sW b.o
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     8: 0000000000000000     0 NOTYPE  WEAK   DEFAULT    6 one

(6) Now you can add the -y one option to your link to find which objects provided a definition of "one":

$ gcc -Wl,-y,one ...
a.o: definition of one
b.o: definition of one

The first one listed (a.o) is the one that gold keeps; it will discard all subsequent comdat groups with the same key symbol.
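With a long link line, the trace output can be sorted into "kept" and "discarded" with a few lines of Python. This sketch assumes the `-y` output has exactly the `file: definition of symbol` form shown above; `kept_definition` is a hypothetical helper name.

```python
def kept_definition(trace_output, symbol):
    """Split `-y` trace lines into (kept object, list of later definers).

    Relies on the rule stated above: the first object that defines the key
    symbol is the one whose comdat group is kept.
    """
    suffix = f": definition of {symbol}"
    definers = [
        line[: -len(suffix)]
        for line in trace_output.splitlines()
        if line.endswith(suffix)
    ]
    if not definers:
        return None, []
    return definers[0], definers[1:]


sample_trace = """\
a.o: definition of one
b.o: definition of one
"""
kept_object, discarded_objects = kept_definition(sample_trace, "one")
print(kept_object, discarded_objects)
```

Lines that do not end in `: definition of <symbol>` are ignored, so the rest of the linker output can be fed in unchanged.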

If you use the same techniques to examine the comdat group that defines "one" in a.o, and compare the symbols that belong to that group with those that belong to the group in b.o, that should give you more clues.

Cary Coutant
  • Cary, thanks for the help. I can't seem to see an lto debug option in the source, and specifying it doesn't seem to do anything? See debug.h from sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob;f=gold/… Am I looking in the correct place? – Andrew Apr 04 '18 at 07:05
  • Sorry, I mistyped the option -- it should be `--debug=plugin`. I've edited my original response to reflect this. – Cary Coutant Apr 05 '18 at 15:37
  • I've just committed a new gold patch to improve the warning message somewhat: it now prints the symbol referenced by the relocation (local symbol index or global symbol name). If it's a local symbol, the problem is almost certainly the compiler. References from outside a comdat group to a local symbol in the group are strictly forbidden. – Cary Coutant Apr 05 '18 at 22:04
  • If it's a global symbol, it could be either a compiler problem or a one-definition rule (ODR) violation in your sources. You'll need to identify the comdat group in the named object file, find its key symbol, then find the object file that provided the definition kept by the linker (the `-y` option will help), and compare the symbols defined in those groups by the two objects. – Cary Coutant Apr 05 '18 at 22:10
  • Thanks Cary. With your help it turned out that the basic_string destructor was UND: `265: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND _ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEED1Ev`. This then led me to gcc bug: `https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81004`. Building gcc from gcc-7-branch fixed the issue. I am still a bit perplexed as to why I get a 'relocation refers to discarded section' warning and a runtime failure, rather than an undefined reference. Thanks so much for the help though, and that debugging info is going to help a lot of people. – Andrew Apr 06 '18 at 04:50
  • You know, I really don't remember why it's only a warning -- I agree that it probably should be emitted as an error. I'll think about it. Glad you're up and running! – Cary Coutant Apr 06 '18 at 05:13