I'm hitting a crash when running Boost.Test test cases on a cluster. The error is: *** glibc detected *** ...myprogram.test: corrupted double-linked list: 0x000000000096b4d0 ***

Running valgrind on this gives me:

==9687== Invalid free() / delete / delete[] / realloc()
==9687==    at 0x4A06016: operator delete(void*) (vg_replace_malloc.c:480)
==9687==    by 0x3A81035D2C: __cxa_finalize (in /lib64/libc-2.12.so) 
==9687==    by 0x721CD05: ??? (in /lib/libboost_unit_test_framework-gcc71-mt-d-1_65_1.so.1.65.1)
==9687==    by 0x72ABF9C: ??? (in /lib/libboost_unit_test_framework-gcc71-mt-d-1_65_1.so.1.65.1)
==9687==    by 0x3A81035991: exit (in /lib64/libc-2.12.so)
==9687==    by 0x3A8101ED23: (below main) (in /lib64/libc-2.12.so)   
==9687==  Address 0x9919d80 is 0 bytes inside a block of size 18 free'd
==9687==    at 0x4A06016: operator delete(void*) (vg_replace_malloc.c:480)
==9687==    by 0x3A81035991: exit (in /lib64/libc-2.12.so)
==9687==    by 0x3A8101ED23: (below main) (in /lib64/libc-2.12.so)   

The stacktrace from GDB looks like this:

#0  0x0000003a81032495 in raise () from /lib64/libc.so.6
#1  0x0000003a81033c75 in abort () from /lib64/libc.so.6
#2  0x0000003a810703a7 in __libc_message () from /lib64/libc.so.6
#3  0x0000003a81075dee in malloc_printerr () from /lib64/libc.so.6
#4  0x0000003a810761f3 in malloc_consolidate () from /lib64/libc.so.6
#5  0x0000003a81078c18 in _int_free () from /lib64/libc.so.6
#6  0x00000000005feae8 in boost::checked_array_delete<char> (x=0x991a20 "\210\350\070\201:") at /include/boost-1_65_1/boost/core/checked_delete.hpp:41
#7  0x00000000005fbd21 in boost::scoped_array<char>::~scoped_array (this=0x94bd80, __in_chrg=<optimized out>) at /include/boost-1_65_1/boost/smart_ptr/scoped_array.hpp:69
#8  0x00000000005f9d36 in boost::execution_monitor::~execution_monitor (this=0x94bd60, __in_chrg=<optimized out>)
    at /include/boost-1_65_1/boost/test/execution_monitor.hpp:316
#9  0x00000000005fbd3c in boost::unit_test::unit_test_monitor_t::~unit_test_monitor_t (this=0x94bd60, __in_chrg=<optimized out>)
    at /include/boost-1_65_1/boost/test/unit_test_monitor.hpp:33
#10 0x0000003a81035992 in exit () from /lib64/libc.so.6
#11 0x0000003a8101ed24 in __libc_start_main () from /lib64/libc.so.6
#12 0x00000000005f5b59 in _start ()

This happens whenever an uncaught exception is thrown, including test failures, and on some other (currently unknown) occasions. The crash on exception is 100% reproducible.

The program itself seems fine, because locally it runs without any such crashes. So I assume the cause is an incompatibility between some modules on the cluster.

To address this, I recompiled Boost and OpenBLAS. I'm still using a couple of other libraries (libSSH2, GPI2, HDF5) that I don't want to rebuild just to test each of them, since that would take a lot of time. They don't appear in the ldd output, so I assume they are linked statically (I'm not the author of the tests), and I think they are unlikely to cause problems:

    linux-vdso.so.1 =>
    libpthread.so.0 => /lib64/libpthread.so.0
    librt.so.1 => /lib64/librt.so.1
    libboost_filesystem-gcc71-mt-d-1_65_1.so.1.65.1 => /lib/libboost_filesystem-gcc71-mt-d-1_65_1.so.1.65.1
    libboost_program_options-gcc71-mt-d-1_65_1.so.1.65.1 => /lib/libboost_program_options-gcc71-mt-d-1_65_1.so.1.65.1
    libboost_coroutine-gcc71-mt-d-1_65_1.so.1.65.1 => /lib/libboost_coroutine-gcc71-mt-d-1_65_1.so.1.65.1
    libboost_context-gcc71-mt-d-1_65_1.so.1.65.1 => /lib/libboost_context-gcc71-mt-d-1_65_1.so.1.65.1
    libboost_iostreams-gcc71-mt-d-1_65_1.so.1.65.1 => /lib/libboost_iostreams-gcc71-mt-d-1_65_1.so.1.65.1
    libboost_regex-gcc71-mt-d-1_65_1.so.1.65.1 => /lib/libboost_regex-gcc71-mt-d-1_65_1.so.1.65.1
    libboost_thread-gcc71-mt-d-1_65_1.so.1.65.1 => /lib/libboost_thread-gcc71-mt-d-1_65_1.so.1.65.1
    libboost_date_time-gcc71-mt-d-1_65_1.so.1.65.1 => /lib/libboost_date_time-gcc71-mt-d-1_65_1.so.1.65.1
    libboost_chrono-gcc71-mt-d-1_65_1.so.1.65.1 => /lib/libboost_chrono-gcc71-mt-d-1_65_1.so.1.65.1
    libboost_atomic-gcc71-mt-d-1_65_1.so.1.65.1 => /lib/libboost_atomic-gcc71-mt-d-1_65_1.so.1.65.1
    libboost_system-gcc71-mt-d-1_65_1.so.1.65.1 => /lib/libboost_system-gcc71-mt-d-1_65_1.so.1.65.1
    libboost_serialization-gcc71-mt-d-1_65_1.so.1.65.1 => /lib/libboost_serialization-gcc71-mt-d-1_65_1.so.1.65.1
    libdl.so.2 => /lib64/libdl.so.2
    libssl.so.10 => /usr/lib64/libssl.so.10
    libgssapi_krb5.so.2 => /lib64/libgssapi_krb5.so.2
    libkrb5.so.3 => /lib64/libkrb5.so.3
    libcom_err.so.2 => /lib64/libcom_err.so.2
    libk5crypto.so.3 => /lib64/libk5crypto.so.3
    libresolv.so.2 => /lib64/libresolv.so.2
    libcrypto.so.10 => /usr/lib64/libcrypto.so.10
    libz.so.1 => /lib64/libz.so.1
    libstdc++.so.6 => /sw/global/compilers/gcc/7.1.0/lib64/libstdc++.so.6
    libm.so.6 => /lib64/libm.so.6
    libgcc_s.so.1 => /sw/global/compilers/gcc/7.1.0/lib64/libgcc_s.so.1
    libc.so.6 => /lib64/libc.so.6
    /lib64/ld-linux-x86-64.so.2
    libbz2.so.1 => /lib64/libbz2.so.1
    liblzma.so.0 => /usr/lib64/liblzma.so.0
    libicudata.so.42 => /usr/lib64/libicudata.so.42
    libicui18n.so.42 => /usr/lib64/libicui18n.so.42
    libicuuc.so.42 => /usr/lib64/libicuuc.so.42
    libkrb5support.so.0 => /lib64/libkrb5support.so.0
    libkeyutils.so.1 => /lib64/libkeyutils.so.1
    libselinux.so.1 => /lib64/libselinux.so.1

From my findings, I think the second free is the "correct" one, as it's the smart pointer freeing its memory. So the first delete is the wrong one, but it comes from inside exit, which does not help me.
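For context, `exit` runs the destructors of objects with static storage duration (GCC registers them through `__cxa_atexit`), so a delete whose only visible caller is `exit` is typically one of those destructors. A minimal sketch of that behaviour, unrelated to the library under test; `demo::Tracer`, `demo::instance` and `demo::destroyed` are made-up names for illustration:

```cpp
#include <cstdio>
#include <string>

namespace demo {

// Set by the destructor so that an atexit handler can observe it.
bool destroyed = false;

struct Tracer {
    // Long enough to defeat the small-string optimisation, so the
    // destructor really releases heap memory.
    std::string payload = "a string that lives on the heap";

    ~Tracer() {
        destroyed = true;
        std::puts("~Tracer called during exit()");
    }
};

// Function-local static: constructed on the first call and destroyed
// while exit() runs, after main() has returned.
Tracer& instance() {
    static Tracer tracer;
    return tracer;
}

}  // namespace demo
```

Calling `demo::instance()` once from `main` and setting a breakpoint on `demo::Tracer::~Tracer` should show `exit` among the callers, which is the same shape as the first free in the valgrind report.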

How can I find out why and how that pointer is freed twice? Note that I don't have root on the cluster, so debug symbols for GCC's libraries are not available.
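One approach that needs neither root nor debug symbols is to replace the global `operator new` and `operator delete` in the test executable and keep a table of live pointers, so that a delete of a pointer that is not live can be caught before it reaches the heap. The following is only a sketch under loud assumptions: it is single-threaded (no locking), the table has a fixed capacity, it only sees allocations that go through the replaced operators, and every name in the `trace` namespace is hypothetical.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <new>

namespace trace {

constexpr std::size_t kSlots = 4096;  // capacity of the live-pointer table
void* live[kSlots];                   // zero-initialised before any code runs
bool overflowed = false;              // true once a pointer did not fit
std::size_t suspicious_deletes = 0;   // deletes of pointers that are not live

// Record a freshly allocated pointer in the first free slot.
void remember(void* p) {
    for (void*& slot : live) {
        if (slot == nullptr) { slot = p; return; }
    }
    overflowed = true;
}

// Remove a pointer from the table; returns false if it was not live.
bool forget(void* p) {
    for (void*& slot : live) {
        if (slot == p) { slot = nullptr; return true; }
    }
    return false;
}

// Convenient place for a debugger breakpoint.
void report(void* p) {
    ++suspicious_deletes;
    std::fprintf(stderr, "delete of non-live pointer %p\n", p);
}

}  // namespace trace

void* operator new(std::size_t size) {
    void* p = std::malloc(size != 0 ? size : 1);
    if (p == nullptr) throw std::bad_alloc();
    trace::remember(p);
    return p;
}

void operator delete(void* p) noexcept {
    if (p == nullptr) return;
    if (trace::forget(p)) { std::free(p); return; }
    // Assumption: once the table has overflowed, untracked pointers are valid.
    if (trace::overflowed) { std::free(p); return; }
    trace::report(p);  // not live: report it and do not free it again
}

void operator delete(void* p, std::size_t) noexcept { ::operator delete(p); }
```

With this linked into the test binary, a GDB breakpoint on `trace::report` stops at the offending delete while the heap is still intact. Whether deletes issued inside the shared libraries reach these replacements depends on how the symbols are resolved on the cluster, so treat a silent run as inconclusive.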

The compiler is GCC 7.1 and the Boost version is 1.65.1, although I have already tried other Boost versions and GCC 5.3.

I reduced one testcase to this:

  • Link against the library
  • BOOST_AUTO_TEST_CASE(...)
  • Throw std::runtime_error

So the problem is somewhere in the static init/finalize of the library.

Flamefire
  • You have a bug in your code, somewhere, that ends up corrupting memory, at some point. You need to find your bug, and fix it. Without a [mcve] that anyone on stackoverflow.com can try to reproduce your error, nothing further can be said. – Sam Varshavchik Apr 09 '18 at 12:29
  • Due to the extreme size of the library under test, I'm unable to compress this into an MCVE as I normally would. So I'm asking **how** to find the bug, which might not be in the library itself (it only occurs in one specific environment). I tried to apply the usual tool for this class of bugs (valgrind) but failed, because the trace leads me straight to Boost or libc. Hence the question about further analysis methods. – Flamefire Apr 09 '18 at 12:37
  • Unfortunately, there is no cookie-cutter, paint-by-numbers, documented process for debugging code. My usual approach here is to start by verifying that the bug is reliably reproducible. Then I start removing large chunks of code, one at a time, by commenting them out, or bypassing them in some way, so either they don't do anything, or return predetermined results. Eventually the bug stops occurring. Then I have some level of certainty that the bug lies in the last commented out/disabled chunk of code. The process gets repeated until the bug gets isolated. – Sam Varshavchik Apr 09 '18 at 13:35

1 Answer

Are you using datasets (Data Driven Test Cases)?

If so, you might be running into https://svn.boost.org/trac10/ticket/13380

I've encountered and analyzed this before here: Boost's data-driven tests' join operator `+` corrupts first column

sehe