I'm having a problem that occurs while executing Boost.Test testcases on a cluster. The error is: *** glibc detected *** ...myprogram.test: corrupted double-linked list: 0x000000000096b4d0 ***
Running valgrind on this gives me:
==9687== Invalid free() / delete / delete[] / realloc()
==9687== at 0x4A06016: operator delete(void*) (vg_replace_malloc.c:480)
==9687== by 0x3A81035D2C: __cxa_finalize (in /lib64/libc-2.12.so)
==9687== by 0x721CD05: ??? (in /lib/libboost_unit_test_framework-gcc71-mt-d-1_65_1.so.1.65.1)
==9687== by 0x72ABF9C: ??? (in /lib/libboost_unit_test_framework-gcc71-mt-d-1_65_1.so.1.65.1)
==9687== by 0x3A81035991: exit (in /lib64/libc-2.12.so)
==9687== by 0x3A8101ED23: (below main) (in /lib64/libc-2.12.so)
==9687== Address 0x9919d80 is 0 bytes inside a block of size 18 free'd
==9687== at 0x4A06016: operator delete(void*) (vg_replace_malloc.c:480)
==9687== by 0x3A81035991: exit (in /lib64/libc-2.12.so)
==9687== by 0x3A8101ED23: (below main) (in /lib64/libc-2.12.so)
The stacktrace from GDB looks like this:
#0 0x0000003a81032495 in raise () from /lib64/libc.so.6
#1 0x0000003a81033c75 in abort () from /lib64/libc.so.6
#2 0x0000003a810703a7 in __libc_message () from /lib64/libc.so.6
#3 0x0000003a81075dee in malloc_printerr () from /lib64/libc.so.6
#4 0x0000003a810761f3 in malloc_consolidate () from /lib64/libc.so.6
#5 0x0000003a81078c18 in _int_free () from /lib64/libc.so.6
#6 0x00000000005feae8 in boost::checked_array_delete<char> (x=0x991a20 "\210\350\070\201:") at /include/boost-1_65_1/boost/core/checked_delete.hpp:41
#7 0x00000000005fbd21 in boost::scoped_array<char>::~scoped_array (this=0x94bd80, __in_chrg=<optimized out>) at /include/boost-1_65_1/boost/smart_ptr/scoped_array.hpp:69
#8 0x00000000005f9d36 in boost::execution_monitor::~execution_monitor (this=0x94bd60, __in_chrg=<optimized out>)
at /include/boost-1_65_1/boost/test/execution_monitor.hpp:316
#9 0x00000000005fbd3c in boost::unit_test::unit_test_monitor_t::~unit_test_monitor_t (this=0x94bd60, __in_chrg=<optimized out>)
at /include/boost-1_65_1/boost/test/unit_test_monitor.hpp:33
#10 0x0000003a81035992 in exit () from /lib64/libc.so.6
#11 0x0000003a8101ed24 in __libc_start_main () from /lib64/libc.so.6
#12 0x00000000005f5b59 in _start ()
This happens whenever an uncaught exception is thrown, including test failures, and on some other (currently unknown) occasions. The crash on exception is 100% reproducible.
The program itself seems fine, because it runs locally without any such crashes, so I assume the cause is an incompatibility between some modules on the cluster.
To rule this out, I recompiled Boost and OpenBLAS, but I'm still using a few other libraries that I'd rather not rebuild just to test each of them, since that would take a lot of time. Those are libSSH2, GPI2 and HDF5. They don't appear in the ldd output, so I assume they are linked statically (I'm not the author of the tests), and I think they are unlikely to cause problems:
linux-vdso.so.1 =>
libpthread.so.0 => /lib64/libpthread.so.0
librt.so.1 => /lib64/librt.so.1
libboost_filesystem-gcc71-mt-d-1_65_1.so.1.65.1 => /lib/libboost_filesystem-gcc71-mt-d-1_65_1.so.1.65.1
libboost_program_options-gcc71-mt-d-1_65_1.so.1.65.1 => /lib/libboost_program_options-gcc71-mt-d-1_65_1.so.1.65.1
libboost_coroutine-gcc71-mt-d-1_65_1.so.1.65.1 => /lib/libboost_coroutine-gcc71-mt-d-1_65_1.so.1.65.1
libboost_context-gcc71-mt-d-1_65_1.so.1.65.1 => /lib/libboost_context-gcc71-mt-d-1_65_1.so.1.65.1
libboost_iostreams-gcc71-mt-d-1_65_1.so.1.65.1 => /lib/libboost_iostreams-gcc71-mt-d-1_65_1.so.1.65.1
libboost_regex-gcc71-mt-d-1_65_1.so.1.65.1 => /lib/libboost_regex-gcc71-mt-d-1_65_1.so.1.65.1
libboost_thread-gcc71-mt-d-1_65_1.so.1.65.1 => /lib/libboost_thread-gcc71-mt-d-1_65_1.so.1.65.1
libboost_date_time-gcc71-mt-d-1_65_1.so.1.65.1 => /lib/libboost_date_time-gcc71-mt-d-1_65_1.so.1.65.1
libboost_chrono-gcc71-mt-d-1_65_1.so.1.65.1 => /lib/libboost_chrono-gcc71-mt-d-1_65_1.so.1.65.1
libboost_atomic-gcc71-mt-d-1_65_1.so.1.65.1 => /lib/libboost_atomic-gcc71-mt-d-1_65_1.so.1.65.1
libboost_system-gcc71-mt-d-1_65_1.so.1.65.1 => /lib/libboost_system-gcc71-mt-d-1_65_1.so.1.65.1
libboost_serialization-gcc71-mt-d-1_65_1.so.1.65.1 => /lib/libboost_serialization-gcc71-mt-d-1_65_1.so.1.65.1
libdl.so.2 => /lib64/libdl.so.2
libssl.so.10 => /usr/lib64/libssl.so.10
libgssapi_krb5.so.2 => /lib64/libgssapi_krb5.so.2
libkrb5.so.3 => /lib64/libkrb5.so.3
libcom_err.so.2 => /lib64/libcom_err.so.2
libk5crypto.so.3 => /lib64/libk5crypto.so.3
libresolv.so.2 => /lib64/libresolv.so.2
libcrypto.so.10 => /usr/lib64/libcrypto.so.10
libz.so.1 => /lib64/libz.so.1
libstdc++.so.6 => /sw/global/compilers/gcc/7.1.0/lib64/libstdc++.so.6
libm.so.6 => /lib64/libm.so.6
libgcc_s.so.1 => /sw/global/compilers/gcc/7.1.0/lib64/libgcc_s.so.1
libc.so.6 => /lib64/libc.so.6
/lib64/ld-linux-x86-64.so.2
libbz2.so.1 => /lib64/libbz2.so.1
liblzma.so.0 => /usr/lib64/liblzma.so.0
libicudata.so.42 => /usr/lib64/libicudata.so.42
libicui18n.so.42 => /usr/lib64/libicui18n.so.42
libicuuc.so.42 => /usr/lib64/libicuuc.so.42
libkrb5support.so.0 => /lib64/libkrb5support.so.0
libkeyutils.so.1 => /lib64/libkeyutils.so.1
libselinux.so.1 => /lib64/libselinux.so.1
From what I have found, I think the second free is the "correct" one, since it is the smart pointer releasing its memory. So the first delete is the wrong one, but it comes from inside exit, which does not help me.
How can I find out why and how that pointer is freed twice? Note that I don't have root on the cluster, so debug symbols for GCC's libs are not available.
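To make the question concrete, here is a rough sketch of the kind of bookkeeping that could flag a second free of the same address from inside the program itself, without root or debug symbols. `PointerTracker` and everything in it are hypothetical names for illustration, not part of Boost or glibc. The sketch assumes single-threaded use and a bounded number of tracked addresses.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: remembers which addresses are live so that a second
// free of the same address can be flagged. It uses a fixed array and never
// allocates, so it could be called from a replacement operator new/delete
// without recursing into the allocator. Not thread-safe.
class PointerTracker {
public:
    enum class FreeResult { Ok, DoubleFree, Untracked };

    // Record that `p` was just handed out by the allocator.
    void on_alloc(const void* p) {
        assert(p != nullptr);
        for (std::size_t i = 0; i < used_; ++i) {
            if (slots_[i].ptr == p) {  // address reused after an earlier free
                slots_[i].live = true;
                return;
            }
        }
        if (used_ < kCapacity) {
            slots_[used_++] = Slot{p, true};
        }
        // When the table is full, later addresses are simply not tracked.
    }

    // Classify a free of `p`; the caller decides whether to log or abort.
    FreeResult on_free(const void* p) {
        for (std::size_t i = 0; i < used_; ++i) {
            if (slots_[i].ptr == p) {
                if (!slots_[i].live) return FreeResult::DoubleFree;
                slots_[i].live = false;
                return FreeResult::Ok;
            }
        }
        return FreeResult::Untracked;
    }

private:
    struct Slot { const void* ptr; bool live; };
    static constexpr std::size_t kCapacity = 4096;
    Slot slots_[kCapacity] = {};
    std::size_t used_ = 0;
};
```

If something like this were called from replacement `operator new` and `operator delete` in the test executable, a `DoubleFree` result could be the point at which to print a call stack, for example with `backtrace` and `backtrace_symbols_fd` from glibc's `<execinfo.h>`. Frames inside stripped libraries would still show up as bare addresses.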
The compiler is GCC 7.1 with Boost 1.65.1, although I have already tried other Boost versions and GCC 5.3.
I reduced one testcase to this:
- Link against the library
- Define a BOOST_AUTO_TEST_CASE(...)
- Throw std::runtime_error
So the problem is somewhere in the static init/finalize of the library.
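To illustrate the kind of static init/finalize problem I suspect, here is a toy model of exit-time cleanup. It is only a sketch: `ExitModel`, `Finalizer` and the owner labels are hypothetical names, and this is not how glibc actually stores its handlers. It assumes that finalizers run in reverse order of registration and that two modules each registered a finalizer for the same object address. Whether that is what actually happens in my case is exactly what I don't know.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical record of one exit-time finalizer.
struct Finalizer {
    std::string owner;   // illustrative label for the registering module
    const void* object;  // address of the object the finalizer destroys
};

// Toy model of the exit() handler list; not glibc's real data structure.
class ExitModel {
public:
    // Mimics registration during static initialization.
    void register_finalizer(const std::string& owner, const void* object) {
        assert(object != nullptr);
        finalizers_.push_back(Finalizer{owner, object});
    }

    // Mimics exit(): walk the finalizers in reverse registration order and
    // return the owners that try to destroy an already destroyed address.
    std::vector<std::string> run() {
        std::set<const void*> destroyed;
        std::vector<std::string> offenders;
        for (auto it = finalizers_.rbegin(); it != finalizers_.rend(); ++it) {
            const bool first_time = destroyed.insert(it->object).second;
            if (!first_time) {
                offenders.push_back(it->owner);  // second destruction
            }
        }
        finalizers_.clear();
        return offenders;
    }

private:
    std::vector<Finalizer> finalizers_;
};
```

Registering the same address once under the label "shared library" and then once under "executable" makes `run()` report "shared library". In this model the registration made first runs last, so it is the one that hits the already destroyed object.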