We are using docker devcontainers for our development.

The container is running Ubuntu 22.04, with gcc-11.3 and valgrind-3.18.1.

We run our unit tests through valgrind in order to check for leaks etc, using the following command:

/usr/bin/valgrind \
    -v \
    --leak-check=full \
    --show-leak-kinds=definite,indirect,possible \
    --num-callers=50 \
    --track-origins=yes \
    --error-exitcode=1 \
        /src/.build/release/utils/test/foo_unit_test

The unit test is using embedded python, and calls Py_Finalize() at the end to clean up python allocations.

GTEST_API_ int main(int argc, char** argv)
{
    testing::InitGoogleTest(&argc, argv);
    int res = RUN_ALL_TESTS();
    Py_Finalize();                          // clean up python allocations
    return res;
}

Our company uses AWS WorkSpaces for employee workstations, so we're running our devcontainer on an Amazon Linux 2 host, which reports the following:

$ lsb_release 
LSB Version:    :core-4.1-amd64:core-4.1-noarch

$ uname -a
Linux workstation 5.15.93-55.139.amzn2.x86_64 #1 SMP Tue Feb 14 21:47:11 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Running the above valgrind command within our Ubuntu 22.04 docker container on the above Amazon Linux 2 host, no leaks are reported.

However, we also have a self-hosted CI build server running Ubuntu 20.04 as the host OS, which reports the following:

$ cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.5 LTS"

$ uname -a
Linux build_server 5.15.0-48-generic #54~20.04.1-Ubuntu SMP Thu Sep 1 16:17:26 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Running the same valgrind command within the same Ubuntu 22.04 docker container on the build server reports the following leaks:

==333== HEAP SUMMARY:
==333==     in use at exit: 647,380 bytes in 299 blocks
==333==   total heap usage: 265,491 allocs, 265,192 frees, 85,629,984 bytes allocated
==333== 
==333== Searching for pointers to 299 not-freed blocks
==333== Checked 18,296,672 bytes
==333== 
==333== 568 bytes in 1 blocks are possibly lost in loss record 28 of 169
==333==    at 0x4848899: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==333==    by 0x4AAAEB8: _PyObject_GC_Malloc (in /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0)
==333==    by 0x4AAB1EB: _PyObject_GC_NewVar (in /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0)
...
==333==    by 0x4A6B1FD: PyImport_ImportModule (in /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0)
==333==    by 0x4A45E6C: _PyCodec_Lookup (in /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0)
==333==    by 0x4A808AE: Py_InitializeFromConfig (in /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0)
==333==    by 0x4A82C1B: Py_InitializeEx (in /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0)

==333== 664 bytes in 1 blocks are possibly lost in loss record 77 of 169
==333==    at 0x484DCD3: realloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==333==    by 0x4AA6238: _PyObject_GC_Resize (in /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0)
==333==    by 0x48F362D: _PyEval_EvalFrameDefault (in /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0)
...
==333==    by 0x4961EB5: _PyObject_FastCallDictTstate (in /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0)
==333==    by 0x496204F: _PyObject_Call_Prepend (in /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0)
==333==    by 0x496222B: _PyObject_Call (in /usr/lib/x86_64-linux-gnu/libpython3.10.so.1.0)

etc...

==333== LEAK SUMMARY:
==333==    definitely lost: 0 bytes in 0 blocks
==333==    indirectly lost: 0 bytes in 0 blocks
==333==      possibly lost: 2,728 bytes in 4 blocks
==333==    still reachable: 644,652 bytes in 295 blocks
==333==         suppressed: 0 bytes in 0 blocks
==333== Reachable blocks (those to which a pointer was found) are not shown.
==333== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==333== 
==333== ERROR SUMMARY: 4 errors from 4 contexts (suppressed: 0 from 0)

  • I thought Valgrind did an LD_PRELOAD on malloc etc., which is surely the same on both machines since they're running the same version. So why am I getting different results when running the same code in the same docker container on different hosts?

  • Without having to resort to suppressions, is there anything I can do to ensure this memory is released and valgrind doesn't think it's a possible leak?

Steve Lorimer

1 Answer

If you fix your 295 leaks then they will no longer be reported. OK, that's not really helpful. If one config is leaking and not the other then they are doing something different. You need to do some parallel debugging to see where one is freeing the memory and why the other isn't.

You may well see some variation between platforms in the detection of 'possible' and 'definite' leaks. The difference between the two comes from how the search works: on program exit memcheck scans memory and registers for pointers to the blocks that were not freed. If it finds a pointer to the start of a block it counts the block as 'in use' (reported as 'still reachable'), if it only finds a pointer to somewhere inside the block it counts it as 'possible', and if it finds no pointer to the block at all it is 'definite'.
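
As a rough illustration of that classification, here is a deliberately simplified model. It is not memcheck's actual code: the names `LeakKind` and `classify` are made up for this sketch, `found` stands for every pointer-sized value seen during the scan, and it ignores 'indirect' leaks and memcheck's interior-pointer heuristics.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical names, for illustration only.
enum class LeakKind { StillReachable, PossiblyLost, DefinitelyLost };

// Classify one not-freed block [start, start + size) given the values
// found while scanning memory and registers at exit.
LeakKind classify(std::uintptr_t start, std::size_t size,
                  const std::vector<std::uintptr_t>& found)
{
    bool interior = false;
    for (std::uintptr_t value : found) {
        if (value == start) {
            return LeakKind::StillReachable;   // pointer to the start of the block
        }
        if (value > start && value < start + size) {
            interior = true;                   // pointer to somewhere inside the block
        }
    }
    // Only an interior pointer: 'possible'. No pointer at all: 'definite'.
    return interior ? LeakKind::PossiblyLost : LeakKind::DefinitelyLost;
}
```

The point of the model is that the result depends on which values happen to be lying around in memory and registers at exit, and that can differ from one run or machine to another.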

'Possible' leaks are usually caused either by random values or leftovers that happen to point into a block, or by things like memory pools or allocators that add a redzone or length at the start of the allocated memory and return the address after that.
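
To make the allocator case concrete, here is a minimal sketch of an allocator that stores a length in front of the payload. All of the names (`kHeaderSize`, `prefixed_alloc`, `prefixed_length`, `prefixed_free`) are hypothetical and are not taken from libpython or any other library. The caller only ever holds an address past the header, so if such a block is still allocated at exit, the only pointer memcheck can find is an interior one, and the block would typically show up as 'possibly lost'.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical allocator, for illustration only.
// The header is padded to the strictest alignment so the payload stays aligned.
constexpr std::size_t kHeaderSize = alignof(std::max_align_t);
static_assert(kHeaderSize >= sizeof(std::size_t), "header must hold the length");

// Allocate `length` payload bytes; the length is stored at the start of the block.
void* prefixed_alloc(std::size_t length)
{
    auto* block = static_cast<unsigned char*>(std::malloc(kHeaderSize + length));
    if (block == nullptr) {
        return nullptr;
    }
    std::memcpy(block, &length, sizeof length);
    return block + kHeaderSize;   // interior pointer: the block start is never exposed
}

// Read back the length stored in front of the payload.
std::size_t prefixed_length(const void* payload)
{
    assert(payload != nullptr);
    std::size_t length = 0;
    std::memcpy(&length,
                static_cast<const unsigned char*>(payload) - kHeaderSize,
                sizeof length);
    return length;
}

// Step back to the real block start before handing it to free().
void prefixed_free(void* payload)
{
    if (payload != nullptr) {
        std::free(static_cast<unsigned char*>(payload) - kHeaderSize);
    }
}
```

The traces in the question go through _PyObject_GC_Malloc, which may well follow a similar pattern (a GC header in front of the object), but treat that as an assumption to verify rather than a diagnosis.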

Finally, Valgrind doesn't use LD_PRELOAD to replace malloc. It intercepts the syscall to mmap. When the mapping is a file it takes a peek at the file, and if it's an ELF shared library (or a Mach-O shared library on macOS) it will trigger reading DWARF debug info and redirection of functions. When it sees malloc and family, it just replaces the function pointer with its own intercept. In the case of an exe that links with a static libc, all of the above gets done when the exe is loaded, using the same mechanism.

Paul Floyd
  • It's the exact same source code, built using the exact same toolchain running inside a docker container of the exact same docker image. The only variable is the host machine the docker container is running on. How on earth do you debug that? – Steve Lorimer Mar 16 '23 at 12:03
  • Start with the machine without leaks. Use gdb and put a breakpoint where memcheck says the leaking memory is allocated. When you get there, note the address and then set a conditional breakpoint on free with a condition that the address to be freed is the noted address. When you get there note the backtrace. Then try to debug on the second machine to see why you don't free the memory. – Paul Floyd Mar 16 '23 at 13:29
  • Also look at the output of lscpu and particularly the Flags on the two machines. That may change the code produced by the compiler and also the code executed in libc. – Paul Floyd Mar 16 '23 at 13:30
  • Thanks, will give it a bash – Steve Lorimer Mar 16 '23 at 13:39
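
The lscpu comparison suggested in the comments can be done by eye, but a small helper makes the difference explicit. This is only a sketch: `parse_flags` and `only_in` are hypothetical names, and it assumes you have already copied the space-separated value of the Flags line from each machine into a string.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <sstream>
#include <string>

// Split a whitespace-separated flag list into a sorted set.
std::set<std::string> parse_flags(const std::string& line)
{
    std::istringstream in(line);
    std::set<std::string> flags;
    for (std::string flag; in >> flag;) {
        flags.insert(flag);
    }
    return flags;
}

// Return the flags present in `a` but missing from `b`.
std::set<std::string> only_in(const std::set<std::string>& a,
                              const std::set<std::string>& b)
{
    std::set<std::string> result;
    std::set_difference(a.begin(), a.end(), b.begin(), b.end(),
                        std::inserter(result, result.end()));
    return result;
}
```

Calling `only_in` in both directions gives the flags unique to each host, which is where any CPU-dependent code path in libc or libpython would have to come from.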