Corrupted stack root cause detection

Question

I have a problem with a corrupted stack in a multithreaded application.

There is a class:

class A {
public:
    // some public methods
private:
    // some references to other objects, like:
    ClassA& ref;
    ClassB& ref2;
    // ...
    // some fields, like:
    std::map<std::string, enumClass> ...
    std::mutex ...
    std::map<std::string, someClass> ...
    std::mutex ...  // another mutex
    std::map<std::string, std::pair<ClassB, someEnum>> corrupted_map;
    bool isTrue;
};

To be more specific, the issue appeared as a segmentation fault, and that segfault is caused by operator[] on corrupted_map. A debug session also showed that one of the fields of the STL tree had been changed without any operation on corrupted_map. That is why I think it is stack memory corruption. The right leaf pointer in the header of the STL red-black tree points to inaccessible memory. Further investigation shows that another map operation corrupts corrupted_map. Another problem is that reproducing the issue takes about 30 minutes and requires a lot of traffic (one of the box tests).

Analysing the core dump is pointless, because the corruption happened about 1-2 minutes before the core dump.

The question for you experts is: how to detect origin of that stack memory corruption? Are there other tools I should try?

I tried with:

ASAN address sanitizer - nothing detected until segfault

GDB - too slow; the application is killed before the issue reproduces (a lot of watchdogs, timing dependencies, etc.)

Valgrind - also too slow on the application; on unit tests, nothing detected

static code analyzers - nothing detected

TSAN thread sanitizer - fixed some detected issues, but that did not help

I found the place that corrupts the map by adding an additional thread that scans the STL tree fields every 2 ms, plus additional checks in suspicious methods. But the map whose operation causes the issue is probably corrupted as well.

You appear to have hit the problem with much the same tools I would use. The rest is grunt work. Note that since you seem more interested in tool recommendations than someone magically discerning a fix from what you probably already know is too little code, this question is probably off topic as a tool request or needs focus because it's asking about general multithread debugging tips. — user4581301, Apr 14 '23 at 20:05
Needs a minimum reproducible sample of the error-causing code. Any sort of UB could lead to catastrophic consequences. That is the reason behind promoting guidelines. Pointers, arrays, references, data shared between threads... all could cause such behavior, and each can be safeguarded differently. In the design of your code, UB considerations - along with maintenance - must always have the top priority. But most often a fast solution is what teams settle with. Fast solutions are painful developer slayers. — Red.Wave, Apr 14 '23 at 20:07
Most memory corruption comes from misuse of pointers. In C++, if you avoid use of raw pointers and C-style arrays, most of these should be mitigated. Use smart pointers, STL containers, etc. — Barmar, Apr 14 '23 at 20:13
Most of `corrupted_map` isn't stored on the stack, it's on the heap, so your conclusion is likely wrong. We'll need a minimal reproducible example to help you — Alan Birtles, Apr 14 '23 at 20:14
@AlanBirtles well, the header of the STL map (the STL red-black tree) is stored on the stack. I mean the pointers _M_parent, _M_right, _M_left... One of those pointers is corrupted: not the data it points to, but the pointer value itself. That is why I assume there is a problem with stack memory. — memsetter, Apr 14 '23 at 20:15
And is your instance of `A` stored on the stack? Seems unlikely with your multi threaded code — Alan Birtles, Apr 14 '23 at 20:18
@Red.Wave I would love to have a reproducible example, but it would probably require pasting the code of the project (which I cannot do) plus the source of the boxtest :D. I did not manage to reproduce it without huge network traffic. — memsetter, Apr 14 '23 at 20:18
@AlanBirtles yes it is. It is a member of the "greater" class, and the instance of that "greater" class is also created on the stack. — memsetter, Apr 14 '23 at 20:20
If the corrupted pointer is not accessed often you can try to set a breakpoint on change of its memory address and see if and when that triggers. — Pepijn Kramer, Apr 14 '23 at 20:20
@PepijnKramer it is nearly impossible to use gdb with this app — memsetter, Apr 14 '23 at 20:21
This is tagged linux, so I assume you are using GNU libstdc++. In this case you could compile with `-D_GLIBCXX_ASSERTIONS` to enable bounds checking and algorithm precondition checks in the entire standard library. If that doesn't help you can go a step further and compile with `-D_GLIBCXX_DEBUG`, but that comes with extra caveats because it's not ABI compatible. See these manual pages for details: https://gcc.gnu.org/onlinedocs/libstdc++/manual/using_macros.html, https://gcc.gnu.org/onlinedocs/libstdc++/manual/debug_mode.html — Henri Menke, Apr 14 '23 at 20:29

score 1 · Accepted Answer · answered Apr 15 '23 at 01:32

how to detect origin of that stack memory corruption?

Almost certainly this is not stack corruption, but heap corruption: none of the elements of the map are on the stack.

ASAN address sanitizer - nothing detected until segfault

That is surprising -- ASan is usually very good at detecting heap corruption.

There are a few ways I'd approach this:

1. Run the ASan test 10 or more times.
2. Adjust ASan runtime flags, in particular quarantine_size_mb.

Why (1)? Sometimes ASan detects a problem and starts reporting it, but before it can finish another thread hits SIGSEGV and causes the process to die without any reports. Repeating the test 100 times may get you a report in one of them; one should be enough!

Why (2)? As the flag description says, use-after-free may not be detected if you are doing a lot of allocations.

You could also enable detect_stack_use_after_return, and it may detect existing errors, though I doubt you really have a stack problem here.

P.S. Henri Menke's suggestion to use -D_GLIBCXX_ASSERTIONS and -D_GLIBCXX_DEBUG is also a very good one; see the libstdc++ documentation linked in his comment.
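Because these macros have to be set consistently across the build, it can help to have the test binary report which checking mode it was actually compiled with. A minimal sketch, assuming a hypothetical helper named `libstdcxx_check_mode` that a test harness could log at startup:

```cpp
#include <string>

// Hypothetical helper (not part of libstdc++): reports which libstdc++
// checking macro this translation unit was compiled with.
// _GLIBCXX_DEBUG is tested first because debug mode also turns on the
// assertion checks.
std::string libstdcxx_check_mode() {
#if defined(_GLIBCXX_DEBUG)
    return "debug";
#elif defined(_GLIBCXX_ASSERTIONS)
    return "assertions";
#else
    return "none";
#endif
}
```

Building with `-D_GLIBCXX_DEBUG` should make this return "debug"; a result of "none" in a test run suggests the flag never reached that part of the build.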

I tried with `-D_GLIBCXX_ASSERTIONS` (forgot to mention) and it also helped to fix another undefined behaviour. I think ASan should detect the issue when the map is getting corrupted (`corrupted_map` is not used for up to 1-2 minutes); after that, operator[] on the map just segfaults while dereferencing an inaccessible pointer (to be more precise, when comparing objects). — memsetter, Apr 15 '23 at 06:25

score 1 · Answer 2 · answered Apr 15 '23 at 13:09

Solved. Indeed, I had only used -D_GLIBCXX_ASSERTIONS, but -D_GLIBCXX_DEBUG showed its power. The mistake was extremely basic: writing through an iterator that points to the end (equal to .end()) of the previous map. And yup, you were right guys, it was corruption of the heap, not the stack.
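A minimal sketch of this class of bug and its fix. The real project code is not shown here, so the helper name `update_existing` and the `int` value type are hypothetical:

```cpp
#include <map>
#include <string>

// Hypothetical helper: overwrite the value for `key` only if the key exists.
// Returns false (and leaves the map untouched) when the key is missing.
bool update_existing(std::map<std::string, int>& m,
                     const std::string& key, int value) {
    auto it = m.find(key);
    if (it == m.end()) {
        // The buggy pattern is `it->second = value;` without this check.
        // Writing through end() is undefined behaviour and can silently
        // overwrite unrelated memory; libstdc++ debug mode
        // (-D_GLIBCXX_DEBUG) reports it at the point of the write instead.
        return false;
    }
    it->second = value;
    return true;
}
```

Checking the iterator against `.end()` before every write is the whole fix; the crash only shows up later, in whatever code next touches the overwritten memory.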
