I have a core dump file where pthread_mutex_destroy() has returned an error, probably because in the pthread_mutex_t data structure the __nusers field is set to 4294967295 (0xFFFFFFFF). Here are the full values:

mMutex = {
    __data = {
        __lock = 0,
        __count = 0,
        __owner = 0,
        __nusers = 4294967295,
        __kind = 1,
        __spins = 0,
        __elision = 0,
        __list = {
            __prev = 0x0,
            __next = 0x0
        }
    },
    __size = '\000' <repeats 12 times>, "\377\377\377\377\001", '\000' <repeats 22 times>,
    __align = 0
}

This is a recursive mutex. The code is running on a RHEL 8 system.

So at first glance this looks like __nusers was somehow decremented once too often. But I don't see how this could happen - calling pthread_mutex_unlock() without locking first leaves the __nusers count at 0 (it returns EPERM, but there shouldn't be any undefined behavior involved for a recursive mutex).

Under what circumstances would __nusers become essentially "negative"?

oliver
  • What error is returned by `pthread_mutex_destroy`? – G.M. Feb 25 '21 at 12:22
  • I suspect the mutex was used before it was initialized or after it was destroyed (destroyed while the lock was held, or locked after destruction). – Marek R Feb 25 '21 at 12:30
  • @G.M. I don't know the exact error code, since it's optimized out in the core dump :-/ I only know that it's not 0. But if I understand https://code.woboq.org/userspace/glibc/nptl/pthread_mutex_destroy.c.html#__pthread_mutex_destroy correctly, the function will return EBUSY for a recursive mutex if __nusers is not 0. – oliver Feb 25 '21 at 13:07
  • Are you saying that the program crashed and dumped core *in* `pthread_mutex_destroy()`, or, as the question text seems to indicate, that you obtained a core dump from some point after `pthread_mutex_destroy()` returned an error code? If the latter, then have you indeed neither captured the return value (which should then be readable from the core dump) nor used `perror()` to emit a characteristic message to `stderr`? – John Bollinger Feb 25 '21 at 16:51
  • @JohnBollinger: the crash (actually the abort() call) is after `pthread_mutex_destroy()`, in my code. That code checks whether the return value is 0, but unfortunately does not print the return value in any way. And even though there's a local variable storing the return value, that variable has been optimized out by the compiler, so I can't see its value in the debugger. – oliver Feb 26 '21 at 09:09
  • Well, as a matter of style and code quality, I would recommend that you precede any call to `abort()` with emitting some kind of diagnostic to `stderr`. You might find the `perror()` function appropriate for this, but because `pthread_mutex_destroy()` returns an error number instead of setting the `errno` variable, you would need to set `errno` yourself. Alternatively, you could use `strerror()` to get the meat of a diagnostic. I would suggest putting in such a change now, and seeing what it tells you when the failure next occurs. – John Bollinger Feb 26 '21 at 12:26
  • You might also consider running your program under Valgrind, as memory corruption from overrunning object bounds is a possible explanation, but do understand that Valgrind cannot catch all such cases. – John Bollinger Feb 26 '21 at 12:29
  • You should also consider working on a [mre]. In part because presenting one would help us help you, but more, in this case, because the exercise is a powerful debugging technique in its own right. – John Bollinger Feb 26 '21 at 12:31

0 Answers