
I use boost::interprocess::managed_mapped_file to create a persisted boost::interprocess::deque.

Under normal circumstances it runs smoothly!

I have however created a stress test that will spawn a process that does rapid reads and writes to the memory map. In my test I then "kill -9" that process spontaneously while running to simulate unwanted power outages.

After only a few attempts the managed_mapped_file becomes inaccessible and/or unresponsive, probably because the file has been corrupted.

The symptoms I see are that boost::interprocess::managed_mapped_file::check_sanity() and boost::interprocess::deque::push_back(...) hang and never return.

I assume, and can accept, that the file gets corrupted due to uncontrollable external circumstances, but how can I detect that the file is corrupt before calling boost::interprocess::deque::push_back(...), which then hangs?

Rgds Klaus

1 Answer


This is known as the lack of robust locking:

If you kill a process that does IO with persistent media, you risk corrupting the data itself, because a transactional operation can be interrupted midway.

In your case, a lock got stuck in the "held" state when the process holding it was terminated. Depending on your operating system, a reboot may help.²

Some very carefully designed protocols/disk formats (usually journaling/log-structured databases) can detect this and automatically recover by rolling back some (partial) transactions to a known-good state.

In short, don't do kill -9 unless you don't care about throwing away the shared data. The stuck lock, in most senses, is the least of your worries.

If you know that the only corruption you are facing can be the state of synchronization primitives¹, you can do as follows:

Workaround?

The usual approach is to time out and forcefully reset the shared resources. This is somewhat easier to manage if you use a separate named interprocess mutex, because you then only have to recreate the mutex itself instead of throwing away the entire managed segment.
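The "time out, then reset" idea can be sketched roughly as follows. This is only an illustration, not Boost's own API: `try_lock_with_timeout` is a hypothetical helper that works with any type exposing `try_lock()`, and `StuckLock` is a made-up stand-in that simulates a mutex left in the "held" state by a killed process.

```cpp
#include <chrono>
#include <thread>

// Polls try_lock() until it succeeds or `timeout` elapses.
// Returns true if the lock was acquired (the caller must unlock it),
// false if the deadline passed first.
template <class Lockable>
bool try_lock_with_timeout(
    Lockable& lock, std::chrono::milliseconds timeout,
    std::chrono::milliseconds poll_interval = std::chrono::milliseconds(5)) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    for (;;) {
        if (lock.try_lock()) return true;
        if (std::chrono::steady_clock::now() >= deadline) return false;
        std::this_thread::sleep_for(poll_interval);
    }
}

// Hypothetical stand-in for a lock whose owner was killed while holding it.
struct StuckLock {
    bool held = true;  // starts out "held" by a process that no longer exists
    bool try_lock() {
        if (held) return false;
        held = true;
        return true;
    }
    void unlock() { held = false; }
    // Models removing and recreating the primitive (the "--force" path).
    void force_reset() { held = false; }
};
```

With Boost.Interprocess, the lockable would presumably be a `boost::interprocess::named_mutex`, and the forced reset would be a call to the static `named_mutex::remove(name)` followed by creating the mutex again. That is only safe if no other live process is still using the mutex.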

Typical interfaces for such workarounds in the wild I've seen:

$ ./myprogram --daemon
# ...
$ killall -9 myprogram

$ ./myprogram --foreground
Waiting for shared resource lock...
Timed out! Exit

$ ./myprogram --foreground --force
Recreating shared resource... Done
Working...

Of course, this is all still pretty Neanderthal.

I believe that most POSIX environments support advisory file-locks (which IIRC are also part of Boost Interprocess). These locks will be released by the kernel on process termination. You might use them to avoid the need for a reboot.
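As a rough sketch of the file-lock idea, assuming a Linux or BSD system where `flock()` is available: the function names and the lock-file path are made up for illustration. Boost.Interprocess wraps a similar mechanism in `boost::interprocess::file_lock`, which is not used here to keep the example dependency-free.

```cpp
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

// Tries to take an exclusive advisory lock on `path` without blocking.
// Returns an open descriptor that owns the lock, or -1 if the lock is
// held elsewhere (or the file could not be opened).
int try_acquire_lock_file(const char* path) {
    const int fd = ::open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return -1;
    if (::flock(fd, LOCK_EX | LOCK_NB) != 0) {
        ::close(fd);
        return -1;
    }
    return fd;
}

// Releases the lock explicitly. If the process is killed instead, the
// kernel closes the descriptor and the lock disappears with it.
void release_lock_file(int fd) {
    ::flock(fd, LOCK_UN);
    ::close(fd);
}
```

A process would acquire the lock before touching the mapped file and keep the descriptor open for as long as it works on it. Because the lock does not survive a kill -9 of its owner, the next process can acquire it immediately, although the data in the mapped file may still be inconsistent.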

Also, a far simpler approach is not to kill the process forcibly. You can send a signal that the process is able to handle (such as SIGTERM) and use it to release the shared resources gracefully, avoiding the problem in the first place.
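A minimal sketch of such graceful handling, assuming the worker runs a loop that can poll a flag between operations: the names `on_terminate`, `install_handlers` and `stop_requested` are invented for this example.

```cpp
#include <csignal>

namespace {
// Set from the signal handler; sig_atomic_t is safe to write there.
volatile std::sig_atomic_t g_stop = 0;
}

// The handler only records the request; cleanup happens in normal code.
extern "C" void on_terminate(int) { g_stop = 1; }

// Registers the handler for the "friendly" termination signals.
// SIGKILL (kill -9) cannot be caught, which is why it is so destructive.
void install_handlers() {
    std::signal(SIGTERM, on_terminate);
    std::signal(SIGINT, on_terminate);
}

// The worker loop checks this between complete operations, so it never
// stops in the middle of a push_back or while holding a lock.
bool stop_requested() { return g_stop != 0; }
```

The read/write loop would then look something like `while (!stop_requested()) { /* one complete operation */ }`, followed by unlocking and flushing before the process exits.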


¹ (e.g. you only have a fixed-size data-structure with no updates that require transactional semantics)

² I remember wading through lots of platform-dependent code in Boost Interprocess that detected whether a reboot had occurred since the last timestamp on a synchronization primitive.

sehe