Boost.Interprocess: scoped_lock seems impossible to acquire after a crash

Question

I'm trying to use Boost's interprocess entities for a simple task:

I have one binary and run two instances of it in parallel (two separate processes) that communicate with each other.

There are two roles, and each instance takes exactly one of them:

Primary-role
Secondary-role

Communication happens between the primary and the secondary: the primary waits for a request from the secondary and replies whenever it receives one.

The secondary, on the other hand, sends a request to the primary and waits for its reply with a timeout. If the timeout expires, it upgrades itself to be the new primary.

Here is the code with explanations:

#include <iostream>
#include <cstring>
#include <new>        // placement new in init()
#include <thread>
#include <unistd.h>   // sleep() in secondary_role()
#include <boost/interprocess/sync/interprocess_mutex.hpp>
#include <boost/interprocess/sync/interprocess_condition.hpp>
#include <boost/interprocess/shared_memory_object.hpp>
#include <boost/interprocess/mapped_region.hpp>
#include <boost/interprocess/sync/scoped_lock.hpp>
#include <boost/date_time/posix_time/posix_time_types.hpp>
#define SHM_NAME "/IPCTest"

using namespace boost::interprocess;

The shared_info struct is the actual shared segment between the two instances. It has two important functions: primary_fn and secondary_fn.

primary_fn locks the mutex and waits on cv_primary. When the wait is over, it notifies cv_secondary.

secondary_fn locks the mutex and notifies cv_primary. Then it waits with a timeout on cv_secondary.

struct shared_info {
private:
    interprocess_mutex      mutex;
    interprocess_condition  cv_primary;
    interprocess_condition  cv_secondary;
    
    int test{0};
    bool primary_request_pending{false};
    bool secondary_request_pending{false};
    
public:
    void primary_fn() {
        std::cout << "primary_fn" << test++ << std::endl;
        scoped_lock<interprocess_mutex> lock(mutex);
        // Capture this explicitly; implicit capture of this via [=] is deprecated in C++20
        cv_primary.wait(lock, [this] { return primary_request_pending; });
        
        primary_request_pending = false;
        secondary_request_pending = true;
        cv_secondary.notify_all();
    }
    
    // return true if needs to be continued as secondary
    bool secondary_fn() {
        std::cout << "secondary_fn" << test++ << std::endl;

        scoped_lock<interprocess_mutex> lock(mutex);
        primary_request_pending = true;
        cv_primary.notify_all();
        
        std::cout << "secondary will wait now" << test++ << std::endl;
        
        // Wait for primary to reply
        if (cv_secondary.timed_wait(lock, 
                                boost::posix_time::microsec_clock::universal_time() + 
                                boost::posix_time::seconds(5),
                                [this] { return secondary_request_pending; })) {
                                    
            secondary_request_pending = false;
            return true;
        }
        else return false;
    }
};

Next is the initialization part, which detects whether the instance takes the primary or the secondary role.

bool primary = true;
bool error = false;
mapped_region sr;
struct shared_info* si;

void init() {
    /* Open an already created Shared region */
    try {
        shared_memory_object shm
        (open_only      //only open
        ,SHM_NAME       //name
        ,read_write     //read-write mode
        );
        
        primary = false;
        
        /* There is an existing shared memory, so map it */
        sr = mapped_region(shm, read_write);

        //Get the address of the mapped region
        void* addr = sr.get_address();

        //Obtain a pointer to the shared structure
        si = static_cast<shared_info*>(addr);
        return;
    }
    catch (interprocess_exception &ex) {
        std::cout << ex.what() << std::endl;
        primary = true;
    }
    
    /* Create a new Shared region */
    try {
        shared_memory_object shm
        (create_only    //only create
        ,SHM_NAME       //name
        ,read_write     //read-write mode
        );
        
        //Set size
        shm.truncate(sizeof(shared_info));
        
        //Map the whole shared memory in this process
        sr = mapped_region(shm, read_write);

        //Get the address of the mapped region
        void* addr = sr.get_address();

        //Construct the shared structure in memory
        si = new (addr) shared_info;
    }
    catch (interprocess_exception &ex) {
        std::cout << ex.what() << std::endl;
        error = true;
    }
}

And these are the test functions:

primary_role: replies up to 5 times.

secondary_role: sends a request to the primary every second, forever. When the primary no longer replies, it upgrades itself to primary_role.

void primary_role(bool upgraded) {
    std::cout << "Primary" << (upgraded? " (role upgraded)" : "") << std::endl;
    for(int i=0;i<5;i++) {
        si->primary_fn();
    }   
}

void secondary_role() {
    std::cout << "Secondary" << std::endl;
    while (si->secondary_fn()) {
        sleep(1);
    }
    std::cout << "Primary died" << std::endl;
    primary_role(true);
}

int main(int argc, char *argv[]) {
    std::cout << "STARTING..." << std::endl;
    init();
    
    if (primary) {
        primary_role(false);
    }
    else {
        secondary_role();
    }
    return 0;
}

If I start two instances (1, 2) and let the primary (1) finish after 5 iterations, the secondary (2) notices and upgrades itself to the new primary. If I then run another instance (3), it starts as secondary, and everything starts over beautifully.

However, if I don't let the first primary (1) finish, everything gets messed up: 2 becomes the new primary, but then 3 ends up in a deadlock.

In that case the secondary (3) never prints this line:

secondary will wait now

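This means it blocks somewhere before that point, apparently while acquiring the mutex (the only other call before the print is cv_primary.notify_all()). I can't figure out why. It would be greatly appreciated if someone could give a hint about what is going wrong here and how I can resolve this problem.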
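
For the case where a lock owner dies while holding the lock, POSIX offers robust mutexes. The sketch below uses plain pthreads, not Boost.Interprocess, so whether it fits this setup is an assumption; init_robust_mutex and lock_recovering are hypothetical helper names. It covers only the mutex: the condition variables in shared_info get no comparable protection.

```cpp
#include <cerrno>
#include <pthread.h>

// Initialise a process-shared, robust mutex (POSIX API, not Boost).
// Returns 0 on success or an errno-style error code.
int init_robust_mutex(pthread_mutex_t* m) {
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0) return rc;
    rc = pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    if (rc == 0) rc = pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    if (rc == 0) rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

// Lock the mutex. If the previous owner died while holding it,
// pthread_mutex_lock returns EOWNERDEAD and the caller owns the lock;
// the mutex is then marked consistent so it stays usable.
// Returns 0 on success; *owner_died reports whether recovery happened.
int lock_recovering(pthread_mutex_t* m, bool* owner_died) {
    int rc = pthread_mutex_lock(m);
    *owner_died = (rc == EOWNERDEAD);
    if (rc == EOWNERDEAD) {
        // The protected data may be half-updated here; a real program
        // would repair it before declaring the mutex consistent.
        rc = pthread_mutex_consistent(m);
    }
    return rc;
}
```

If lock_recovering reports owner_died, the flags guarded by the mutex would need to be reset to a known state before the protocol continues.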

Note1: The primary can crash or exit at any time, but the issue is easily reproduced just by hitting Ctrl+C.

Note2: Before each retry, you have to remove the shared memory file: rm /dev/shm/IPCTest.

Note3: To compile, you need to link with -lpthread -lrt.

Thanks for the thoughtful question. Sadly this is a (very) well-known limitation. I'll try to find the best duplicate. Feel free to follow up with more targeted questions after reading there. - sehe, Oct 28 '21 at 14:04
Actually, I've read some articles from 2010 saying that Boost's interprocess mutexes and conditions do not survive crashes. There is no mention of this in Boost's current documentation, so I assumed it had been resolved by now, but it seems the only way around it is to use the named_* versions of the mutex and condition, as they survive these kinds of crashes and errors better. I haven't tested this heavily yet, but if you could reference some sources here, that would be great. Thanks! - Daniel, Oct 28 '21 at 14:09
