I run the program with mpirun "-np 2". I refer to the process with rank 0 as "master" and the process with rank 1 as "slave".
Goal:
- master occasionally sends a message to slave, such as world.send(1, UPDATE, data);. Other message types include DIE, COMPUTE, etc. These message types are constant integers with unique values.
- slave runs an infinite loop, "listening" for any message from the master. When it receives a message, it sends an acknowledgement back to the master.
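To make the protocol above concrete, here is a minimal sketch of how the tags could be declared. The numeric values and the helper ackTagFor are assumptions made for illustration; the post only states that the tags are constant integers with unique values, and the tag names are the ones that appear in the code below.

```cpp
// Hypothetical tag values: the post only says the tags are unique constant integers.
enum MessageTag : int {
    UPDATE = 1,
    UPDATE_ACK,
    UPDATE2,
    UPDATE2_ACK,
    UPDATE3,
    UPDATE3_ACK,
    COMPUTE,
    RESULT1,
    RESULT2,
    DIE
};

// Hypothetical helper: maps a command tag to the acknowledgement tag the slave
// sends back. Returns -1 for tags that have no acknowledgement in the post
// (COMPUTE is answered with RESULT1 and RESULT2 instead).
inline int ackTagFor(int tag) {
    switch (tag) {
        case UPDATE:  return UPDATE_ACK;
        case UPDATE2: return UPDATE2_ACK;
        case UPDATE3: return UPDATE3_ACK;
        default:      return -1;
    }
}
```

Because an unscoped enum converts implicitly to int, these names can be passed wherever an integer tag is expected. Under this sketch, ackTagFor(UPDATE) yields the same tag the slave passes in world.send(0, UPDATE_ACK, 0).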
Implementation:
slave runs:
    ...
    int updateData, computeData;
    // post one non-blocking receive per message type
    mpi::request updateRequest = world.irecv(0, UPDATE, updateData);
    mpi::request computeRequest = world.irecv(0, COMPUTE, computeData);
    do {
        cerr << "slave ready to take a command" << endl;
        if (updateRequest.test()) {
            cerr << "slave ireceived UPDATE" << endl;
            world.send(0, UPDATE_ACK, 0);
            cerr << "slave sent UPDATE_ACK" << endl;
            /* do something useful
            ...
            ...
            */
            // re-post the receive for the next UPDATE
            updateRequest = world.irecv(0, UPDATE, updateData);
        } else if (computeRequest.test()) {
            ...
        } else {
            // nothing pending: sleep for a second before polling again
            boost::this_thread::sleep( boost::posix_time::seconds(1) );
        }
    } while (true);
while the master runs:
    ...
    world.send(1, UPDATE, 10);
    cerr << "master sent UPDATE" << endl;
    int dummy;
    world.recv(1, UPDATE_ACK, dummy);
    cerr << "master received UPDATE_ACK" << endl;
    ...
More context for the master's code:
    ...
    // update1
    world.send(1, UPDATE, params);
    cerr << "master sent UPDATE" << endl;
    int dummy;
    world.recv(1, UPDATE_ACK, dummy);
    cerr << "master received UPDATE_ACK" << endl;

    // update2
    world.send(1, UPDATE2, params2);
    cerr << "master sent UPDATE2" << endl;
    world.recv(1, UPDATE2_ACK, dummy);
    cerr << "master received UPDATE2_ACK" << endl;

    // update3
    world.send(1, UPDATE3, params3);
    cerr << "master sent UPDATE3" << endl;
    world.recv(1, UPDATE3_ACK, dummy);
    cerr << "master received UPDATE3_ACK" << endl;
    ...
    // training iterations
    do {
        mpi::request computeRecvReq1, computeRecvReq2;
        std::map<int, int> result1, result2;
        // for each line in a text file, the master asks the slave(s)
        // to compute two things and aggregates the results
        for (unsigned sentId = 0; sentId != data.size(); sentId++) {
            // these two functions won't return until at least one slave is "idle"
            CollectSlavesWork1(computeRecvReq1, result1);
            CollectSlavesWork2(computeRecvReq2, result2);
            // async ask the slave to compute and async get the results
            world.isend(1, COMPUTE, sentId);
            computeRecvReq1 = world.irecv(1, RESULT1, result1);
            computeRecvReq2 = world.irecv(1, RESULT2, result2);
        }
        // based on the slave(s) work, the master updates params
        // and sends them again to the slave(s)
        world.send(1, UPDATE, params);
        cerr << "master sent UPDATE" << endl;
        world.recv(1, UPDATE_ACK, dummy); // PROBLEM HAPPENS HERE
        cerr << "master received UPDATE_ACK" << endl;
    } while (!ModelIsConverged());
    ...
Output:
    ...
    slave ready to take a command
    master sent UPDATE
    slave ireceived UPDATE
    slave sent UPDATE_ACK
    master received UPDATE_ACK
    slave ready to take a command
    ...
    slave ready to take a command
    master sent UPDATE
    slave ireceived UPDATE
    slave sent UPDATE_ACK
    slave ready to take a command
    ...
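Problem: the first time the master sends an UPDATE message, everything seems to work. The second time, however, the slave logs that it sent UPDATE_ACK, but the master never logs receiving it, so the master appears to stay blocked in world.recv(1, UPDATE_ACK, dummy).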
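The mismatch can be stated precisely by counting the three UPDATE-related log lines in the Output above. The following is only a sketch: UpdateTrace, traceUpdates and postedOutput are hypothetical names introduced here, and the elided "..." parts of the log are left out, so the counts cover only the lines that were posted.

```cpp
#include <string>
#include <vector>

// Hypothetical summary of the UPDATE handshake as seen in a log.
struct UpdateTrace {
    int masterSent = 0;      // "master sent UPDATE"
    int slaveAcked = 0;      // "slave sent UPDATE_ACK"
    int masterReceived = 0;  // "master received UPDATE_ACK"
};

// Counts the handshake lines. Exact string comparison keeps lines such as
// "master sent UPDATE2" out of the UPDATE counts.
inline UpdateTrace traceUpdates(const std::vector<std::string>& log) {
    UpdateTrace t;
    for (const std::string& line : log) {
        if (line == "master sent UPDATE") ++t.masterSent;
        else if (line == "slave sent UPDATE_ACK") ++t.slaveAcked;
        else if (line == "master received UPDATE_ACK") ++t.masterReceived;
    }
    return t;
}

// The log lines posted above, without the elided "..." parts.
inline std::vector<std::string> postedOutput() {
    return {
        "slave ready to take a command",
        "master sent UPDATE",
        "slave ireceived UPDATE",
        "slave sent UPDATE_ACK",
        "master received UPDATE_ACK",
        "slave ready to take a command",
        "slave ready to take a command",
        "master sent UPDATE",
        "slave ireceived UPDATE",
        "slave sent UPDATE_ACK",
        "slave ready to take a command"
    };
}
```

For the posted lines, traceUpdates(postedOutput()) reports two UPDATE messages sent and two acknowledgements sent by the slave, but only one acknowledgement received by the master.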