I am running an MPI parallel program written in C on Amazon EC2 nodes, using StarCluster to manage the instances. The program compiles fine with `mpicc`, and the executable lives on a mounted volume shared by all nodes. However, when I run it with `mpirun`, some nodes sometimes load an old version of the executable.

For example, with a master and 9 worker nodes, a program that prints "Version 1.0" gives me 10 lines of "Version 1.0". If I change the code to print "Version 1.1", recompile on the master, and run immediately, I get one line of "Version 1.1" and 9 lines of "Version 1.0". If I wait another minute or two before running, I get all ten lines of "Version 1.1".

Why does it take so long for the other nodes to pick up the new executable? Is it an issue with `mpicc`, or with the way I am mounting the shared space?

HoosierPhysics
  • It is not an issue with `mpicc`. It knows nothing about your EC2 infrastructure; it just takes source files and writes a binary executable somewhere. What happens afterwards is not `mpicc`'s concern, nor that of any other compiler. Propagation delays caused by caching are normal in distributed and network storage systems. You should learn to live with them or find other mechanisms to prevent or work around them. A simple solution: give the executable a different name after each build, and the caching issue will be gone. – Hristo Iliev Nov 18 '16 at 08:55
  • Ah, that is a great solution. Thanks – HoosierPhysics Nov 18 '16 at 18:39
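
The renaming suggested in the comment can be sketched as follows. This is only an illustration: `make_build_name` is a hypothetical helper, not part of MPI or StarCluster, and the timestamp format is an arbitrary choice. The idea is that a file name the worker nodes have never seen cannot be served from a stale cache entry, assuming the delay really does come from client-side caching on the shared mount.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical helper (not part of any MPI API): writes
 * "<base>_YYYYMMDD_HHMMSS" (UTC) into out, so that every build
 * gets a file name the other nodes have not cached yet.
 * Returns the length of the name, or 0 if it does not fit or
 * the time cannot be formatted. */
size_t make_build_name(char *out, size_t n, const char *base, time_t t)
{
    assert(out != NULL && base != NULL);

    /* UTC keeps the name independent of each machine's time zone. */
    struct tm *utc = gmtime(&t);
    if (utc == NULL)
        return 0;

    char stamp[32];
    if (strftime(stamp, sizeof stamp, "%Y%m%d_%H%M%S", utc) == 0)
        return 0;

    int len = snprintf(out, n, "%s_%s", base, stamp);
    if (len < 0 || (size_t)len >= n)
        return 0;   /* name was truncated: report failure */

    return (size_t)len;
}
```

Calling `make_build_name(buf, sizeof buf, "prog", time(NULL))` on the master yields a fresh name for each build; that name would then be used as the output file for `mpicc -o` and as the executable passed to `mpirun`. Old binaries pile up on the shared volume with this approach, so they need to be cleaned out from time to time.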
