
I am running an MPI program on a cluster. When the program finishes, the job does not end, so I have to wait for it to hit the walltime limit and be killed.

I am not sure how to debug this. I checked whether the program reaches the MPI finalize call, and it does. I am using the Elemental library.

The final lines of the program:

if (grid.Rank() == 0) std::cout << "Finalize" << std::endl;  // only rank 0 prints this
std::string message = std::string("rank_") +
                      std::to_string(mpi::Rank(mpi::COMM_WORLD)) + "_a";
std::cout << message;
Finalize();        // Elemental's Finalize()
message = message + "b";
std::cout << message;
mpi::Finalize();   // explicit MPI finalize
message = message + "c";
std::cout << message;
return 0;

The output is:

Finalize
rank_0_arank_0_abrank_0_abcmpiexec: killall: caught signal 15 (Terminated).
mpiexec: kill_tasks: killing all tasks.
mpiexec: wait_tasks: waiting for taub205.
mpiexec: killall: caught signal 15 (Terminated).
=>> PBS: job killed: walltime 801 exceeded limit 780
----------------------------------------
Begin Torque Epilogue (Tue Nov  4 16:15:19 2014)
Job ID:           ***
Username:         ***
Group:            ***
Job Name:         mpi_test1
Session:          11270
Limits:
ncpus=1,neednodes=1:ppn=6:m24G:taub,nodes=1:ppn=6:m24G:taub,walltime=00:13:00
Resources:        cput=00:02:12,mem=429524kb,vmem=773600kb,walltime=00:13:21
Job Queue:        secondary
Account:          ***
Nodes:            taub205
End Torque Epilogue
----------------------------------------

I am running with these modules loaded on the cluster (https://campuscluster.illinois.edu/hardware/#taub):

> module list
Currently Loaded Modulefiles:
  1) torque/4.2.9              5) gcc/4.7.1
  2) moab/7.2.9                6) mvapich2/2.0b-gcc-4.7.1
  3) env/taub                  7) mvapich2/mpiexec
  4) blas                      8) lapack
  • The [documentation of Elemental](http://libelemental.org/documentation/0.85-RC1/core/environment.html) states that the `Finalize()` of Elemental frees all resources allocated by Elemental and (if necessary) MPI... My guess is that it calls `MPI_Finalize()` and that calling it twice creates the problem. What happens if you remove `mpi::Finalize();` ? – francis Nov 03 '14 at 18:45
  • Actually that is what I originally did; I added MPI_Finalize() to see if it fixed the issue. Elemental allows you to initialize and finalize MPI yourself, as long as you do both outside of it. – exrhizo Nov 03 '14 at 21:08
  • Ok...my guess is wrong...`if (grid.Rank() == 0) std::cout << "Finalize" << std::endl;` checks that process 0 is ready to finalize. Maybe one of your processes is waiting for a message somewhere. Could you try to add something like `mpi::Barrier()` before printing finalize, or `std::cout << "Finalize " <`… – francis Nov 04 '14 at 10:27
  • When I run one node the result is the same. I will edit to give a more clear test case. – exrhizo Nov 04 '14 at 23:19
  • Looking at your output, you run your code on one node, one cpu, and 6 processes per cpu (`ppn`). And the only one printing `rank_0_arank_0_abrank_0_abc` is the 0 process: the five other processes are waiting for something...until the walltime limit. It is strange that proc 0 managed to reach `return 0`. Does it work properly on your personal computer, using 2 processes? I installed Elemental, and the [SVD example](https://github.com/elemental/Elemental/blob/master/examples/lapack_like/SVD.cpp) worked fine, compiled with `mpiCC main.cpp -o main -std=c++11 -lEl` and run with `mpirun -np 2 main` – francis Nov 05 '14 at 10:17
  • Yes, it works on my computer (although locally I am using MPICH, not MVAPICH2). I didn't include the part where I invoke MPI: `mpiexec -verbose -n 1 ./distributed_memory/aps`. `ppn` refers to the number of threads, either 6 or 12; 6 means one core. I was having the same problem on the cluster with gemm.cpp (another example). – exrhizo Nov 08 '14 at 05:58
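
For reference, here is a minimal sketch of the ending suggested in the comments. It is an illustration only, not the code from the question, and it assumes the Elemental 0.85 API used above (`Initialize`/`Finalize` in Elemental's namespace, `mpi::Barrier`, `mpi::Rank`): the `Grid` is scoped so it is destroyed before teardown, a barrier checks that every rank arrives, output is flushed per rank, and only Elemental's `Finalize()` is called, letting it finalize MPI since it initialized it.

#include <iostream>
#include "El.hpp"
using namespace El;

int main( int argc, char* argv[] )
{
    // Elemental initializes MPI itself (if it has not already been initialized).
    Initialize( argc, argv );
    const int rank = mpi::Rank( mpi::COMM_WORLD );
    {
        // Scope the Grid so it is destroyed before Finalize().
        Grid grid( mpi::COMM_WORLD );
        // ... program body ...
        // Print from every rank, flushed, so a hung rank is visible.
        std::cout << "rank " << rank << " reached the end" << std::endl;
    }
    // Make sure every rank got here before tearing anything down.
    mpi::Barrier( mpi::COMM_WORLD );
    // Only Elemental's Finalize(): per the docs it also finalizes MPI when
    // it initialized it, so there is no separate mpi::Finalize() call.
    Finalize();
    return 0;
}

Run with something like `mpiexec -n 2 ./main`: every rank should print its line, and the job should exit once `Finalize()` returns.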

0 Answers