
I'm trying to schedule a series of MPI jobs on an Ubuntu 14.04 LTS machine using a bash script. Basically, I want a simulation to run on every core for a certain amount of time, then be terminated so the script can move on to the next case once that time has elapsed.

My issue arises when mpirun exits at the end of the first job: it breaks the loop and returns the terminal to my control instead of moving on to the next iteration of the loop.

My script is included below. The file "case_names" is just a text file of directory names, one per line. I've tested the script with other commands, and it works fine until I uncomment the mpirun call.
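For reference, case_names looks something like this (the names here are only placeholders):

case_001
case_002
case_003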

#!/bin/bash

while read line;
do
    # Access case directory
    cd $line
    echo "Case $line accessed"

    # Start simulation
    echo "Case $line starting: $(date)"
    mpirun -q -np 8 dsmcFoamPlus -parallel > log.dsmcFoamPlus &

    # Wait for the 10-hour runtime (36000 seconds)
    sleep 36000

    # Kill job
    pkill mpirun > /dev/null
    echo "Case $line terminated: $(date)"

    # Return to parent directory
    cd ..
done < case_names

Does anyone know of a way to stop mpirun from breaking the loop like this?

So far I've tried the GNOME task scheduler and task-spooler, but neither has worked (likely because the commands I use only become available after certain aliases are invoked). I'd really rather not have to resort to setting up Slurm. I've also tried using the disown command to separate the MPI process from the shell running the scheduling script, and have even written a separate script, run remotely by the scheduling script, whose only job is to kill the processes.

Many thanks in advance!

nathanDonaldson
  • I wonder how you are so concerned about breaking the loop, but not at all about the results from your simulation. This seems like a super wonky way to basically guarantee your results will be broken. Can you not make a timeout in your program? Or [forward a user signal](http://stackoverflow.com/a/15881972/620382) to properly terminate the application. – Zulan Mar 24 '17 at 17:31
  • The code I'm using focuses on averaging of Monte Carlo variables for its solutions, so it converges toward a single solution over time. As such, stopping it early simply limits the accuracy of the final solution rather than corrupting it outright. There is a timeout option in the program (which I will likely end up using), but it still terminates mpi in the same way. The answer I posted seems to work for both killing the process externally and for its internal timeout option. – nathanDonaldson Mar 24 '17 at 21:20

1 Answer


I've managed to find a workaround that allows me to schedule tasks with a bash script like I wanted. Since this solves my issue, I'm posting it as an answer (although I would still welcome an explanation as to why mpirun behaves this way inside loops).

The solution was to move both the mpirun call and the subsequent kill into a separate script, which is itself called by the scheduling script. Since that child bash process contains no loop, there is nothing for mpirun to break when it is killed, and once the child script exits, the scheduling loop continues unimpeded.

My (now working) code is included below.

Scheduling script:

while read line;
do
    cd "$line"
    echo "CWD: $(pwd)"
    echo "Case $line accessed"
    bash ../run_job "$line"    # pass the case name so run_job can print it
    echo "Case $line terminated: $(date)"
    cd ..
done < case_names

Execution script (run_job):

# run_job: start one parallel case, let it run for the allotted time, then kill it
line="$1"    # case name passed in by the scheduling script
mpirun -q -np 8 dsmcFoamPlus -parallel > log.dsmcFoamPlus &
echo "Case $line starting: $(date)"
sleep 600    # allotted runtime in seconds
pkill mpirun

I hope someone will find this useful.
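As for why mpirun broke the loop in the first place, the most likely explanation I've come across (I haven't confirmed it is what happened here) is that mpirun forwards its standard input to rank 0 of the job, so inside a while-read loop the backgrounded job can swallow the rest of case_names, and the next read finds the file already exhausted. If that is the cause, a minimal sketch of an in-loop fix would be to cut mpirun off from the loop's stdin entirely:

while read line;
do
    cd "$line"
    echo "Case $line starting: $(date)"
    # Give mpirun /dev/null as stdin so it cannot consume case_names,
    # which is what the while loop itself is reading from
    mpirun -q -np 8 dsmcFoamPlus -parallel < /dev/null > log.dsmcFoamPlus &
    sleep 36000
    pkill mpirun > /dev/null
    echo "Case $line terminated: $(date)"
    cd ..
done < case_names

Equivalently, feeding the loop through a different file descriptor (while read -u 3 line; do ... done 3< case_names) should keep the two streams from clashing.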

nathanDonaldson