I am using the Slurm job scheduler to run my jobs on a cluster. What is the most efficient way to submit Slurm jobs and check their status from Apache Airflow?
I was able to use an SSHOperator to submit my jobs remotely and check their status every minute until they complete, but I wonder if anyone knows a better way. Below is the SSHOperator task I wrote.
# Airflow 2.x provider imports (assumed); on Airflow 1.10 these live under airflow.contrib
from airflow.providers.ssh.hooks.ssh import SSHHook
from airflow.providers.ssh.operators.ssh import SSHOperator

sshHook = SSHHook(ssh_conn_id='my_conn_id', keepalive_interval=240)
task_ssh_bash = """
cd ~/projects &&
JID=$(sbatch myjob.sh)
echo $JID
sleep 10s # needed
ST="PENDING"
while [ "$ST" != "COMPLETED" ] ; do
ST=$(sacct -j ${JID##* } -o State | awk 'FNR == 3 {print $1}')
sleep 1m
if [ "$ST" == "FAILED" ]; then
echo 'Job final status:' $ST, exiting...
exit 122
fi
echo $ST
"""
task_ssh = SSHOperator(
    task_id='test_ssh_operator',
    ssh_hook=sshHook,
    do_xcom_push=True,
    command=task_ssh_bash,
    dag=dag)
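
For completeness, since do_xcom_push=True is set, the echoed status lines become the task's XCom return value. Here is a minimal sketch (the names print_slurm_output and show_output are made up) of how a downstream task could read it, assuming Airflow 2.x and the default core.enable_xcom_pickling = False, under which SSHOperator pushes its stdout base64-encoded:

from base64 import b64decode

from airflow.operators.python import PythonOperator  # Airflow 2.x path

def print_slurm_output(ti, **kwargs):
    # SSHOperator pushes the command's aggregated stdout to XCom; with the
    # default core.enable_xcom_pickling = False it arrives base64-encoded
    raw = ti.xcom_pull(task_ids='test_ssh_operator')
    print(b64decode(raw).decode('utf-8'))

show_output = PythonOperator(
    task_id='show_output',
    python_callable=print_slurm_output,
    dag=dag)

task_ssh >> show_output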