
I am using the Slurm job scheduler to run my jobs on a cluster. What is the most efficient way to submit Slurm jobs and check on their status using Apache Airflow?

I was able to use an SSHOperator to submit my jobs remotely and check on their status every minute until they complete, but I wonder if anyone knows a better way. Below is the SSHOperator I wrote.

from airflow.providers.ssh.hooks.ssh import SSHHook
from airflow.providers.ssh.operators.ssh import SSHOperator

sshHook = SSHHook(ssh_conn_id='my_conn_id', keepalive_interval=240)

task_ssh_bash = """
cd ~/projects &&
JID=$(sbatch myjob.sh)
echo $JID
sleep 10s # needed
ST="PENDING"
while [ "$ST" != "COMPLETED" ] ; do 
   ST=$(sacct -j ${JID##* } -o State | awk 'FNR == 3 {print $1}')
   sleep 1m
   if [ "$ST" == "FAILED" ]; then
      echo 'Job final status:' $ST, exiting...
      exit 122
   fi
echo $ST
"""

task_ssh = SSHOperator(
    task_id='test_ssh_operator',
    ssh_hook=sshHook,
    do_xcom_push=True,
    command=task_ssh_bash,
    dag=dag)
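
An untested variant of the same loop (just a sketch, assuming sbatch --parsable and sacct's -X/-n/-P flags are available on the cluster) would avoid splitting the sbatch output by hand:

task_ssh_bash_parsable = """
cd ~/projects &&
# --parsable makes sbatch print only the job ID (cut drops a trailing ;cluster, if any)
JID=$(sbatch --parsable myjob.sh | cut -d';' -f1)
echo $JID
sleep 10s  # give Slurm time to register the job in sacct
ST="PENDING"
while [ "$ST" != "COMPLETED" ]; do
   # -X: allocation line only, -n: no header, -P: parsable, unpadded output
   ST=$(sacct -j "$JID" -X -n -P -o State | awk '{print $1}')
   sleep 1m
   if [ "$ST" == "FAILED" ]; then
      echo "Job final status: $ST, exiting..."
      exit 122
   fi
   echo $ST
done
"""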
stardust

1 Answer


I can't give a fully demonstrable example, but my inclination would be to implement an Airflow sensor on top of something like pyslurm. Funnily enough, I came across your question while looking to see whether anyone had already done this!
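
For illustration only, here is a rough, untested sketch of the shape such a sensor might take. SlurmJobSensor is an invented name, not an existing operator, and it polls sacct over SSH in poke(); pyslurm could replace that call if you prefer a native binding.

from airflow.exceptions import AirflowFailException
from airflow.providers.ssh.hooks.ssh import SSHHook
from airflow.sensors.base import BaseSensorOperator


class SlurmJobSensor(BaseSensorOperator):
    """Poll Slurm via sacct until the job reaches a terminal state."""

    template_fields = ("job_id",)

    def __init__(self, *, ssh_conn_id, job_id, **kwargs):
        super().__init__(**kwargs)
        self.ssh_conn_id = ssh_conn_id
        self.job_id = job_id

    def poke(self, context):
        # -X: allocation line only, -n: no header, -P: parsable, unpadded output
        command = f"sacct -j {self.job_id} -X -n -P -o State"
        client = SSHHook(ssh_conn_id=self.ssh_conn_id).get_conn()
        try:
            _, stdout, _ = client.exec_command(command)
            output = stdout.read().decode().strip()
        finally:
            client.close()
        state = output.split()[0] if output else "UNKNOWN"
        self.log.info("Slurm job %s state: %s", self.job_id, state)
        if state in ("FAILED", "CANCELLED", "TIMEOUT", "NODE_FAIL"):
            raise AirflowFailException(f"Slurm job {self.job_id} ended as {state}")
        return state == "COMPLETED"


# Illustrative wiring within the same DAG: submit with an SSHOperator (ideally
# `sbatch --parsable` so only the job id is printed and pushed to XCom), then
# wait on the pulled id. How the SSHOperator encodes its XCom output can vary
# by provider version, so treat this template as a placeholder.
wait_for_job = SlurmJobSensor(
    task_id="wait_for_slurm_job",
    ssh_conn_id="my_conn_id",
    job_id="{{ ti.xcom_pull(task_ids='test_ssh_operator') }}",
    poke_interval=60,
    dag=dag,
)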

EDIT: There is also an interesting topic regarding the use of executors for submitting jobs.

Best of luck

JimCircadian
  • So, was there any success in monitoring (and maybe even submitting) SLURM jobs from Airflow? – Ivan May 30 '23 at 20:24