I'm using Sun GridEngine (Rocks Cluster) on a server to run remote jobs.
When I try to remove jobs with qdel, it often works as expected, but every now and then it just deletes almost everything it finds.
For example, at some point today I had 77 running jobs:
[znorg@server MD]$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
   7711 0.55500 shg_oAll_c znorg        dr    10/30/2012 13:49:07 all.q@compute-0-22.local           1
   7712 0.55500 shg_oCAB_c znorg        dr    10/30/2012 13:49:07 all.q@compute-0-22.local           1
   7873 0.55500 a1h3l_prdA znorg        r     11/08/2012 13:37:22 all.q@compute-0-0.local            1
   7874 0.55500 a1t8k_obsA znorg        r     11/08/2012 13:37:22 all.q@compute-0-18.local           1
   7875 0.55500 a1t8k_prdA znorg        r     11/08/2012 13:37:37 all.q@compute-0-15.local           1
   7877 0.55500 a3zr8_prdA znorg        r     11/08/2012 13:37:37 all.q@compute-0-17.local           1
   7878 0.55500 b1nez_obsA znorg        r     11/08/2012 13:37:52 all.q@compute-0-23.local           1
   7880 0.55500 b2j73_obsA znorg        r     11/08/2012 13:37:52 all.q@compute-0-20.local           1
(...)
   7955 0.55500 b2qcp_prdE znorg        r     11/08/2012 13:44:07 all.q@compute-0-32.local           1
   7956 0.55500 c3o2e_obsE znorg        r     11/08/2012 13:44:22 all.q@compute-0-29.local           1
   7960 0.55500 c3zzp_obsE znorg        r     11/08/2012 13:44:37 all.q@compute-0-27.local           1
   7995 0.55500 s1enh_prdA znorg        r     11/22/2012 16:06:24 all.q@compute-0-33.local           1
   7996 0.55500 s1igd_prdA znorg        r     11/22/2012 16:06:39 all.q@compute-0-33.local           1
   7997 0.55500 s1ixs_prdA znorg        r     11/22/2012 16:06:39 all.q@compute-0-33.local           1
(...)
   8008 0.55500 s1igd_prdD znorg        r     11/22/2012 16:07:39 all.q@compute-0-5.local            1
   8009 0.55500 s1ixs_prdD znorg        r     11/22/2012 16:07:39 all.q@compute-0-13.local           1
   8010 0.55500 s1shg_prdD znorg        r     11/22/2012 16:07:39 all.q@compute-0-31.local           1
I wanted to delete the last 16 jobs, so I typed:
[znorg@server MD]$ qdel 7995 7996 7997 7998 7999 8000 8001 8002 8003 8004 8005 8006 8007 8008 8009 8010
Which returned:
znorg has registered the job 7995 for deletion
znorg has registered the job 7996 for deletion
znorg has registered the job 7997 for deletion
znorg has registered the job 7998 for deletion
znorg has registered the job 7999 for deletion
znorg has registered the job 8000 for deletion
znorg has registered the job 8001 for deletion
znorg has registered the job 8002 for deletion
znorg has registered the job 8003 for deletion
znorg has registered the job 8004 for deletion
znorg has registered the job 8005 for deletion
znorg has registered the job 8006 for deletion
znorg has registered the job 8007 for deletion
znorg has registered the job 8008 for deletion
znorg has registered the job 8009 for deletion
znorg has registered the job 8010 for deletion
So far so good; it looked like everything was going as expected.
But when I checked again, almost all of my other jobs were gone too:
[znorg@server MD]$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
   7712 0.55500 shg_oCAB_c znorg        dr    10/30/2012 13:49:07 all.q@compute-0-22.local           1
   7893 0.55500 a1t8k_prdB znorg        r     11/08/2012 13:39:07 all.q@compute-0-16.local           1
   7929 0.55500 a1t8k_prdD znorg        r     11/08/2012 13:42:07 all.q@compute-0-16.local           1
Am I doing something wrong? What could be happening here?
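In case it helps narrow this down, here is a minimal sketch of how the job list could be captured before and after qdel to see exactly which IDs disappeared. It assumes qstat prints the two header lines shown above with the job ID in the first column; the function names (job_ids, missing_jobs) and the file names are placeholders of mine, not part of Grid Engine.

```shell
#!/bin/sh
# Print the job IDs (first column) from qstat output read on stdin.
# Assumes two header lines (column titles and dashes); anything whose
# first field is not a plain number, such as "(...)", is skipped.
job_ids() {
    awk 'NR > 2 && $1 ~ /^[0-9]+$/ { print $1 }' | sort
}

# List IDs that appear in the "before" file but not in the "after" file.
# comm needs sorted input, which job_ids already produces.
missing_jobs() {
    comm -23 "$1" "$2"
}

# Hypothetical usage (file names are placeholders):
#   qstat | job_ids > before.txt
#   qdel 7995 7996 7997
#   qstat | job_ids > after.txt
#   missing_jobs before.txt after.txt
```

The resulting list includes the jobs that were deleted on purpose, so anything beyond the IDs passed to qdel is what vanished unexpectedly.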