
As a newbie, I have access to a supercomputer running SLURM. The squeue command lists the nodes used by the jobs of different users. A small example is given below.

Why do some users, e.g. user1 (that's actually me), have a single line (see below), while almost all other users have hundreds of lines? (Below is just a very small excerpt for user2; he/she has a lot more.)

I understand that the reason is probably that these are all different jobs (with different JOBIDs), but I am curious: is this an example of the right usage of resources?

  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
3227434 partiti_1 jobname1     user1  R      30:53     20 node[101-110,170-179]
3227442 partiti_1 jobname2     user2  R       0:51      1 node124
3227447 partiti_1 jobname2     user2  R       0:51      1 node124
3227448 partiti_1 jobname2     user2  R       0:51      1 node124
3227458 partiti_1 jobname2     user2  R       0:51      1 node125
3227475 partiti_1 jobname2     user2  R       0:51      1 node125
3227501 partiti_1 jobname2     user2  R       0:51      1 node125
3227514 partiti_1 jobnam3a     user2  R       0:34      1 node150
3227527 partiti_1 jobnam3b     user2  R       0:34      1 node150
3227528 partiti_1 jobnam3c     user2  R       0:25      1 node353
3227529 partiti_1 jobnam3d     user2  R       0:20      1 node353
3227530 partiti_1 jobnam31     user2  R       0:12      1 node336
3227531 partiti_1 jobnam32     user2  R       0:12      1 node321
3227532 partiti_1 jobnam33     user2  R       0:12      1 node336
3227533 partiti_1 jobnam34     user2  R       0:08      1 node323
3227534 partiti_1 jobnam35     user2  R       0:07      1 node323
3227535 partiti_1 jobnam36     user2  R       0:06      1 node322
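For reference, a per-user view of this output can be obtained with standard squeue options (the format string below is only illustrative):

```bash
squeue -u user2                            # all of user2's jobs, one line per job
squeue -u user2 -h -t RUNNING | wc -l      # count user2's running jobs (no header)
squeue -u user2 -o "%.10i %.12j %.6D %R"   # job ID, name, node count, node list
```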

Thanks!

  • Those users have submitted all of those jobs; that's why you're seeing more than one line. You can submit more jobs and observe the same in your own case as well (see the first sketch after this thread). – Azeem Jan 22 '22 at 11:28
  • Thanks, but considering that there are hundreds of such lines submitted within the same or very short time intervals, does that seem like the right usage of resources? – multipole Jan 22 '22 at 11:52
  • It really depends on the kind of job you're running, its requirements, and the resources allocated to that user. For example, the job you're running (3227434) is using 20 nodes. Without looking at your slurm script or having an idea of the resource requirements, one cannot determine whether it's an optimal use of resources or not. But it looks like your script needs 20 distinct nodes to be available in order to run. Looking at some jobs submitted by `user2` (e.g. 3227442, 3227447, 3227448 on node124), we can observe that some nodes are running more than one job. – Azeem Jan 22 '22 at 12:02
  • To objectively answer your question about the "right usage of resources": as users of an HPC cluster, we don't need to worry about that. For an HPC cluster admin, that might be an interesting question to look into. For some use cases, a cluster may simply not have all the required resources available; that can happen too. – Azeem Jan 22 '22 at 12:07
  • @Azeem Thank you for your assistance. As for your observation that, as an ordinary user, I should not worry about that: I am asking because my jobs occasionally abort with an error like "transport retry counter exceeded syndrome". I noticed that it happens when users like user2 have their jobs running, so I suspect that such tons of "overly distributed" jobs somehow interfere with my jobs, causing them to abort. – multipole Jan 22 '22 at 16:34
  • So my question basically is whether hundreds of jobs from other users is something "normal", or something I should consider reporting to the admin if they have an impact on my jobs. Thanks again. – multipole Jan 22 '22 at 16:40
  • Sure, no problem! Not really; that should not be the case, since you've been allocated your resources. BTW, that's a huge number of nodes you're using (not sure about the RAM and CPU cores, though; those must be huge too). The rest of the jobs won't interfere with your jobs per se. Depending on your slurm scripts, there might be some debug logs around that you can look at. The error looks related to [ucx](https://github.com/openucx/ucx); here's the related issue: https://github.com/openucx/ucx/issues/1880. You should check whether it's related to your job or to an already installed cluster module. – Azeem Jan 22 '22 at 16:46
  • Jobs are not run right away if the cluster is busy (i.e. the required resources are not available). For example, if a job requires 10 nodes with 10 CPUs and 64 GB RAM each, the job will be queued (pending state) until those resources are available, and it will run only once they are (see the second sketch after this thread). – Azeem Jan 22 '22 at 16:57
  • @Azeem Thanks for clarifying that the problem is not what I suspected. Yes, UCX appears in the log file ("dc_verbs.c:609 UCX ERROR Send completion with error on qp 0x1557: transport retry counter exceeded syndrome 0x81", "dc_verbs.c:609 UCX ERROR Send completion with error on qp 0x1557: Work Request Flushed Error syndrome 0xf9", and dozens of lines with error details). I really don't understand what the error means, nor do I understand the links you just gave. Anyway, that's a different topic. Should I ask a question about that here, or inform the HPC admin (which I would rather avoid)? Thanks again for your patience. – multipole Jan 22 '22 at 18:05
  • First, you need to figure out whether this is related to your own job code or to some module that has been loaded to run your job (see the last sketch after this thread for a starting point). If it's related to a module, you'll need to discuss it with the admin. If it's related to your code, you need to figure out what's wrong with it. You can also open an issue in the UCX repo. – Azeem Jan 22 '22 at 18:34
  • Thanks. I doubt it is on my side or in a module, because 80% of my jobs finished normally with the same (simple) script code over the last year. So the error appears only in a minority of cases, but it is very annoying. Also, I can't find any regularity in when the error appears: sometimes it happens after one hour, sometimes after a whole day. I really don't want to bother you and the community with this off-topic issue here. Thank you again! – multipole Jan 22 '22 at 18:51
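For the first comment above, a minimal sketch of how a user like user2 can end up with hundreds of squeue lines within a short interval. The script name run_case.sh and the count of 300 are hypothetical; each sbatch call creates an independent job with its own JOBID:

```bash
# Submit many independent single-node jobs; each one shows up as its own squeue line.
# run_case.sh is a hypothetical per-case job script; "$i" is passed to it as an argument.
for i in $(seq 1 300); do
    sbatch --partition=partiti_1 --nodes=1 --job-name=jobname2 run_case.sh "$i"
done
```

A SLURM job array (sbatch --array=1-300 ...) would have a similar effect with a single submission; the non-consecutive plain JOBIDs in the excerpt suggest separate submissions rather than an array.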
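For the resource-requirements example in the comments, a minimal sketch of a batch script matching the "10 nodes with 10 CPUs and 64 GB RAM" case; the job name and executable are hypothetical. SLURM keeps such a job in the pending (PD) state until every requested resource is free:

```bash
#!/bin/bash
#SBATCH --job-name=big_job      # hypothetical name
#SBATCH --nodes=10              # stays pending until 10 nodes are available
#SBATCH --ntasks-per-node=1     # one task on each node
#SBATCH --cpus-per-task=10      # 10 CPUs for each task
#SBATCH --mem=64G               # 64 GB of RAM per node

srun ./my_program               # hypothetical executable
```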
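Finally, for the module-vs-own-code question about the UCX errors, a hedged starting point, assuming the cluster uses Environment Modules/Lmod and that the UCX and Open MPI tools are on the PATH:

```bash
module list               # which modules (e.g. compiler, MPI, UCX) are loaded
ucx_info -v               # UCX build/version information, if ucx_info is installed
ompi_info | grep -i ucx   # if using Open MPI: whether it was built against UCX
```

If these point at a centrally installed module rather than your own code, that is the case where contacting the admin is appropriate.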

0 Answers