
I have been trying to set up the Torque scheduler for a small cluster. I followed the server configuration steps at http://docs.adaptivecomputing.com/torque/archive/3-0-2/1.2configuring_torque_on_server.php

However, when I run

qterm -t quick

I get the following error:

$ sudo qterm -t quick
Unable to communicate with Terra(192.168.1.25)
Cannot connect to specified server host 'Terra'.
qterm: could not connect to server '' (111) Connection refused 

The server itself starts just fine. However, when I submit a job that should run on multiple nodes, such as

qsub -l nodes=2:ppn=4 /home/user/scripts/someScript

it prints out something like

7.Terra

where Terra is the name of the head node, which is also a compute node in the cluster. That part isn't the problem. The problem is that the job never runs, and it doesn't produce any output anywhere :/
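Regarding the `(111) Connection refused` from qterm: that error code usually means nothing accepted the TCP connection on the target port. A minimal sketch for checking this from either machine is below. `port_open` is a hypothetical helper, and 15001 is only assumed here as the usual default pbs_server port; 15002 and 15003 are the MOM ports that pbsnodes reports.

```shell
#!/bin/bash
# Hypothetical helper: return 0 if a TCP connection to host:port
# succeeds within 2 seconds, non-zero otherwise.
port_open() {
    local host=$1 port=$2
    timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Assumed ports: 15001 (pbs_server default), 15002/15003 (pbs_mom).
for port in 15001 15002 15003; do
    if port_open Terra "$port"; then
        echo "Terra:$port open"
    else
        echo "Terra:$port closed or unreachable"
    fi
done
```

If 15001 shows as closed while pbs_server is supposedly running, the server may be listening on a different port or interface than the client expects.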

The torque server log: https://ptpb.pw/EaKo

The terra node log: https://ptpb.pw/9w5M

and the Marte log: https://ptpb.pw/o4PT

I can get a job to run with a PBS script, but only on one node:

#!/bin/bash
#PBS -l pmem=1gb,nodes=1:ppn=4
#PBS -m abe
cd Documents/
wc -l largeTest.csv
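For comparison, the two-node variant I would expect to work looks roughly like the sketch below. It only reports which hosts the job was given. `summarize_nodefile` is a hypothetical helper, not part of Torque; `PBS_NODEFILE` is the variable Torque sets inside a running job, pointing at a file with one line per allocated slot.

```shell
#!/bin/bash
#PBS -l nodes=2:ppn=4
#PBS -j oe

# Hypothetical helper: print how many slots each host received,
# given a nodefile with one hostname per allocated slot.
summarize_nodefile() {
    sort "$1" | uniq -c | awk '{ print $2 ": " $1 " slot(s)" }'
}

if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    summarize_nodefile "$PBS_NODEFILE"
else
    # Outside a Torque job the variable is not set.
    echo "PBS_NODEFILE is not set; not running under Torque" >&2
fi
```

If the job ever starts, the joined output file should show whether both Terra and Marte were actually allocated.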

Here is the output of qstat after submitting a job:

Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
16.Terra                   testPerformance  justin                 0 R batch

The output of pbsnodes -a:

Terra
 state = free
 power_state = Running
 np = 4
 properties = Tower
 ntype = cluster
 status = opsys=linux,uname=Linux Terra 4.17.14-arch1-1-ARCH #1 SMP PREEMPT Thu Aug 9 11:56:50 UTC 2018 x86_64,sessions=11525 22029,nsessions=2,nusers=1,idletime=57964,totmem=8111556kb,availmem=7539284kb,physmem=8111556kb,ncpus=4,loadave=0.00,gres=,netload=30570521372,state=free,varattr= ,cpuclock=Fixed,macaddr=e0:3f:49:44:72:20,version=6.1.1.1,rectime=1534937388,jobs=
 mom_service_port = 15002
 mom_manager_port = 15003
 gpus = 1

Marte
 state = free
 power_state = Running
 np = 4
 properties = NFSServer
 ntype = cluster
 status = opsys=linux,uname=Linux Marte 4.18.1-arch1-1-ARCH #1 SMP PREEMPT Wed Aug 15 21:11:55 UTC 2018 x86_64,sessions=366 556 563,nsessions=3,nusers=2,idletime=58140,totmem=7043404kb,availmem=6703808kb,physmem=7043404kb,ncpus=4,loadave=0.02,gres=,netload=36500663511,state=free,varattr= ,cpuclock=Fixed,macaddr=c8:5b:76:4a:65:91,version=6.1.1.1,rectime=1534937359,jobs=
 mom_service_port = 15002
 mom_manager_port = 15003

and the contents of /var/spool/torque/server_priv/nodes:

Terra np=4 gpus=1 Tower
Marte np=4 NFSServer
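As a sanity check that this file offers enough slots for a `nodes=2:ppn=4` request (8 in total), here is a small sketch that sums the `np=` values. `total_np` is a hypothetical helper; the path is the one given above, and the file is typically only readable by root.

```shell
#!/bin/bash
# Hypothetical helper: sum the np= values in a Torque nodes file,
# skipping comment lines.
total_np() {
    awk '!/^#/ {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^np=[0-9]+$/) sum += substr($i, 4)
    } END { print sum + 0 }' "$1"
}

nodes_file=/var/spool/torque/server_priv/nodes
if [ -r "$nodes_file" ]; then
    echo "total slots: $(total_np "$nodes_file")"
fi
```

For the two entries above the sum is 8, so the file itself does not appear to be what limits the request.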

Edit: Here are the most recent logs as well

Mom Log for Node: https://ptpb.pw/DhKi

Mom Log for head node: https://ptpb.pw/MTlD

and the server log: https://ptpb.pw/HPkE

j-money
  • `7.Terra` means a job has been accepted to the queue run by PBS server `Terra` and was assigned number `7`. Your log files contain plenty of other errors you should be looking into. It is hard to pinpoint the root cause of the problem from your description. Do familiarize yourself with `pbsnodes` and `qstat` commands and post your updates. – Dima Chubarov Aug 21 '18 at 15:08
  • I am wondering if you could point me in the right direction. In the nodes, the most prevalent error seems to be the "Could not contact server...." but the documentation I have been using seems sparse when it comes to troubleshooting :/ – j-money Aug 21 '18 at 16:33
  • Could you please post the output of `qstat` and `pbsnodes -a`, and the contents of `$TORQUE_HOME/server_priv/nodes`? – Dima Chubarov Aug 22 '18 at 11:49
  • @DmitriChubarov I've updated the post. This, I think, is where I struggle: according to pbsnodes, both nodes are up and running, and per the documentation the `.../server_priv/nodes` file shouldn't have to be too complex when specifying available resources – j-money Aug 22 '18 at 18:32
  • I see you used the documentation for Torque 3.0.2, yet you installed Torque 6.1.1. The installation directions changed quite a bit in the newer versions. The 6.1.1 docs are at http://docs.adaptivecomputing.com/torque/6-1-1/adminGuide/help.htm. – Paul Oct 03 '18 at 15:48
