I have been attempting to setup the torque scheduler for a small cluster. I followed the steps to setup the scheduler from http://docs.adaptivecomputing.com/torque/archive/3-0-2/1.2configuring_torque_on_server.php
However when i attempt
qterm -t quick
I get the following error
$ sudo qterm -t quick
Unable to communicate with Terra(192.168.1.25)
Cannot connect to specified server host 'Terra'.
qterm: could not connect to server '' (111) Connection refused
but the server starts just fine. However when I attempt to run a command that runs on multiple nodes such as
qsub -l nodes=2:ppn=4 /home/user/scripts/someScript
it prints out somethign like
7.Terra
where Terra is the name of the head node, but is also a node in the cluster. This isn't the problem. The problem is that it does not run. nor does it have any output anywhere :/
The torque server log: https://ptpb.pw/EaKo
The terra node log: https://ptpb.pw/9w5M
and the Marte log: https://ptpb.pw/o4PT
I can get it to run with a pbs script but only with one node....
#!/bin/bash
#PBS -l pmem=1gb,nodes=1:ppn=4
#PBS -m abe
cd Documents/
wc -l largeTest.csv
Here is the ouput of qstat
after submitting a job
Job ID Name User Time Use S
Queue
------------------------- ---------------- --------------- -------- - -----
16.Terra testPerformance justin 0 R batch
the output of pbsnodes -a
Terra
state = free
power_state = Running
np = 4
properties = Tower
ntype = cluster
status = opsys=linux,uname=Linux Terra 4.17.14-arch1-1-ARCH #1 SMP PREEMPT Thu Aug 9 11:56:50 UTC 2018 x86_64,sessions=11525 22029,nsessions=2,nusers=1,idletime=57964,totmem=8111556kb,availmem=7539284kb,physmem=8111556kb,ncpus=4,loadave=0.00,gres=,netload=30570521372,state=free,varattr= ,cpuclock=Fixed,macaddr=e0:3f:49:44:72:20,version=6.1.1.1,rectime=1534937388,jobs=
mom_service_port = 15002
mom_manager_port = 15003
gpus = 1
Marte
state = free
power_state = Running
np = 4
properties = NFSServer
ntype = cluster
status = opsys=linux,uname=Linux Marte 4.18.1-arch1-1-ARCH #1 SMP PREEMPT Wed Aug 15 21:11:55 UTC 2018 x86_64,sessions=366 556 563,nsessions=3,nusers=2,idletime=58140,totmem=7043404kb,availmem=6703808kb,physmem=7043404kb,ncpus=4,loadave=0.02,gres=,netload=36500663511,state=free,varattr= ,cpuclock=Fixed,macaddr=c8:5b:76:4a:65:91,version=6.1.1.1,rectime=1534937359,jobs=
mom_service_port = 15002
mom_manager_port = 15003
and the /var/spool/torque/server_priv/nodes
Terra np=4 gpus=1 Tower
Marte np=4 NFSServer
Edit: Here are the most recent logs as well
Mom Log for Node: https://ptpb.pw/DhKi
Mom Log for head node: https://ptpb.pw/MTlD
and the server log: https://ptpb.pw/HPkE