The HPX getting started tutorial assumes you are using PBS or slurm. These may be quite common in the HPC community, but as a developer I'm more used to the scenario of "here are a couple of machines you can install stuff on".
It's not immediately obvious whether a scheduler like slurm is required to leverage multiple physical machines or just convenient for managing a cluster.
I know you can simulate multiple localities using the -l flag when you run an HPX application (see for example this question). What I want is to run the same application on 2 nodes and have them communicate with each other.
What is the minimum needed to tell HPX:
Here is one other machine with this IP address to which you can send tasks?
Alternatively, what is the minimum slurm configuration needed to reach this stage?
Installing slurm was easy; finding a simple 2-node example was less so, though this link to a podcast may help.
I'm also assuming HPX's parcel port will just work over TCP without installing anything extra (e.g. MPI). Is this correct?
Update
I think I'm getting closer, but I'm still missing something. First, I'm using the hello_world example. Could it be that it is too simple for the 2-node test? I am hoping for output similar to what I get when running 2 localities on the same node:
APP=$HPX/bin/hello_world
$APP --hpx:node 0 --hpx:threads 4 -l2 &
$APP --hpx:node 1 --hpx:threads 4
sample output:
hello world from OS-thread 2 on locality 0
hello world from OS-thread 0 on locality 0
hello world from OS-thread 1 on locality 1
hello world from OS-thread 3 on locality 1
hello world from OS-thread 2 on locality 1
hello world from OS-thread 1 on locality 0
hello world from OS-thread 0 on locality 1
hello world from OS-thread 3 on locality 0
but when I try to run it across the two machines, both processes hang:
$APP --hpx:localities=2 --hpx:agas=$NODE0:7910 --hpx:hpx=$NODE0:7910 --hpx:threads 4 &
ssh $NODE1 $APP --hpx:localities=2 --hpx:agas=$NODE0:7910 --hpx.hpx=$NODE1:7910 --hpx:threads 4
I have opened port 7910 on both machines. The path to $APP is the same on both nodes. I'm not sure how to test whether the second process is talking to the agas server.
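A basic reachability check, independent of HPX, can at least rule out routing and firewall problems. The sketch below assumes bash (for its /dev/tcp redirection) and the coreutils timeout command; the helper name port_open is made up for illustration. It only shows that something is listening on the port, not that the AGAS handshake itself succeeds.

```shell
# Return 0 if a TCP connection to host:port succeeds within 3 seconds.
# Assumes bash (built-in /dev/tcp redirection) and coreutils `timeout`.
# port_open is a hypothetical helper name, not part of HPX.
port_open() {
    local host=$1 port=$2
    # The connection is opened on fd 3 of a throwaway bash process and
    # closed again as soon as that process exits.
    timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}
```

Running port_open $NODE0 7910 on node 1 while the first process is up (and checking its exit status) would show whether the AGAS port is reachable from the second machine.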
If I use "--hpx:debug-agas-log=agas.log" and "--hpx:debug-hpx-log=hpx.log" I get:
>cat hpx.log
(T00000000/----------------.----/----------------) P--------/----------------.---- 14:18.29.042 [0000000000000001] [ERR] created exception: HPX(success)
(T00000000/----------------.----/----------------) P--------/----------------.---- 14:18.29.042 [0000000000000002] [ERR] created exception: HPX(success)
on both machines. I'm not sure how to interpret this.
I've tried a few other options such as --hpx:run-agas-server (which I think is probably implied by using --hpx:agas=).
I also tried
ssh $NODE1 $APP --hpx:nodes="$NODE0 $NODE1" &
$APP --hpx:nodes="$NODE0 $NODE1"
as suggested by the other (now deleted?) answer with no luck.
Update 2
I thought it might be a firewall issue, but even with the firewall disabled nothing seems to happen. I've tried running a trace on the system calls, but there is nothing obvious:
echo "start server on agas master: node0=$NODE0"
strace -o node0.strace $APP \
--hpx:localities=2 --hpx:agas=$NODE0:7910 --hpx:hpx=$NODE0:7910 --hpx:threads 4 &
cat agas.log hpx.log
echo "start worker on slave: node1=$NODE1"
ssh $NODE1 \
strace -o node1.strace $APP \
--hpx:worker --hpx:agas=$NODE0:7910 --hpx.hpx=$NODE1:7910
echo "done"
exit 0
tail of node0.strace:
15:13:31 bind(7, {sa_family=AF_INET, sin_port=htons(7910), sin_addr=inet_addr("172.29.0.160")}, 16) = 0
15:13:31 listen(7, 128) = 0
15:13:31 ioctl(7, FIONBIO, [1]) = 0
15:13:31 accept(7, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
...
15:13:32 mprotect(0x7f12b2bff000, 4096, PROT_NONE) = 0
15:13:32 clone(child_stack=0x7f12b33feef0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f12b33ff9d0, tls=0x7f12b33ff700, child_tidptr=0x7f12b33ff9d0) = 22394
15:13:32 futex(0x7ffe2c5df60c, FUTEX_WAIT_PRIVATE, 1, NULL) = 0
15:13:32 futex(0x7ffe2c5df5e0, FUTEX_WAKE_PRIVATE, 1) = 0
15:13:32 futex(0x7ffe2c5df4b4, FUTEX_WAIT_PRIVATE, 1, NULL
tail of node1.strace:
6829 15:13:32 bind(7, {sa_family=AF_INET, sin_port=htons(7910), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
16829 15:13:32 listen(7, 128) = 0
16829 15:13:32 ioctl(7, FIONBIO, [1]) = 0
16829 15:13:32 accept(7, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
16829 15:13:32 uname({sys="Linux", node="kmlwg-tddamstest3.grpitsrv.com", ...}) = 0
16829 15:13:32 eventfd2(0, O_NONBLOCK|O_CLOEXEC) = 8
16829 15:13:32 epoll_create1(EPOLL_CLOEXEC) = 9
16829 15:13:32 timerfd_create(CLOCK_MONOTONIC, 0x80000 /* TFD_??? */) = 10
16829 15:13:32 epoll_ctl(9, EPOLL_CTL_ADD, 8, {EPOLLIN|EPOLLERR|EPOLLET, {u32=124005464, u64=140359655238744}}) = 0
16829 15:13:32 write(8, "\1\0\0\0\0\0\0\0", 8) = 8
16829 15:13:32 epoll_ctl(9, EPOLL_CTL_ADD, 10, {EPOLLIN|EPOLLERR, {u32=124005476, u64=140359655238756}}) = 0
16829 15:13:32 futex(0x7fa8006f2d24, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fa8006f2d20, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
16830 15:13:32 ) = 0
16829 15:13:32 socket(PF_INET, SOCK_STREAM, IPPROTO_TCP
16830 15:13:32 futex(0x7fa8076432f0, FUTEX_WAKE_PRIVATE, 1) = 0
16829 15:13:32 ) = 11
16830 15:13:32 epoll_wait(9,
16829 15:13:32 epoll_ctl(9, EPOLL_CTL_ADD, 11, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP|EPOLLET, {u32=124362176, u64=140359655595456}}
16830 15:13:32 {{EPOLLIN, {u32=124005464, u64=140359655238744}}}, 128, -1) = 1
16829 15:13:32 ) = 0
16830 15:13:32 epoll_wait(9,
16829 15:13:32 connect(11, {sa_family=AF_INET, sin_port=htons(7910), sin_addr=inet_addr("172.29.0.160")}, 16
16830 15:13:32 {{EPOLLHUP, {u32=124362176, u64=140359655595456}}}, 128, -1) = 1
16830 15:13:32 epoll_wait(9,
If I do an strace -f on the master its child process loops doing something like this:
22050 15:12:46 socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 12
22050 15:12:46 epoll_ctl(5, EPOLL_CTL_ADD, 12, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP|EPOLLET, {u32=2395115776, u64=140516545171712}}) = 0
22041 15:12:46 {{EPOLLHUP, {u32=2395115776, u64=140516545171712}}}, 128, -1) = 1
22050 15:12:46 connect(12, {sa_family=AF_INET, sin_port=htons(7910), sin_addr=inet_addr("127.0.0.1")}, 16
22041 15:12:46 epoll_wait(5,
22050 15:12:46 ) = -1 ECONNREFUSED (Connection refused)
22041 15:12:46 {{EPOLLHUP, {u32=2395115776, u64=140516545171712}}}, 128, -1) = 1
22050 15:12:46 futex(0x7fcc9cc20504, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1703, {1455808366, 471644000}, ffffffff
22041 15:12:46 epoll_wait(5,
22050 15:12:46 ) = -1 ETIMEDOUT (Connection timed out)
22050 15:12:46 futex(0x7fcc9cc204d8, FUTEX_WAKE_PRIVATE, 1) = 0
22050 15:12:46 close(12) = 0
22050 15:12:46 socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 12
22050 15:12:46 epoll_ctl(5, EPOLL_CTL_ADD, 12, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP|EPOLLET, {u32=2395115776, u64=140516545171712}}) = 0
22050 15:12:46 connect(12, {sa_family=AF_INET, sin_port=htons(7910), sin_addr=inet_addr("127.0.0.1")}, 16
22041 15:12:46 {{EPOLLHUP, {u32=2395115776, u64=140516545171712}}}, 128, -1) = 1
22050 15:12:46 ) = -1 ECONNREFUSED (Connection refused)
22041 15:12:46 epoll_wait(5,
22050 15:12:46 futex(0x7fcc9cc20504, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1705, {1455808366, 572608000}, ffffffff
22041 15:12:46 {{EPOLLHUP, {u32=2395115776, u64=140516545171712}}}, 128, -1) = 1
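Traces like these are easier to compare once they are reduced to just the addresses each process binds to and connects to. The following is a rough sketch, assuming strace's usual rendering of IPv4 socket addresses (htons(...) and inet_addr("...")) and a sed that supports -E; the helper name socket_addrs is hypothetical.

```shell
# Print the distinct "call address:port" pairs for every IPv4 bind() and
# connect() found in an strace log. Lines in other formats are ignored.
# socket_addrs is a hypothetical helper name used for illustration.
socket_addrs() {
    sed -nE 's/.*(bind|connect)\(.*htons\(([0-9]+)\).*inet_addr\("([^"]+)"\).*/\1 \3:\2/p' "$1" |
        sort -u
}
```

Running socket_addrs node0.strace and socket_addrs node1.strace side by side makes it quick to see when one process listens on a loopback address while the other is trying to reach it from a different machine.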
Update 3
The astute among you may have noticed that in update 2 I accidentally wrote --hpx.hpx instead of --hpx:hpx. Guess what! Changing that fixed it. So technically the first answer was correct and I'm just dumb. I would have expected an error from the command line options parser, but I guess when you're making a massively parallel runtime you can't have everything :).
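Since the parser does not complain about this kind of typo, a small pre-flight check in the launch script can catch it. This is only a sketch: check_hpx_flags is a made-up helper, and it assumes that every HPX runtime option is spelled with the --hpx: prefix, so anything starting with --hpx. is treated as a mistake.

```shell
# Return non-zero if any argument looks like an HPX option written with a
# dot ("--hpx.hpx=...") instead of a colon ("--hpx:hpx=...").
# check_hpx_flags is a hypothetical helper, not part of HPX.
check_hpx_flags() {
    local arg status=0
    for arg in "$@"; do
        case $arg in
            --hpx.*)
                # Suggest the colon form by swapping the prefix.
                echo "suspicious option: $arg (did you mean --hpx:${arg#--hpx.}?)" >&2
                status=1
                ;;
        esac
    done
    return $status
}
```

Calling check_hpx_flags with the same arguments that are about to be passed to $APP, and aborting the script if it returns non-zero, would have flagged --hpx.hpx=$NODE1:7910 before either process started.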
Thanks for the help everyone.