
I now have a Ray cluster working on EC2 (Ubuntu 16.04) with a c4.8xlarge master node and one identical worker. I wanted to check whether the tasks were actually being run in parallel, so I ran tests timing increasing numbers (n) of the same 9-second task. Since each instance has 18 physical cores, I expected the job to take about 9 s for n <= 35 (assuming one core is taken up by cluster management) and then either a fault, or an increase to about 18 s once Ray started using the 36 hyperthreaded vCPUs per node.
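For reference, a minimal sketch of the kind of timing test described above; the task body, the n values, and the `ray.init(address="auto")` call (available in recent Ray versions) are assumptions, not the original code:

```python
import time
import ray

# Attach to the already-running cluster from the master node.
ray.init(address="auto")

@ray.remote
def nine_second_task():
    # Stand-in for the real ~9-second workload.
    time.sleep(9)

def time_n_tasks(n):
    """Submit n copies of the task and wait for all of them to finish."""
    start = time.time()
    ray.get([nine_second_task.remote() for _ in range(n)])
    return time.time() - start

for n in (8, 18, 28, 38):
    print(f"n={n}: {time_n_tasks(n):.1f} s")
```

If the tasks really run in parallel, the wall time should stay close to 9 s until n exceeds the number of CPU slots Ray has registered across the cluster.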

Instead, the cluster handled only up to 14 tasks in parallel; beyond that the execution time jumped to 40 s and kept increasing with n. When I tried a c4.xlarge master (4 CPUs), the times were directly proportional to n, i.e. the tasks were running serially. So I surmise that the master actually requires 4 CPUs for the system, and that the worker node is not being used at all. However, if I add a second worker, the times for n > 14 are about 40 s less than without it. I also tried a target_utilization_factor less than 1.0, but that made no difference.
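Whether the worker's CPUs ever registered with the master can be checked directly from the Python API; a minimal sketch, assuming a recent Ray version (the expected numbers assume c4.8xlarge nodes with 36 vCPUs each):

```python
import ray

# Attach to the running cluster from the master node.
ray.init(address="auto")

# Total resources the head node believes the cluster has.
# With a c4.8xlarge master plus one c4.8xlarge worker this should
# report about 72 CPUs (Ray counts vCPUs); a value of ~36 would mean
# the worker never joined the cluster.
print(ray.cluster_resources())

# Resources currently free; run this while a batch of tasks is
# executing to see how many CPU slots are actually occupied.
print(ray.available_resources())
```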

There were no reported errors, but I did notice that the ray-node-status for the worker in the EC2 Instances console was "update-failed". Is this significant? Can anyone enlighten me about this behaviour?

[Screenshot: ray-timeline trace]

Nick Mint
  • To see whether things are being scheduled in parallel, I'd suggest looking at the Ray timeline. Run `ray timeline` on the command line (on one of the nodes) and then load the resulting JSON file in chrome://tracing in the Chrome web browser. – Robert Nishihara May 13 '19 at 05:44 (a Python-level sketch of this command appears after these comments)
  • What a handy trace tool! The screenshot shows the result of running n=[8,18,28,38]. There are 36 workers, so each should be one real CPU - I cannot tell which belong to the master and which to the worker. However, only in the first test do all workers run at the ~9 s. Do you think this is a resource problem: the task may be quite hungry for RAM, and if each worker is competing for common memory, could that be causing a bottleneck? I was assuming that a compute-optimized instance would have more than enough RAM for each worker. – Nick Mint May 14 '19 at 04:41
  • I increased the EBS volume size to 100 GB, which improved things but still didn't allow all tasks to run in parallel. I then increased the worker nodes to 3 and ran n=[18,36,54,72], giving times of ~[11, 22, 27, 36] s; however, the trace showed that this was still only using 36 CPUs! I removed the `with tmp.TemporaryDirectory() as path:` block (see the code in https://stackoverflow.com/questions/55912710/error-while-initializing-ray-on-an-ec2-master-node), but that made no difference. I also tried increasing EBS to 200 GB: again, no real difference. – Nick Mint May 14 '19 at 22:48
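The `ray timeline` command suggested in the first comment can also be driven from the Python API; a minimal sketch, assuming a Ray version that exposes `ray.timeline()`:

```python
import ray

ray.init(address="auto")

# ... run the workload to be profiled here ...

# Dump the profiling events collected so far to a Chrome-tracing JSON
# file, then open chrome://tracing in Chrome and load that file to see
# one row per Ray worker process.
ray.timeline(filename="/tmp/ray-timeline.json")
```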

1 Answer


The cluster did not appear to be using the workers, so the trace was showing only the master's 18 actual CPUs dealing with the tasks. The monitor log (`ray exec ray_conf.yaml 'tail -n 100 -f /tmp/ray/session_/logs/monitor'`) showed that the "update-failed" status is indeed significant: the setup commands, run by Ray's updater.py, were failing on the worker nodes. Specifically, it was the attempt to install the build-essential compiler package on them that, presumably, exceeded the workers' memory allocation. I had only added that step to suppress a "setproctitle" installation warning, which I now understand can be safely ignored anyway.

Nick Mint