
This question is taken directly from an issue I have opened on the Ray repository; I am posting it here as well in the hope of getting more exposure.

I have seen similar questions in past issues relating to older versions of Ray and similar problems, but since they did not offer a clear setup nor a clear solution, generally just a hacky "by adding this flag it works", I decided to post this question and try to explain every small step clearly: how I am setting up Ray, where the Docker files are available, the specific commands I run and the outputs I receive, plus the hints I have managed to collect.

Hoping that this makes for a question worth asking, here goes.

What is the problem?

Even though all the cluster nodes appear in the dashboard and show no errors, executing Ray-related Python code on the head node makes only the head node available, while running it on the worker nodes starts outputting:

WARNING services.py:211 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?

Ray version and other system information (Python version, TensorFlow version, OS): Python 3.6.5, TensorFlow not installed at this time, Ubuntu 18.04

Reproduction

As per the title, I am trying to set up Ray on a custom cluster using Docker containers. The idea is to start getting my feet wet on a small cluster and then, once I have learned how to use the library, to deploy it on a SLURM cluster (I have already seen there is a small tutorial on that).

My small setup is detailed in a repository I have created just for this purpose: it basically uses the Docker images provided in the Docker tutorial from the documentation and then installs other tools, such as byobu, mainly for debugging purposes.

After building the ServerDockerfile I launch the container as follows:

docker run --shm-size=16GB -t --tty --interactive --network host experimenting_on_ray_server

From within the container then I launch ray with:

ray start --head

This will output:

2020-04-15 20:08:05,148 INFO scripts.py:357 -- Using IP address xxx.xxx.xxx.xxx for this node.
2020-04-15 20:08:05,151 INFO resource_spec.py:212 -- Starting Ray with 122.61 GiB memory available for workers and up to 56.56 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-04-15 20:08:05,629 INFO services.py:1148 -- View the Ray dashboard at localhost:8265
2020-04-15 20:08:05,633 WARNING services.py:1470 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 17179869184 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
2020-04-15 20:08:05,669 INFO scripts.py:387 -- 
Started Ray on this node. You can add additional nodes to the cluster by calling

    ray start --address='xxx.xxx.xxx.xxx:53158' --redis-password='5241590000000000'

from the node you wish to add. You can connect a driver to the cluster from Python by running

    import ray
    ray.init(address='auto', redis_password='5241590000000000')

If you have trouble connecting from a different machine, check that your firewall is configured properly. If you wish to terminate the processes that have been started, run

    ray stop

Here xxx.xxx.xxx.xxx is the public IP of this machine, since the Docker container has been started with the --network host option. I cannot figure out why the shared-memory warning appears: the Ray Docker tutorial in the documentation says to replace <shm-size> with a limit appropriate for your system, for example 512M or 2G, and here I am using 16GB. How much should be enough?
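For reference, this is how I check the shared-memory situation from inside the container (a quick sanity check, nothing Ray-specific; the exact values will obviously differ on other systems):

    # size and current usage of the shared-memory mount the object store wants to use
    df -h /dev/shm
    # kernel limit on the size of a single shared-memory segment, in bytes
    cat /proc/sys/kernel/shmmax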

At this point, via SSH port forwarding, I can see that the dashboard is online and shows the following:

Dashboard, first step
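The port forwarding itself is nothing special; it is along these lines, where user and the head node address are placeholders and 8265 is the dashboard port reported by ray start above:

    # forward the Ray dashboard port (8265) from the head machine to my local machine
    ssh -L 8265:localhost:8265 user@xxx.xxx.xxx.xxx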

Since it all seems nominal, I proceed to build the ClientDockerfile, which at this point is to all intents and purposes identical to the server one. Then I start it by running:

docker run --shm-size=16GB -t --tty --interactive --network host experimenting_on_ray_client

Now I can run the command that the head node printed to attach another node to the cluster, so I execute:

ray start --address='xxx.xxx.xxx.xxx:53158' --redis-password='5241590000000000'

Where again, xxx.xxx.xxx.xxx is the public IP of the machine where I am running the head Docker container with the --network host flag.

This command seems to run successfully: if I go to the dashboard now, I can see the second node available. Here xxx.xxx.xxx.xxx is the IP of the head node while yyy.yyy.yyy.yyy is the IP of the worker node.

Dashboard, second step
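As an additional sanity check, I also look at the Ray processes running inside the worker container; this is just a rough sketch using the process names that appear in Ray's own messages and logs (raylet, plasma_store_server), so nothing here is authoritative:

    # list the Ray-related processes inside the worker container
    # (on a worker node only the raylet and the plasma store are expected; Redis runs on the head node)
    ps aux | egrep 'raylet|plasma_store_server|redis-server' | grep -v grep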

Finally, I can try to execute some Ray code! So, on the head node, I run the following code from the documentation in a Python shell:

import ray
ray.init(address='auto', redis_password='5241590000000000')

import time

@ray.remote
def f():
    time.sleep(0.01)
    return ray.services.get_node_ip_address()

# Get a list of the IP addresses of the nodes that have joined the cluster.
set(ray.get([f.remote() for _ in range(1000)]))

Outputs:

{'xxx.xxx.xxx.xxx'}

But to my understanding, the expected output was:

{'xxx.xxx.xxx.xxx', 'yyy.yyy.yyy.yyy'}

If I run the very same code on the worker node, I get a very different output (or more like, a lack of any output). After executing the first two lines:

import ray
ray.init(address='auto', redis_password='5241590000000000')

I get:

2020-04-15 20:29:53,481 WARNING worker.py:785 -- When connecting to an existing cluster, _internal_config must match the cluster's _internal_config.
2020-04-15 20:29:53,486 WARNING services.py:211 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2020-04-15 20:29:54,491 WARNING services.py:211 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2020-04-15 20:29:55,496 WARNING services.py:211 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2020-04-15 20:29:56,500 WARNING services.py:211 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2020-04-15 20:29:57,505 WARNING services.py:211 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/ray/python/ray/worker.py", line 802, in init
    connect_only=True)
  File "/ray/python/ray/node.py", line 126, in __init__
    redis_password=self.redis_password)
  File "/ray/python/ray/services.py", line 204, in get_address_info_from_redis
    redis_address, node_ip_address, redis_password=redis_password)
  File "/ray/python/ray/services.py", line 187, in get_address_info_from_redis_helper
    "Redis has started but no raylets have registered yet.")
RuntimeError: Redis has started but no raylets have registered yet.

No additional information appears in the dashboard, where everything keeps looking nominal. I have reproduced the issue numerous times, hoping all along that I had simply misconfigured something in the local network or in the two Docker images. The two Docker containers run on two different machines in the same local network, that is, with IPs that look like same.same.same.different.

I have also tried to reproduce the error by running the two Docker containers on the same machine: the issue appears in this setting as well.
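For completeness, the basic connectivity check I run between the two containers looks roughly like this (a sketch: xxx.xxx.xxx.xxx is again the head node's IP, 53158 is the Redis port printed by ray start --head above, and every other port reported by netstat should be probed in the same way):

    # on the head node container: list the listening ports and the processes that own them
    netstat -pnltu
    # from the worker node container: check that a given port is reachable on the head node
    nc -zv xxx.xxx.xxx.xxx 53158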

What other information may I provide that can be of help?

Update 1: found a new relevant file

While searching for the raylet error log file at the path /tmp/ray/session_latest/logs/raylet.err, which was empty on both the server and the client, both before and after executing the Python code, I noticed another error log that might be of interest for the current issue.

The file is at the path /tmp/raylet.595a989643d2.invalid-user.log.WARNING.20200416-181435.22 and contains the following:

Log file created at: 2020/04/16 18:14:35
Running on machine: 595a989643d2
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
W0416 18:14:35.525002    22 node_manager.cc:574] Received NodeRemoved callback for an unknown client a4e873ae0f72e58105e16c664b3acdda83f80553.

Update 2: the raylet.out files are not empty

Even though the raylet.err files are empty on both the client and server, the raylet.out files are not. Here's their content.

Server raylet.out file
I0417 05:50:03.973958    38 stats.h:62] Succeeded to initialize stats: exporter address is 127.0.0.1:8888
I0417 05:50:03.975106    38 redis_client.cc:141] RedisClient connected.
I0417 05:50:03.983482    38 redis_gcs_client.cc:84] RedisGcsClient Connected.
I0417 05:50:03.984493    38 service_based_gcs_client.cc:63] ServiceBasedGcsClient Connected.
I0417 05:50:03.985126    38 grpc_server.cc:64] ObjectManager server started, listening on port 42295.
I0417 05:50:03.989686    38 grpc_server.cc:64] NodeManager server started, listening on port 44049.
Client raylet.out file

Here is a subset of the file; it contained hundreds of rows like these:

I0417 05:50:32.865006    23 node_manager.cc:734] [HeartbeatAdded]: received heartbeat from unknown client id 93a2294c6c338410485494864268d8eeeaf2ecc5
I0417 05:50:32.965395    23 node_manager.cc:734] [HeartbeatAdded]: received heartbeat from unknown client id 93a2294c6c338410485494864268d8eeeaf2ecc5
  • Seems like the raylet inside your worker container is not registered with Redis in the head node container. Can you check the raylet logs inside the worker node container and see if there are any errors? It should be in `/tmp/ray/session_latest/logs/raylet.err`. You can probably also check the raylet.out output. – Sang Apr 16 '20 at 17:52
  • So, the file at the path you have given exists but is empty on both the client and the server, both before and after executing the Python code in question. While searching for that file, though, I found another log in the `/tmp/` folder that might relate to the issue (added to the updated question). I have verified that it is created only on the child node after connecting to the cluster, even before running the Python code. – Luca Cappelletti Apr 16 '20 at 18:19
  • So, when new nodes are added to the cluster, their raylet (a process running in every node of the cluster) should be registered to Redis, which is the global metadata store of the Ray cluster. Seems like the problem is that there was an issue in this process. Based on the fact that your raylet receives a NodeRemoved request, I assume the raylet in the head node sends some requests because of some problems occurring there. Would you mind also showing me the raylet logs in the head node? – Sang Apr 16 '20 at 23:26
  • Please take a look at this issue. It can be a firewall problem. https://github.com/ray-project/ray/issues/8054 – Sang Apr 17 '20 at 00:16
  • Hi! Both the `raylet.err` logs (in client and server) are empty. I have added part of the `raylet.out` files (to avoid cluttering the question). I don't believe that a firewall problem applies in this instance, as I have also reproduced the issue when running the two Docker containers on the same machine. – Luca Cappelletti Apr 17 '20 at 05:51
  • I really think the problem is that some of the ports that are supposed to be open are blocked somehow, and Ray couldn't catch that problem. As a result, the raylet is not registered to Redis. Can you make sure all ports are open and reachable from the head node and the worker nodes? You will be able to specify the port range in the next version of Ray (0.8.5), but let's test this way first. – Sang Apr 17 '20 at 16:20
  • The reason why I am asking you if all ports are open is that raylet worker processes communicate with each other directly through their open ports. Their port range can vary (it can be like 50000+). You will be able to specify a port range in the latest master https://github.com/ray-project/ray/commit/9f751ff8c4d5d3c4282200ca5dfef9b9e5ff60e1, but it is not available in 0.8.4 yet. – Sang Apr 17 '20 at 16:22
  • I could run a port scan between the two Docker containers but, if possible, I'd like to target the expected raylet port. Where can I find (or specify) the port to use for it? – Luca Cappelletti Apr 17 '20 at 16:59
  • I believe they are randomly assigned, but you can specify them with these options of the command-line API (https://ray.readthedocs.io/en/latest/package-ref.html#the-ray-command-line-api): --node-manager for the raylet, --object-manager for the plasma store, --redis-port for Redis (Redis is called GCS). But it is currently not possible to assign the worker ports. The commit I sent you above will allow you to specify port ranges, but it is only available in the latest master for now. – Sang Apr 17 '20 at 17:53
  • May I ask if you have a Gitter account or something of the sort? I feel like we could get to the bottom of this issue faster with a more direct chat. – Luca Cappelletti Apr 17 '20 at 19:18
  • In the meantime, I have run `netstat -pnltu` and obtained a list of all the services and the ports associated with them. I have then tried to run `nc -zv xxx.xxx.xxx.xxx port` for each port, where the xs are the IP of the server. All requests get answered. – Luca Cappelletti Apr 17 '20 at 20:20
  • Sorry for getting back late. 1. For the warning message you received (`WARNING: The object store is using /tmp instead of /dev/shm...`), it can be because 16GB is too big for your shm memory size. You can check the max shm memory size with this command on Linux (`cat /proc/sys/kernel/shmmax`, source: https://linoxide.com/how-tos/command-to-show-shared-memory-settings/). 2. Can you see if 2 Docker containers on the same machine work when you use the localhost network for `ray start`? (e.g., in the worker container, `ray start --address='127.0.0.1:53158'`). – Sang Apr 19 '20 at 22:57
  • The Docker containers work when both share the machine's local (host) network, but fail when one of them does not share the host network. The ping to the various services works in both situations, though. – Luca Cappelletti Apr 22 '20 at 12:38
  • Hmm, then I really believe there are some hidden problems when node-to-node communication happens through Docker. It is kind of hard to narrow down the problem through Stack Overflow alone. – Sang Apr 24 '20 at 00:34
  • We can talk more about this issue, for instance on Gitter or Telegram or any other medium you'd prefer, to have a faster back and forth. – Luca Cappelletti Apr 24 '20 at 08:00
  • Why don't you try asking this in the Ray public Slack? I am also in there, and there might be people who can answer this question better. – Sang May 07 '20 at 05:15
  • Can you send me the url for the slack channel? – Luca Cappelletti May 07 '20 at 06:46
  • You might need to request an invitation. You can find it in the Ray GitHub repository (it is at the bottom of README.md). – Sang May 07 '20 at 21:45

0 Answers