This question is taken directly from an issue I opened on the Ray repository; I am posting it here as well in the hope of getting more exposure.
I have seen similar questions in past issues concerning older versions of Ray and similar problems, but they offered neither a clear setup nor a clear solution, generally just a hacky "by adding this flag it works". So I decided to post this question and explain every small step clearly: how I am setting up Ray, where the Dockerfiles are available, the specific commands I run and the outputs I receive, along with the hints I have managed to collect.
Hoping that this makes for a question worth asking, here it goes.
What is the problem?
Even though all the cluster nodes are available in the dashboard and do not show any error, executing Ray-related Python code on the head node shows only the head node as available, while on the worker nodes it starts outputting:
WARNING services.py:211 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
Ray version and other system information (Python version, TensorFlow version, OS): 3.6.5, None (not installed at this time), Ubuntu 18.04
Reproduction
As per the title, I am trying to set up Ray on a custom cluster using Docker containers. The idea is to get my feet wet on a small cluster first and, once I have learned how to use the library, to deploy it on a SLURM cluster (I have already seen there is a small tutorial on that).
My small setup is detailed in a repository I created just for this: it basically uses the Docker images provided in the documentation's tutorial, and then installs other tools such as byobu, mainly for debugging purposes.
After building the ServerDockerfile, I launch the container as follows:
docker run --shm-size=16GB --tty --interactive --network host experimenting_on_ray_server
From within the container I then launch Ray with:
ray start --head
This will output:
2020-04-15 20:08:05,148 INFO scripts.py:357 -- Using IP address xxx.xxx.xxx.xxx for this node.
2020-04-15 20:08:05,151 INFO resource_spec.py:212 -- Starting Ray with 122.61 GiB memory available for workers and up to 56.56 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-04-15 20:08:05,629 INFO services.py:1148 -- View the Ray dashboard at localhost:8265
2020-04-15 20:08:05,633 WARNING services.py:1470 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 17179869184 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
2020-04-15 20:08:05,669 INFO scripts.py:387 --
Started Ray on this node. You can add additional nodes to the cluster by calling
ray start --address='xxx.xxx.xxx.xxx:53158' --redis-password='5241590000000000'
from the node you wish to add. You can connect a driver to the cluster from Python by running
import ray
ray.init(address='auto', redis_password='5241590000000000')
If you have trouble connecting from a different machine, check that your firewall is configured properly. If you wish to terminate the processes that have been started, run
ray stop
Here xxx.xxx.xxx.xxx is the public IP of this machine, since the Docker container was started with the --network host option. I cannot figure out why the warning appears: the Ray Docker tutorial in the documentation states "Replace <shm-size> with a limit appropriate for your system, for example 512M or 2G", and here I am using 16GB. How much would be enough? (Note that 17179869184 bytes is exactly 16 GiB, so the --shm-size flag does seem to be honored; the startup log above also mentions up to 56.56 GiB for objects, which exceeds that limit.)
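As a sanity check (this is just a plain-Python sketch, not something Ray provides), the shared-memory limit the container actually received can be inspected from inside it with shutil.disk_usage:

```python
import shutil

# Inspect how much shared memory the container actually has.
# This is what Ray compares against the requested object store size
# before falling back to /tmp.
usage = shutil.disk_usage("/dev/shm")
print(f"/dev/shm total: {usage.total / 1024**3:.2f} GiB, "
      f"free: {usage.free / 1024**3:.2f} GiB")
```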
At this point, via SSH port forwarding, I can verify that the dashboard is online and shows the head node without any errors.
Since it all seems nominal, I proceed to build the ClientDockerfile, which at this point is to all intents and purposes identical to the server one. Then I start it by running:
docker run --shm-size=16GB --tty --interactive --network host experimenting_on_ray_client
Now I can run the command provided by the head node to attach another node to the cluster. Hence I execute:
ray start --address='xxx.xxx.xxx.xxx:53158' --redis-password='5241590000000000'
where, again, xxx.xxx.xxx.xxx is the public IP of the machine where I am running the head Docker container with the --network host flag.
This command seems to run successfully: if I go to the dashboard now, I can see the second node available. Here xxx.xxx.xxx.xxx is the IP of the head node, while yyy.yyy.yyy.yyy is the IP of the worker node.
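Before running any Ray code, I can also verify from the worker container that the head node's Redis port is reachable at all. This is a plain-socket sketch (not part of Ray's API); the host and port are the ones printed by ray start --head above:

```python
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# From the worker container, e.g.:
# can_connect("xxx.xxx.xxx.xxx", 53158)
```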
Finally, I can try to execute some Ray code! On the head node, I run the example code from the documentation in a Python shell:
import ray
ray.init(address='auto', redis_password='5241590000000000')

import time

@ray.remote
def f():
    time.sleep(0.01)
    return ray.services.get_node_ip_address()

# Get a list of the IP addresses of the nodes that have joined the cluster.
set(ray.get([f.remote() for _ in range(1000)]))
Outputs:
{'xxx.xxx.xxx.xxx'}
But, to my understanding, we should be expecting:
{'xxx.xxx.xxx.xxx', 'yyy.yyy.yyy.yyy'}
If I run the very same code on the worker node, I get a very different result (or rather, no result at all). After executing the first two lines:
import ray
ray.init(address='auto', redis_password='5241590000000000')
I get:
2020-04-15 20:29:53,481 WARNING worker.py:785 -- When connecting to an existing cluster, _internal_config must match the cluster's _internal_config.
2020-04-15 20:29:53,486 WARNING services.py:211 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2020-04-15 20:29:54,491 WARNING services.py:211 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2020-04-15 20:29:55,496 WARNING services.py:211 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2020-04-15 20:29:56,500 WARNING services.py:211 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2020-04-15 20:29:57,505 WARNING services.py:211 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/ray/python/ray/worker.py", line 802, in init
    connect_only=True)
  File "/ray/python/ray/node.py", line 126, in __init__
    redis_password=self.redis_password)
  File "/ray/python/ray/services.py", line 204, in get_address_info_from_redis
    redis_address, node_ip_address, redis_password=redis_password)
  File "/ray/python/ray/services.py", line 187, in get_address_info_from_redis_helper
    "Redis has started but no raylets have registered yet.")
RuntimeError: Redis has started but no raylets have registered yet.
No additional information is provided in the dashboard, where everything keeps looking nominal. I have reproduced the issue numerous times, hoping each time that I had simply misconfigured the local network or the two Docker images. The two Docker containers run on two different machines in the same local network, i.e. with IPs of the form same.same.same.different.
I have also tried to reproduce the error by running the two Docker containers on the same machine. The issue appears in this setting as well.
What other information may I provide that can be of help?
Update 1: found new relevant file.
While searching for the raylet error log at /tmp/ray/session_latest/logs/raylet.err, which was empty on both the server and the client, both before and after executing the Python code, I noticed another error log that might be of interest for the current issue. The file sits at /tmp/raylet.595a989643d2.invalid-user.log.WARNING.20200416-181435.22 and contains the following:
Log file created at: 2020/04/16 18:14:35
Running on machine: 595a989643d2
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
W0416 18:14:35.525002 22 node_manager.cc:574] Received NodeRemoved callback for an unknown client a4e873ae0f72e58105e16c664b3acdda83f80553.
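As a convenience while hunting for these files (just a plain-Python sketch over the paths mentioned above, nothing Ray-specific), the log files and their sizes can be listed to spot which ones are empty:

```python
import glob
import os

# List Ray's log files inside the container and show their sizes,
# to spot at a glance which ones are empty and which contain output.
# Adjust the patterns if your session directory differs.
patterns = ["/tmp/ray/session_latest/logs/*", "/tmp/raylet.*"]
for pattern in patterns:
    for path in sorted(glob.glob(pattern)):
        if os.path.isfile(path):
            print(f"{path}: {os.path.getsize(path)} bytes")
```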
Update 2: the raylet.out files are not empty
Even though the raylet.err files are empty on both the client and the server, the raylet.out files are not. Here is their content.
Server raylet.out file:
I0417 05:50:03.973958 38 stats.h:62] Succeeded to initialize stats: exporter address is 127.0.0.1:8888
I0417 05:50:03.975106 38 redis_client.cc:141] RedisClient connected.
I0417 05:50:03.983482 38 redis_gcs_client.cc:84] RedisGcsClient Connected.
I0417 05:50:03.984493 38 service_based_gcs_client.cc:63] ServiceBasedGcsClient Connected.
I0417 05:50:03.985126 38 grpc_server.cc:64] ObjectManager server started, listening on port 42295.
I0417 05:50:03.989686 38 grpc_server.cc:64] NodeManager server started, listening on port 44049.
Client raylet.out file:
Here is a subset of the file; it contained hundreds of rows such as these:
I0417 05:50:32.865006 23 node_manager.cc:734] [HeartbeatAdded]: received heartbeat from unknown client id 93a2294c6c338410485494864268d8eeeaf2ecc5
I0417 05:50:32.965395 23 node_manager.cc:734] [HeartbeatAdded]: received heartbeat from unknown client id 93a2294c6c338410485494864268d8eeeaf2ecc5
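All of these heartbeat lines appear to reference the same unknown client id. A quick way to confirm that across the whole file (a plain-Python sketch, shown here on the two-line excerpt above):

```python
import re

log_excerpt = """\
I0417 05:50:32.865006 23 node_manager.cc:734] [HeartbeatAdded]: received heartbeat from unknown client id 93a2294c6c338410485494864268d8eeeaf2ecc5
I0417 05:50:32.965395 23 node_manager.cc:734] [HeartbeatAdded]: received heartbeat from unknown client id 93a2294c6c338410485494864268d8eeeaf2ecc5
"""

# Collect the distinct client ids mentioned in the heartbeat warnings.
ids = set(re.findall(r"unknown client id ([0-9a-f]+)", log_excerpt))
print(ids)  # → {'93a2294c6c338410485494864268d8eeeaf2ecc5'}
```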