
I have access to a cluster of nodes, and my understanding was that once I started Ray on each node with the same Redis address, the head node would have access to the resources of all of the nodes.

My main script is:

export LC_ALL=en_US.utf-8
export LANG=en_US.utf-8 # required for using python 3 with click
source activate rllab3

redis_address="$(hostname --ip-address)"
echo $redis_address
redis_address="$redis_address:59465"
# Start the head node, serving Redis on a fixed port.
~/.conda/envs/rllab3/bin/ray start --head --redis-port=59465

# Launch Ray on every other host in the Slurm allocation, pointing it at the head node.
for host in $(srun hostname | grep -v $(hostname)); do
    ssh $host setup_node.sh $redis_address
done

python test_multi_node.py $redis_address

setup_node.sh is:

export LC_ALL=en_US.utf-8
export LANG=en_US.utf-8

source activate rllab3

echo "redis address is $1"

~/.conda/envs/rllab3/bin/ray start --redis-address=$1

and test_multi_node.py is:

import ray
import time
import argparse

parser = argparse.ArgumentParser(description="ray multinode test")
parser.add_argument("redis_address", type=str, help="ip:port")
args = parser.parse_args()
print("in python script redis address is:", args.redis_address)

ray.init(redis_address=args.redis_address)
print("resources:", ray.services.check_and_update_resources(None, None, None))

@ray.remote
def f():
    time.sleep(0.01)
    return ray.services.get_node_ip_address()

# Get a list of the IP addresses of the nodes that have joined the cluster.
print(set(ray.get([f.remote() for _ in range(10000)])))

Ray seems to start successfully on all nodes, and the Python script prints as many IP addresses as I have nodes (and they are correct). However, when I print the resources, it only shows the resources of one node.

How can I make Ray have access to the resources of all of the nodes? I must have a fundamental misunderstanding, because I thought the point of starting Ray on the other nodes was to give the head node access to their resources.

According to this, Ray should autodetect the resources on a new node, so I don't know what's going on here.

Lubed Up Slug

2 Answers


The method ray.services.check_and_update_resources is an internal method and isn't intended to be part of the public API. You can check the cluster's total resources with ray.global_state.cluster_resources(), and inspect the per-node entries with ray.global_state.client_table().
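For example, a minimal sketch of using both calls (assuming a pre-0.8 Ray cluster is already running; the head-node address shown is hypothetical):

```python
def report_cluster_resources(redis_address):
    """Print aggregate and per-node resources using the pre-0.8 ray.global_state API."""
    import ray  # imported inside the function so the sketch can be read without Ray installed
    ray.init(redis_address=redis_address)
    # Resource totals summed across every node that has joined the cluster.
    print("cluster totals:", ray.global_state.cluster_resources())
    # One entry per node, including each node's own resource dict.
    print("per-node table:", ray.global_state.client_table())

# Usage (run on the head node; the address below is hypothetical):
# report_cluster_resources("10.0.0.1:59465")
```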

Robert Nishihara
  • Why does [the example](https://ray.readthedocs.io/en/latest/using-ray-on-a-cluster.html#starting-ray-on-each-machine) given in the documentation not use either of these methods to check that the setup was correct? Is there other documentation on this that I am missing? – Lubed Up Slug May 06 '19 at 05:29
  • There's no good reason for that. I think that would be a nice improvement to the documentation. – Robert Nishihara May 06 '19 at 07:29
  • Thank you. On Ray 0.9+ I was able to use ray.cluster_resources() and ray.nodes(); see the docs: 'Inspect the Cluster State'. – DMTishler Mar 25 '20 at 04:44

On newer versions of Ray (0.8.2+ as tested here) we can try:

Inspect the Cluster State (https://ray.readthedocs.io/en/latest/package-ref.html#inspect-the-cluster-state). Example output for a single-machine system:

print(ray.nodes())
"""
[{'NodeID': <ID>, 'Alive': True, 'NodeManagerAddress': <IP>,
  'NodeManagerHostname': <HOSTNAME>, 'NodeManagerPort': <PORT>,
  'ObjectManagerPort': 32799,
  'ObjectStoreSocketName': '/tmp/ray/session_2020-03-25_00-42-55_127146_1246/sockets/plasma_store',
  'RayletSocketName': '/tmp/ray/session_2020-03-25_00-42-55_127146_1246/sockets/raylet',
  'Resources': {'node:<IP>': 1.0, 'GPU': 1.0, 'CPU': 8.0, 'memory': 160.0, 'object_store_memory': 55.0},
  'alive': True}]
"""

Resource Information (https://ray.readthedocs.io/en/latest/advanced.html). As mentioned in the other answer, methods like cluster_resources or available_resources can fetch resource info specifically:

print(ray.cluster_resources()) 
# {'node:<IP>': 1.0, 'GPU': 1.0, 'CPU': 8.0, 'memory': 160.0, 'object_store_memory': 55.0}
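To illustrate the relationship between the two calls (pure Python, no cluster required; the two-node data below is hypothetical), summing the per-node 'Resources' dicts returned by ray.nodes() reproduces the totals that ray.cluster_resources() reports:

```python
from collections import Counter

def total_resources(nodes):
    """Sum the 'Resources' dict of every live node, mirroring ray.cluster_resources()."""
    totals = Counter()
    for node in nodes:
        if node.get("Alive"):
            totals.update(node.get("Resources", {}))
    return dict(totals)

# Hypothetical two-node cluster, shaped like the ray.nodes() output shown above.
nodes = [
    {"Alive": True, "Resources": {"CPU": 8.0, "GPU": 1.0, "memory": 160.0}},
    {"Alive": True, "Resources": {"CPU": 8.0, "memory": 160.0}},
]
print(total_resources(nodes))  # {'CPU': 16.0, 'GPU': 1.0, 'memory': 320.0}
```

If the cluster is misconfigured as in the question, this total will only reflect the head node's resources, which makes it a quick sanity check that every worker actually joined.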
DMTishler