I am trying to run distributed HPO on a Slurm cluster, but Ray does not detect the GPUs correctly.
I have a head node with only CPUs that is only supposed to run the scheduler, and X identical worker nodes with 4 GPUs each, but Ray only detects the full 4 GPUs on a single worker node and just one GPU on each of the others.
This is my cluster setup:
ray start --head --temp-dir /tmp/UID/ray/ --node-ip-address=10.13.22.34 --port=6374 --block --dashboard-port 3232 --dashboard-host 0.0.0.0 --num-cpus 2 &
and then I start the worker nodes from within an sbatch script:
srun --gres=gpu:4 --nodes=1 --ntasks=1 -w "\${node_i}" ray start --address 10.13.22.34:6374 --block &
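Once both commands have come up, I check what the cluster reports from a Python session on the head node, roughly like this (a minimal sketch; the address is the same as in the head command above):
import ray

# Attach to the already running cluster started by the commands above.
ray.init(address="10.13.22.34:6374")

# One dict per raylet; the per-node 'Resources' entry is where the GPU count shows up.
print(ray.nodes())
print(ray.cluster_resources())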
The output of ray.nodes() and ray.cluster_resources():
[{'NodeID': '3b793f590f23b9a3f8eb4da86f870a2c6a20967ea1ea1093945fb2d9', 'Alive': True, 'NodeManagerAddress': '10.13.31.131', 'NodeManagerHostname': 'jwb0801.juwels', 'NodeManagerPort': 44655, 'ObjectManagerPort': 40891, 'ObjectStoreSocketName': '/tmp/ray/session_2023-04-05_09-43-27_453188_1518/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2023-04-05_09-43-27_453188_1518/sockets/raylet', 'MetricsExportPort': 43462, 'NodeName': '10.13.31.131', 'alive': True, 'Resources': {'CPU': 96.0, 'memory': 359211870618.0, 'GPU': 4.0, 'node:10.13.31.131': 1.0, 'object_store_memory': 153947944550.0, 'accelerator_type:A100': 1.0}},
 {'NodeID': '5f803a7dd410074ea9d3fb3058b3ab3044011c54536c7808bd9a947c', 'Alive': True, 'NodeManagerAddress': '10.13.22.34', 'NodeManagerHostname': 'jwlogin24.juwels', 'NodeManagerPort': 34927, 'ObjectManagerPort': 41163, 'ObjectStoreSocketName': '/tmp/UID/ray/session_2023-04-05_09-43-27_453188_1518/sockets/plasma_store', 'RayletSocketName': '/tmp/UID/ray/session_2023-04-05_09-43-27_453188_1518/sockets/raylet', 'MetricsExportPort': 60382, 'NodeName': '10.13.22.34', 'alive': True, 'Resources': {'memory': 216577999872.0, 'CPU': 2.0, 'node:10.13.22.34': 1.0, 'object_store_memory': 97104857088.0}},
 {'NodeID': 'cf76300b5a3d269db3f859ab1b3f87ce03d8ebd344a37bd7421c2dd0', 'Alive': True, 'NodeManagerAddress': '10.13.31.137', 'NodeManagerHostname': 'jwb0807.juwels', 'NodeManagerPort': 34425, 'ObjectManagerPort': 34533, 'ObjectStoreSocketName': '/tmp/ray/session_2023-04-05_09-43-27_453188_1518/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2023-04-05_09-43-27_453188_1518/sockets/raylet', 'MetricsExportPort': 62643, 'NodeName': '10.13.31.137', 'alive': True, 'Resources': {'accelerator_type:A100': 1.0, 'object_store_memory': 154205725900.0, 'CPU': 96.0, 'GPU': 1.0, 'memory': 359813360436.0, 'node:10.13.31.137': 1.0}}]
{'CPU': 194.0, 'GPU': 5.0, 'accelerator_type:A100': 2.0, 'memory': 935603230926.0, 'object_store_memory': 405258527538.0, 'node:10.13.31.131': 1.0, 'node:10.13.22.34': 1.0, 'node:10.13.31.137': 1.0}
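To see where one-GPU tasks actually end up and which device they are given, I can also launch a handful of them and print their placement (a throwaway diagnostic sketch; report_gpu is not part of my tuning code, and the cluster connection from above is reused):
import os
import socket
import ray

@ray.remote(num_gpus=1)
def report_gpu():
    # Ray sets CUDA_VISIBLE_DEVICES to the GPUs it assigned to this task,
    # so this shows which node and which GPU slot the task really got.
    return (socket.gethostname(),
            ray.get_gpu_ids(),
            os.environ.get("CUDA_VISIBLE_DEVICES"))

# Only launch as many tasks as Ray claims to have GPUs, so nothing can
# hang waiting for an unschedulable task (here that is 5 rather than 8).
n_gpus = int(ray.cluster_resources().get("GPU", 0))
for placement in ray.get([report_gpu.remote() for _ in range(n_gpus)]):
    print(placement)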
The tuner I am running:
import os
import ray
from ray import tune
from ray.air import FailureConfig, RunConfig
from ray.tune import Tuner, TuneConfig

# fun_to_tune, parameters, name, rep and mk_tname are defined elsewhere
# in my script; I want one concurrent trial per GPU that Ray reports.
num_gpu = int(ray.cluster_resources()["GPU"])
tuner = Tuner(
    tune.with_resources(
        trainable=fun_to_tune,
        resources=tune.PlacementGroupFactory([{"CPU": 0, "GPU": 1}]),
    ),
    param_space=parameters,
    run_config=RunConfig(
        name, verbose=3,
        local_dir=os.path.join(os.environ["SCR"], "ray_results"),
        progress_reporter=rep,
        failure_config=FailureConfig(fail_fast=True),
    ),
    tune_config=TuneConfig(
        metric="mean_JSD_impr", mode="max",
        num_samples=-1, time_budget_s=3600 * 4,
        trial_dirname_creator=mk_tname,
        max_concurrent_trials=num_gpu,
    ),
)
tuner.fit()
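To rule out my own training code, I could also swap in a stripped-down stand-in for fun_to_tune that only reports where each trial was placed (just a sketch, not my actual trainable):
import os
import socket
import ray
from ray.air import session

def report_placement(config):
    # Stand-in trainable: report which node and GPU this trial was assigned.
    session.report({
        "hostname": socket.gethostname(),
        "gpu_ids": str(ray.get_gpu_ids()),
        "cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES", ""),
        "mean_JSD_impr": 0.0,  # dummy value for the metric in TuneConfig
    })
That way the results table directly shows which node and device each trial landed on.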
The tuner launches 5 concurrent trials because Ray only sees 5 of the 8 available GPUs, and all 5 run successfully. I really need all 4 GPUs on each worker node, though, so how can I fix this?