I'm trying to test a simple script using two remote Dask workers. I don't understand what is going wrong or what the error refers to.
The code is simple:
#!/usr/bin/python3
from cuml.dask.cluster import KMeans
from cuml.dask.datasets import make_blobs
from dask.distributed import Client

# Connect to the remote scheduler
c = Client("dask-scheduler:8786")

# Generate a distributed synthetic dataset with 5 blob centers
centers = 5
X, _ = make_blobs(n_samples=10000, centers=centers)

# Fit multi-GPU KMeans and predict cluster labels
k_means = KMeans(n_clusters=centers)
k_means.fit(X)
labels = k_means.predict(X)
It connects, but when it tries to run the clustering algorithm, it throws the following error:
Traceback (most recent call last):
  File "test_cuml.py", line 15, in <module>
    k_means.fit(X)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/cuml/common/memory_utils.py", line 93, in cupy_rmm_wrapper
    return func(*args, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/cuml/dask/cluster/kmeans.py", line 161, in fit
    comms.init(workers=data.workers)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/cuml/raft/dask/common/comms.py", line 209, in init
    wait=True,
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/client.py", line 2506, in run
    return self.sync(self._run, function, *args, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/client.py", line 869, in sync
    self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/utils.py", line 332, in sync
    raise exc.with_traceback(tb)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/utils.py", line 315, in f
    result[0] = yield future
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/client.py", line 2443, in _run
    raise exc.with_traceback(tb)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/cuml/raft/dask/common/comms.py", line 429, in _func_init_all
    _func_init_nccl(sessionId, uniqueId)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/cuml/raft/dask/common/comms.py", line 484, in _func_init_nccl
    n.init(nWorkers, uniqueId, wid)
  File "cuml/raft/dask/common/nccl.pyx", line 151, in cuml.raft.dask.common.nccl.nccl.init
The workers are reporting this issue:
distributed.worker - INFO - Run out-of-band function '_func_init_all'
distributed.worker - WARNING - Run Failed
Function: _func_init_all
args: (b'\x95d$\x89\x9beI\xf5\xa7\x8c7M\xe8V[v', b'\x02\x00\xc8\xdd\x8fj\x07\x90\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', False, {'tcp://dask-scheduler:40439': {'rank': 0}, 'tcp://dask-scheduler:39645': {'rank': 1}}, False, 0)
kwargs: {}
Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/worker.py", line 4553, in run
    result = await function(*args, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/cuml/raft/dask/common/comms.py", line 429, in _func_init_all
    _func_init_nccl(sessionId, uniqueId)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/cuml/raft/dask/common/comms.py", line 484, in _func_init_nccl
    n.init(nWorkers, uniqueId, wid)
  File "cuml/raft/dask/common/nccl.pyx", line 151, in cuml.raft.dask.common.nccl.nccl.init
RuntimeError: NCCL_ERROR: b'invalid usage'
Does anyone know what is happening or how to mitigate this? The error is not clear to me. I have tried several versions of RAPIDS. IMPORTANT: I'm running in a Docker environment, sharing all GPUs (--gpus all) and network settings (--network host).
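One thing that may be worth ruling out (an assumption, not a confirmed diagnosis): NCCL can report 'invalid usage' when two ranks end up on the same GPU, and both worker addresses in the worker log are on the same host (dask-scheduler). The sketch below checks for that situation. The helper name find_shared_gpus is hypothetical, the CUDA_VISIBLE_DEVICES values in the example are made up, and treating an unset variable as "device 0" is an assumption. Setting the NCCL_DEBUG=INFO environment variable on the workers should also make NCCL print more detail about the failure.

```python
from collections import defaultdict
from urllib.parse import urlparse


def find_shared_gpus(visible_devices):
    """Return groups of workers on the same host whose first visible GPU is the same.

    visible_devices maps a worker address to that worker's
    CUDA_VISIBLE_DEVICES value (None or "" when the variable is unset).
    """
    groups = defaultdict(list)
    for address, value in visible_devices.items():
        host = urlparse(address).hostname
        # Assumption: with the variable unset, the process uses device 0.
        first_gpu = value.split(",")[0].strip() if value else "0"
        groups[(host, first_gpu)].append(address)
    # Keep only the (host, GPU) pairs claimed by more than one worker.
    return {key: sorted(addrs) for key, addrs in groups.items() if len(addrs) > 1}


# On a live cluster, the mapping could be collected with Client.run, which
# returns a dict keyed by worker address:
#   import os
#   visible_devices = c.run(lambda: os.environ.get("CUDA_VISIBLE_DEVICES"))
# Here the two worker addresses from the log are used with made-up values.
example = {
    "tcp://dask-scheduler:40439": None,
    "tcp://dask-scheduler:39645": None,
}
shared = find_shared_gpus(example)
print(shared)
```

An empty result would mean each worker on a host has a different first visible GPU. A non-empty result would suggest that the workers were started without per-worker GPU assignment, which is something to verify against how the workers were launched.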