
I'm trying to run a simple cuML test script against two remote Dask workers. It fails, and I can't tell what the error refers to.

The code is simple:

#!/usr/bin/python3

from cuml.dask.cluster import KMeans
from cuml.dask.datasets import make_blobs

from dask.distributed import Client

c = Client("dask-scheduler:8786")

centers = 5

X, _ = make_blobs(n_samples=10000, centers=centers)

k_means = KMeans(n_clusters=centers)
k_means.fit(X)

labels = k_means.predict(X)

The client connects, but when it tries to run the clustering algorithm, it throws the following error:

Traceback (most recent call last):
  File "test_cuml.py", line 15, in <module>
    k_means.fit(X)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/cuml/common/memory_utils.py", line 93, in cupy_rmm_wrapper
    return func(*args, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/cuml/dask/cluster/kmeans.py", line 161, in fit
    comms.init(workers=data.workers)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/cuml/raft/dask/common/comms.py", line 209, in init
    wait=True,
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/client.py", line 2506, in run
    return self.sync(self._run, function, *args, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/client.py", line 869, in sync
    self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/utils.py", line 332, in sync
    raise exc.with_traceback(tb)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/utils.py", line 315, in f
    result[0] = yield future
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/client.py", line 2443, in _run
    raise exc.with_traceback(tb)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/cuml/raft/dask/common/comms.py", line 429, in _func_init_all
    _func_init_nccl(sessionId, uniqueId)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/cuml/raft/dask/common/comms.py", line 484, in _func_init_nccl
    n.init(nWorkers, uniqueId, wid)
  File "cuml/raft/dask/common/nccl.pyx", line 151, in cuml.raft.dask.common.nccl.nccl.init

The workers are reporting this issue:

distributed.worker - INFO - Run out-of-band function '_func_init_all'
distributed.worker - WARNING - Run Failed
Function: _func_init_all
args:     (b'\x95d$\x89\x9beI\xf5\xa7\x8c7M\xe8V[v', b'\x02\x00\xc8\xdd\x8fj\x07\x90\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', False, {'tcp://dask-scheduler:40439': {'rank': 0}, 'tcp://dask-scheduler:39645': {'rank': 1}}, False, 0)
kwargs:   {}
Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/distributed/worker.py", line 4553, in run
    result = await function(*args, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/cuml/raft/dask/common/comms.py", line 429, in _func_init_all
    _func_init_nccl(sessionId, uniqueId)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/cuml/raft/dask/common/comms.py", line 484, in _func_init_nccl
    n.init(nWorkers, uniqueId, wid)
  File "cuml/raft/dask/common/nccl.pyx", line 151, in cuml.raft.dask.common.nccl.nccl.init
RuntimeError: NCCL_ERROR: b'invalid usage'

Does anyone know what is happening or how to mitigate it? The error message is not clear to me. I have tried several versions of RAPIDS. IMPORTANT: I'm running in a Docker environment with all GPUs shared (--gpus all) and host networking (--network host).
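
One check that might narrow this down: in the worker log above, both ranks are registered under the same hostname (tcp://dask-scheduler:40439 and tcp://dask-scheduler:39645). One possible cause of NCCL's 'invalid usage' is two ranks ending up on the same GPU, although nothing in the log confirms that this is what happens here. The sketch below asks every worker for its hostname and its CUDA_VISIBLE_DEVICES value through Client.run, then groups workers that appear to share a device. The helper names (probe_worker, find_shared_gpus, check_cluster) are hypothetical, and treating the first visible device as the one a worker uses is an assumption that matches how dask-cuda workers are normally set up.

```python
import os
import socket
from collections import defaultdict


def probe_worker():
    """Runs on a worker: report its hostname and the GPUs it is allowed to see."""
    return {
        "host": socket.gethostname(),
        "visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>"),
    }


def find_shared_gpus(reports):
    """Return {(host, first_device): [worker addresses]} for groups with >1 worker.

    Assumption: a worker computes on the first device listed in
    CUDA_VISIBLE_DEVICES. Workers with the variable unset on the same host
    are grouped together, since they would all default to device 0.
    """
    groups = defaultdict(list)
    for address, info in reports.items():
        first_device = info["visible_devices"].split(",")[0]
        groups[(info["host"], first_device)].append(address)
    return {key: sorted(addrs) for key, addrs in groups.items() if len(addrs) > 1}


def check_cluster(scheduler_address="dask-scheduler:8786"):
    """Connect to the scheduler and print what each worker reports."""
    # Imported here so the helpers above can be used without dask installed.
    from dask.distributed import Client

    client = Client(scheduler_address)
    # Client.run executes the function on every worker and returns
    # a dict keyed by worker address.
    reports = client.run(probe_worker)
    for address, info in reports.items():
        print(address, info)
    print("workers that may share a GPU:", find_shared_gpus(reports))
```

Calling check_cluster() from a session that can reach the scheduler prints one line per worker. If two workers show up in the same group, that points at device sharing rather than at cuML itself. Independently of this, setting NCCL_DEBUG=INFO in the worker environment before the workers start makes NCCL log more detail about why initialization failed.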

jcfaracco
  • How did you create your cluster, what GPUs/driver are you using, and what version of cuML/Dask? – Nick Becker Dec 27 '21 at 16:08
  • @NickBecker I'm creating the cluster on a different machine using the core nightly Docker image provided by RAPIDS. The scheduler and the workers each run as a Docker instance. I'm using CUDA 11.2 because my GPU (a GTX 1080) supports CUDA 11.3. – jcfaracco Dec 27 '21 at 16:12
  • I found this bug, but it seems to be fixed: https://github.com/rapidsai/cuml/issues/3261 – jcfaracco Dec 27 '21 at 16:13
