I have been trying to run the DGA detection streamz on the rapidsai clx streamz Docker container for the last few days without any resolution. I'm following the instructions on the RAPIDS website: https://docs.rapids.ai/api/clx/legacy/intro-clx-streamz.html. I'm able to build the Docker image from the Dockerfile and run the container, but when I try to run the DGA streamz it fails with the following error:
/opt/conda/envs/rapids/lib/python3.10/site-packages/dask_cuda/utils.py:257: UserWarning: Cannot get CPU affinity for device with index 0, setting default affinity
warnings.warn(
Creating local cuda cluster as no dask scheduler is provided.
2023-05-02 20:22:18,194 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2023-05-02 20:22:18,194 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2023-05-02 20:22:20,839 - distributed.worker - WARNING - Run Failed
Function: worker_init
args: ()
kwargs: {}
Traceback (most recent call last):
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/distributed/worker.py", line 3233, in run
result = function(*args, **kwargs)
File "/opt/clx_streamz/python/dga_detection.py", line 37, in worker_init
worker = dask.distributed.get_worker()
AttributeError: module 'dask' has no attribute 'distributed'
<Client: 'tcp://127.0.0.1:33711' processes=1 threads=1, memory=15.25 GiB>
Traceback (most recent call last):
File "/opt/clx_streamz/python/dga_detection.py", line 53, in <module>
dga_detection.start()
File "/opt/clx_streamz/python/clx_streamz_tools/streamz_workflow.py", line 141, in start
client.run(self.worker_init)
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/distributed/client.py", line 2916, in run
return self.sync(
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/distributed/utils.py", line 338, in sync
return sync(
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/distributed/utils.py", line 405, in sync
raise exc.with_traceback(tb)
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/distributed/utils.py", line 378, in f
result = yield future
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/tornado/gen.py", line 769, in run
value = future.result()
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/distributed/client.py", line 2821, in _run
raise exc
File "/opt/clx_streamz/python/dga_detection.py", line 37, in worker_init
worker = dask.distributed.get_worker()
AttributeError: module 'dask' has no attribute 'distributed'
2023-05-02 20:22:20,849 - distributed.scheduler - WARNING - Received heartbeat from unregistered worker 'tcp://127.0.0.1:37355'.
2023-05-02 20:22:20,849 - distributed.worker - ERROR - Failed to communicate with scheduler during heartbeat.
Traceback (most recent call last):
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/distributed/comm/tcp.py", line 225, in read
frames_nbytes = await stream.read_bytes(fmt_size)
tornado.iostream.StreamClosedError: Stream is closed
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/distributed/worker.py", line 1215, in heartbeat
response = await retry_operation(
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/distributed/utils_comm.py", line 419, in retry_operation
return await retry(
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/distributed/utils_comm.py", line 404, in retry
return await coro()
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/distributed/core.py", line 1221, in send_recv_from_rpc
return await send_recv(comm=comm, op=key, **kwargs)
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/distributed/core.py", line 986, in send_recv
response = await comm.read(deserializers=deserializers)
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/distributed/comm/tcp.py", line 241, in read
convert_stream_closed_error(self, e)
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/distributed/comm/tcp.py", line 144, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) ConnectionPool.heartbeat_worker local=tcp://127.0.0.1:52698 remote=tcp://127.0.0.1:33711>: Stream is closed
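From the traceback, the failing call is dask.distributed.get_worker() inside worker_init in /opt/clx_streamz/python/dga_detection.py. My guess (an assumption on my part, not something stated in the CLX docs) is that with the dask version in this image, "import dask" alone no longer exposes the dask.distributed submodule as an attribute, so the submodule has to be imported explicitly. A minimal sketch of what I would expect to work instead; worker_init is just the hook name taken from the traceback, and the body is hypothetical:

# Hypothetical sketch, not the actual CLX code.
# Importing the submodule explicitly avoids the AttributeError above.
import dask.distributed


def worker_init():
    # get_worker() returns the worker this function is running on
    worker = dask.distributed.get_worker()
    print(f"initializing worker {worker.address}")

Is this a known incompatibility between the CLX streamz scripts and the dask/distributed versions shipped in the image, or am I missing a setup step?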
Here are my GPU specs:
nvidia-smi
Tue May 2 16:31:18 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.50 Driver Version: 531.79 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3080 T... On | 00000000:01:00.0 On | N/A |
| N/A 39C P8 11W / N/A| 464MiB / 16384MiB | 5% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
The Docker image that I'm using is:
docker.io/rapidsai/rapidsai-clx:23.02-cuda11.8-runtime-ubuntu20.04-py3.10
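In case it helps, this is the quick sanity check I can run inside the container's rapids conda environment (the /opt/conda/envs/rapids env from the traceback) to report which dask and distributed versions the 23.02 image ships with; I can post the output if that's useful:

# Run with the container's rapids env Python to print installed versions
import dask
import distributed

print("dask:", dask.__version__)
print("distributed:", distributed.__version__)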