If I use only nodes 2-4 in TF_CONFIG, the program simply hangs, whereas if I include node01 as the first worker, the script throws the error logs below. I believe node01 is delegated as chief because of its placement at index 0, which may be why it fails further down the line, but the issue still remains.

Code:

import os
import json

import tensorflow as tf
import mnist_setup

per_worker_batch_size = 64
# Read the cluster spec and this worker's task from the TF_CONFIG env var.
tf_config = json.loads(os.environ['TF_CONFIG'])
num_workers = len(tf_config['cluster']['worker'])

# The strategy reads TF_CONFIG at construction time.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

# Scale the batch size by the number of workers.
global_batch_size = per_worker_batch_size * num_workers
multi_worker_dataset = mnist_setup.mnist_dataset(global_batch_size)

with strategy.scope():
  # Model building/compiling need to be within `strategy.scope()`.
  multi_worker_model = mnist_setup.build_and_compile_cnn_model()


multi_worker_model.fit(multi_worker_dataset, epochs=3, steps_per_epoch=70)

TF_CONFIG:

{'cluster': {'worker': ['node01:34425', 'node02:36257']},
 'task': {'type': 'worker', 'index': 1}}
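(For context, this is the TF_CONFIG as seen on node02. Every node gets the identical cluster spec; only the task index differs, and the worker at index 0 acts as chief. A minimal sketch of how I build it, with `make_tf_config` being a hypothetical helper, not part of my actual setup:)

```python
import json
import os

# Hypothetical helper: serialize a TF_CONFIG for a given worker index.
# The 'cluster' section must be identical on every node; only 'index' differs.
def make_tf_config(workers, index):
    return json.dumps({
        "cluster": {"worker": workers},
        "task": {"type": "worker", "index": index},
    })

workers = ["node01:34425", "node02:36257"]
# On node01 (index 0, so it acts as chief):
os.environ["TF_CONFIG"] = make_tf_config(workers, 0)
# On node02:
# os.environ["TF_CONFIG"] = make_tf_config(workers, 1)
```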

Output:

/home/ubuntu/miniforge3/lib/python3.10/site-packages/tensorflow_io/python/ops/__init__.py:98: UserWarning: unable to load libtensorflow_io_plugins.so: unable to open file: libtensorflow_io_plugins.so, from paths: ['/home/ubuntu/miniforge3/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io_plugins.so']
caused by: ["[Errno 2] The file to load file system plugin from does not exist.: '/home/ubuntu/miniforge3/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io_plugins.so'"]
  warnings.warn(f"unable to load libtensorflow_io_plugins.so: {e}")
/home/ubuntu/miniforge3/lib/python3.10/site-packages/tensorflow_io/python/ops/__init__.py:104: UserWarning: file system plugins are not loaded: unable to open file: libtensorflow_io.so, from paths: ['/home/ubuntu/miniforge3/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io.so']
caused by: ['/home/ubuntu/miniforge3/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io.so: cannot open shared object file: No such file or directory']
  warnings.warn(f"file system plugins are not loaded: {e}")
2023-02-04 06:16:07.014328: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:447] Started server with target: grpc://node02:36257
2023-02-04 06:16:07.039956: I tensorflow/core/distributed_runtime/coordination/coordination_service_agent.cc:277] Coordination agent has successfully connected.
2023-02-04 06:16:08.150050: W tensorflow/tsl/framework/cpu_allocator_impl.cc:82] Allocation of 188160000 exceeds 10% of free system memory.
2023-02-04 06:16:12.054935: E tensorflow/core/distributed_runtime/coordination/coordination_service_agent.cc:678] Coordination agent is in ERROR: UNAVAILABLE: failed to connect to all addresses
Additional GRPC error information from remote target /job:worker/replica:0/task:0:

I have tried the other suggestions such as "unset https_proxy" and similar, but that only helped me change localhost -> node01 in TF_CONFIG, with node01 being the only node. I also changed iptables to allow all input/output traffic from the IP of each machine on the local network, still with no success. The code is an excerpt from TensorFlow's tutorial notebook on multi-worker distributed training, with the only modifications being removing the tf-nightly install and changing localhost to the hostnames of my machines.
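(To rule out the networking side, I check from each node that every worker address in TF_CONFIG is actually reachable before launching training. A rough sketch, with `port_reachable` being my own helper and the addresses taken from my TF_CONFIG:)

```python
import socket

# Try a plain TCP connection to host:port; the gRPC server only needs
# the port to be open for this probe to succeed.
def port_reachable(host, port, timeout=3.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused, timed out, and DNS failures
        return False

# Run this on every node: each worker must be able to reach all the others.
for addr in ["node01:34425", "node02:36257"]:
    host, port = addr.rsplit(":", 1)
    print(addr, "reachable" if port_reachable(host, int(port)) else "UNREACHABLE")
```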

I am running TensorFlow 2.11.0.

Link to the tutorial here: https://github.com/tensorflow/docs/blob/master/site/en/tutorials/distribute/multi_worker_with_keras.ipynb
