Suppose we have a simple TensorFlow model with a few convolutional layers. We would like to train this model on a cluster of computers that is not equipped with GPUs. Each computational node of this cluster might have one or more cores. Is this possible out of the box? If not, which packages are able to do that? Are those packages able to perform data and model parallelism?
1 Answer
According to the TensorFlow documentation:
tf.distribute.Strategy is a TensorFlow API to distribute training across multiple GPUs, multiple machines or TPUs.
Beyond GPUs and TPUs, it also supports CPU-only distributed training, as long as all devices can reach each other over the same network.
Yes, you can train the model on multiple machines; you need to define a cluster and worker configuration on each of the participating machines, as shown below.
tf_config = {
    'cluster': {
        'worker': ['localhost:1234', 'localhost:6789']
    },
    'task': {'type': 'worker', 'index': 0}
}
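In practice this dictionary is exported through the TF_CONFIG environment variable on every worker before the strategy is created; a minimal sketch (each worker uses the same cluster spec but its own index):

import json
import os

# TF_CONFIG must be set before the distribution strategy is created;
# worker 0 keeps index 0, worker 1 sets index 1, and so on.
os.environ['TF_CONFIG'] = json.dumps(tf_config)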
For details on the configuration and on training the model, please refer to Multi-worker training with Keras.
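Along the lines of that tutorial, each worker would then create a tf.distribute.MultiWorkerMirroredStrategy at startup; it reads TF_CONFIG automatically and works on CPU-only machines. This is a rough sketch, not a complete script:

import tensorflow as tf

# The strategy reads TF_CONFIG and connects to the other workers.
# Create it as early as possible, before building the model.
strategy = tf.distribute.MultiWorkerMirroredStrategy()
print('Number of replicas in sync:', strategy.num_replicas_in_sync)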
According to this SO answer, tf.distribute.Strategy is integrated into tf.keras, so when model.fit is used with a tf.distribute.Strategy instance and the model is created inside strategy.scope(), the model's variables become distributed variables. This allows the strategy to divide your input data equally across your devices.
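For instance, a small convolutional model like the one in the question could be built inside strategy.scope() and trained with model.fit; this is only a sketch, with a toy model and random dummy data standing in for a real dataset:

import numpy as np
import tensorflow as tf

strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Variables created here become distributed (mirrored) variables.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(16, 3, activation='relu'),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

# Dummy data just to keep the sketch self-contained.
x = np.random.random((64, 28, 28, 1)).astype('float32')
y = np.random.randint(0, 10, size=(64,))
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(16)

# model.fit splits each global batch across the workers (data parallelism).
model.fit(dataset, epochs=2)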
Note: Distributed training pays off mainly with large datasets and complex models, where the performance gain outweighs the extra communication overhead.