Hi I am using ML Engine with a custom tier made up of a complex_m master, four workers each with a GPU and one complex_m as parameter server.
The model is training a CNN. However, there seem to be trouble with the workers. This is an image of the logs https://i.stack.imgur.com/VJqE0.png.
The master still seems to be working because there are session checkpoints being saved, however, this is nowwhere near the speed it should be.
With complex_m workers, the model works. It just gives a waiting for the model to be ready in the beginning (i assume it is until the master intializes global variables, correct me if i am wrong) and then works normally. With GPUs however there seem to be a problem with the task.
I didnt' use the tf.Device() function anywhere, in the cloud i thought the device is set automatically if a GPU is available.
I followed the Census example and loaded the TF_CONFIG environment variable.
tf.logging.info('Setting up the server')
tf_config = os.environ.get('TF_CONFIG')
# If TF_CONFIG is not available run local
if not tf_config:
return run('', True, *args, **kwargs)
tf_config_json = json.loads(tf_config)
cluster = tf_config_json.get('cluster')
job_name = tf_config_json.get('task', {}).get('type')
task_index = tf_config_json.get('task', {}).get('index')
# If cluster information is empty run local
if job_name is None or task_index is None:
return run('', True, *args, **kwargs)
cluster_spec = tf.train.ClusterSpec(cluster)
server = tf.train.Server(cluster_spec,
job_name=job_name,
task_index=task_index)
# Wait for incoming connections forever
# Worker ships the graph to the ps server
# The ps server manages the parameters of the model.
if job_name == 'ps':
server.join()
return
elif job_name in ['master', 'worker']:
return run(server.target, job_name == 'master', *args, **kwargs)
Then used the tf.replica_device_setter before defining the main graph.
As a session i am using tf.train.MonitoredTrainingSession, this should handle the initialization of variables and checkpoint saving. I do not know why the workers are saying that the variables are not initialized.
Variables to be initialized are all variables: https://i.stack.imgur.com/hAHPL.png
Optimizer: AdaDelta
I appreciate the help!