
Hi, I am using ML Engine with a custom tier made up of a complex_m master, four workers each with a GPU, and one complex_m as parameter server.

The model is training a CNN. However, there seems to be trouble with the workers. This is an image of the logs: https://i.stack.imgur.com/VJqE0.png.

The master still seems to be working because session checkpoints are being saved; however, this is nowhere near the speed it should be.

With complex_m workers, the model works. It just logs "waiting for the model to be ready" at the beginning (I assume until the master initializes the global variables; correct me if I am wrong) and then works normally. With GPUs, however, there seems to be a problem with the task.

I didn't use tf.device() anywhere; I thought that in the cloud the device is set automatically if a GPU is available.

I followed the Census example and loaded the TF_CONFIG environment variable:

tf.logging.info('Setting up the server')
tf_config = os.environ.get('TF_CONFIG')

# If TF_CONFIG is not available run local
if not tf_config:
    return run('', True, *args, **kwargs)

tf_config_json = json.loads(tf_config)

cluster = tf_config_json.get('cluster')
job_name = tf_config_json.get('task', {}).get('type')
task_index = tf_config_json.get('task', {}).get('index')

# If cluster information is empty run local
if job_name is None or task_index is None:
    return run('', True, *args, **kwargs)

cluster_spec = tf.train.ClusterSpec(cluster)
server = tf.train.Server(cluster_spec,
                         job_name=job_name,
                         task_index=task_index)

# Wait for incoming connections forever
# Worker ships the graph to the ps server
# The ps server manages the parameters of the model.
if job_name == 'ps':
    server.join()
    return
elif job_name in ['master', 'worker']:
    return run(server.target, job_name == 'master', *args, **kwargs)

Then I used tf.train.replica_device_setter before defining the main graph.
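Roughly like this (a simplified sketch, not my exact code; build_graph is a placeholder for the CNN construction, and whether cluster_spec is passed in here is exactly what the comments below turn out to be about):

# Simplified sketch: pin variables to the PS and ops to this worker/master task.
# cluster_spec, job_name and task_index come from TF_CONFIG as parsed above;
# build_graph is a placeholder for the actual CNN-building code.
device_fn = tf.train.replica_device_setter(
    cluster=cluster_spec,
    worker_device='/job:%s/task:%d' % (job_name, task_index))

with tf.device(device_fn):
    train_op, global_step = build_graph()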

As a session I am using tf.train.MonitoredTrainingSession, which should handle the initialization of variables and checkpoint saving. I do not know why the workers are saying that the variables are not initialized.
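The training loop looks roughly like this (again a simplified sketch; job_dir and train_op are placeholders):

# Simplified sketch: the chief (master) initializes the variables and saves
# checkpoints; the workers block until the chief has done so.
is_chief = (job_name == 'master')

with tf.train.MonitoredTrainingSession(master=server.target,
                                       is_chief=is_chief,
                                       checkpoint_dir=job_dir,
                                       save_checkpoint_secs=600) as sess:
    while not sess.should_stop():
        sess.run(train_op)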

The variables reported as waiting to be initialized are all of the graph's variables: https://i.stack.imgur.com/hAHPL.png

Optimizer: AdaDelta

I appreciate the help!

Mark
  • Did I understand correctly that you are using 4 GPUs per worker? – rhaertel80 Jun 27 '17 at 22:22
  • No, four workers each with one GPU – Mark Jun 27 '17 at 22:31
  • I see you have one parameter server and you are not setting any devices. Which optimizer are you using? Also, is it possible to provide the names of the variables that aren't loading (they're cut out of the screenshot -- perhaps intentionally)? – rhaertel80 Jun 27 '17 at 23:28
  • Updated the question with the variable names. However, all the variables of the graph are present. It seems the variables are not being initialized by the MonitoredTrainingSession. Optimizer is AdaDelta. Furthermore, in some other job attempts I also get various repetitive logs stating: "CreateSession still waiting for response from worker: ..." – Mark Jun 27 '17 at 23:42
  • Are you willing to share your code? – rhaertel80 Jun 29 '17 at 06:39
  • All the code is here: https://github.com/markcutajar/CNNMusicTagging In the cloud/trainer folder. Thanks :) – Mark Jun 29 '17 at 11:03
  • Can you add log_device_placement=True to your ConfigProto and provide the results? – rhaertel80 Jun 30 '17 at 05:29
  • I think I solved the problem. Might mean there is an error in the Census tutorial. From the dispatch function in task.py, I pass the cluster_spec information to the run method of the workers/master and then use it as an argument in the replica_device_setter fn. Not sure if this was related or it was just incidental that it started working. Speeds are still higher on a CPU cluster though; I thought the GPU would make more of a difference. – Mark Jun 30 '17 at 11:20
  • Using log_device_placement, most variables are placed on the PS task and only some operations are on the GPUs. On a cluster of 5 CPUs the speed is 17 steps/sec; on GPUs it's 13 steps/sec. Not sure why the GPUs are slower. They should have higher throughput. Maybe the CNN is too simple for a GPU? – Mark Jun 30 '17 at 11:22
  • Also, I'm submitting a fix to the example, thanks for noticing. – rhaertel80 Jul 07 '17 at 00:34
  • No problem! I am not entirely sure though. Without the cluster passed on to the run function and specified in the replica setter, the job still works with just CPUs. The only thing I noticed is that in TensorBoard, without specifying the cluster, there is no device allocation for variables in the graph view and all nodes show as allocated to an unknown device. I think it needs more testing of what should be specified, and how, exactly. – Mark Jul 07 '17 at 10:43
  • I think that without cluster_spec, you're effectively running training 5 times independently on 1 machine. – rhaertel80 Jul 07 '17 at 21:27
  • I don't think so, as during the evaluation hook after the checkpoints, the global step advances a considerable amount even though the master would be occupied by the evaluation process. It would be nice to have a confirmation of best practice though :) – Mark Jul 08 '17 at 00:59

1 Answer


In the comments, you seem to have answered your own question (passing cluster_spec to replica_device_setter). Allow me to address the issue of throughput of a cluster of CPUs vs. a cluster of GPUs.

GPUs are fairly powerful. You'll typically get higher throughput from a single machine with many GPUs than from many machines each with a single GPU. That's because the communication overhead becomes a bottleneck (the bandwidth and latency to main memory on the same machine are much better than communicating with a parameter server on a remote machine).

The reason the GPUs are slower than the CPUs may be the extra overhead of copying data from main memory to the GPU and back. If you're doing a lot of parallelizable computation, this copy is negligible. Your model may be doing too little work on the GPU, so the overhead swamps the actual computation.

For more information about building high performance models, see this guide.

In the meantime, I recommend using a single machine with more GPUs to see if that helps:

{
  "scaleTier": "CUSTOM",
  "masterType": "complex_model_l_gpu",
  ...
}

Just beware that you'll have to modify your code to assign ops to the right GPUs, probably using towers.
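For illustration only (a rough sketch, not a drop-in solution; NUM_GPUS, tower_loss, and next_batch are placeholders), a tower setup looks roughly like this:

import tensorflow as tf

NUM_GPUS = 4  # placeholder: set to the number of GPUs on the machine

optimizer = tf.train.AdadeltaOptimizer()
tower_grads = []

# Build one copy ("tower") of the model per GPU and collect its gradients.
for i in range(NUM_GPUS):
    with tf.device('/gpu:%d' % i), tf.name_scope('tower_%d' % i):
        loss = tower_loss(next_batch())  # placeholders for model + input pipeline
        tower_grads.append(optimizer.compute_gradients(loss))

# Average the per-tower gradients and apply a single update
# (assumes every tower produces a gradient for every variable).
with tf.device('/cpu:0'):
    averaged = []
    for grad_and_vars in zip(*tower_grads):
        grads = tf.stack([g for g, _ in grad_and_vars])
        averaged.append((tf.reduce_mean(grads, axis=0), grad_and_vars[0][1]))
    train_op = optimizer.apply_gradients(averaged)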

rhaertel80