
I built a distributed tensorflow program using tf.estimator.Estimator, tf.contrib.learn.Experiment and tf.contrib.learn.learn_runner.run.

For now it seems to work fine. However, the tensorflow distributed tutorial uses tf.train.replica_device_setter to pin operations to jobs.

My model function does not use any with tf.device annotation. Is this done automatically by the Experiment class, or am I missing an important point?

I am also not sure why there is a need to assign operations to certain devices at all when I am using data parallelism.

Thanks for any help and hints on this, Tobias


1 Answer


Variables and ops are placed by tf.estimator.Estimator itself, which internally uses replica_device_setter. It assigns variables to ps jobs and ops to worker jobs, which is the common way to handle data parallelism.
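For reference, a minimal sketch of how the Estimator/Experiment machinery learns about the cluster (hostnames here are made up): the RunConfig reads the TF_CONFIG environment variable and the Estimator builds the replica_device_setter from it, which is why your model function needs no explicit with tf.device.

    import json
    import os

    # Hypothetical hosts; each process gets the same "cluster" dict but its
    # own "task" entry describing which job/index it is.
    os.environ["TF_CONFIG"] = json.dumps({
        "cluster": {
            "ps": ["ps0.example.com:2222"],
            "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
            "master": ["master0.example.com:2222"],
        },
        "task": {"type": "worker", "index": 0},
    })
    # tf.contrib.learn.RunConfig() parses TF_CONFIG, and the Estimator uses it
    # to set up device placement internally.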

replica_device_setter returns a device function that assigns ops and variables to devices. Even if you're only using data parallelism, you might have several parameter servers, and the device function ensures each parameter server gets its own subset of the variables (determined by the ps_strategy of replica_device_setter). E.g. /job:ps/task:0 could get W1 and b1, and /job:ps/task:1 could get W2 and b2. The device function has to be deterministic in assigning variables to parameter servers, since it is called every time a worker replica is instantiated, and all workers need to agree on which ps holds which variables.
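A minimal sketch of using such a device function directly (cluster addresses and variable shapes are made up; by default the ps_strategy round-robins variables over the ps tasks):

    import tensorflow as tf  # TF 1.x API

    cluster = tf.train.ClusterSpec({
        "ps": ["ps0.example.com:2222", "ps1.example.com:2222"],
        "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
    })

    # The returned device function places Variables on the ps tasks (round-robin
    # by default, or according to a custom ps_strategy) and all other ops on the
    # given worker device.
    with tf.device(tf.train.replica_device_setter(
            cluster=cluster, worker_device="/job:worker/task:2")):
        x = tf.placeholder(tf.float32, [None, 784])
        W1 = tf.get_variable("W1", [784, 128])      # e.g. ends up on /job:ps/task:0
        b1 = tf.get_variable("b1", [128])           # e.g. ends up on /job:ps/task:1
        hidden = tf.nn.relu(tf.matmul(x, W1) + b1)  # computed on /job:worker/task:2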

The tf.(contrib.)learn libraries use between-graph replication. This means that each worker replica builds a separate graph, with the non-Variable ops assigned to that worker: the worker with task index 2 assigns its ops to /job:worker/task:2, and variables to /job:ps (which specific ps task is determined by ps_strategy). Each worker replica therefore computes the ops (loss value and gradients) itself and sends the resulting variable updates (gradients) to the particular parameter servers that are responsible for holding those variables.
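A rough sketch of between-graph replication with the low-level API (hostnames, shapes and the optimizer are illustrative): every worker process runs the same code with its own task_index and builds its own copy of the graph.

    import tensorflow as tf  # TF 1.x API

    cluster = tf.train.ClusterSpec({
        "ps": ["ps0.example.com:2222"],
        "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
    })

    def build_worker_graph(task_index):
        # Each worker process calls this itself (between-graph replication),
        # so each one ends up with a separate graph.
        server = tf.train.Server(cluster, job_name="worker", task_index=task_index)
        with tf.device(tf.train.replica_device_setter(
                cluster=cluster,
                worker_device="/job:worker/task:%d" % task_index)):
            x = tf.placeholder(tf.float32, [None, 784])
            y = tf.placeholder(tf.float32, [None, 10])
            W = tf.get_variable("W", [784, 10])   # lives on /job:ps/task:0
            b = tf.get_variable("b", [10])        # lives on /job:ps/task:0
            loss = tf.losses.softmax_cross_entropy(y, tf.matmul(x, W) + b)
            # The loss and gradients are computed on this worker; the resulting
            # updates are sent to the ps task that owns each variable.
            train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
        return server, train_op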

If you didn't have a mechanism to assign variables/ops to devices, it would not be clear which replica should hold which variables and ops. Assigning specific devices explicitly may also be needed if you have several GPUs on a worker replica: even though your variables are stored on the parameter servers, you need to create the compute-intensive part of the graph once for each of your GPUs, explicitly assigning the created ops to the relevant GPU.
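For the multi-GPU case, a common pattern looks roughly like the sketch below (the variable_on_cpu helper and the shapes are made up): one "tower" of compute ops per GPU, while the shared variables stay off the GPUs.

    import tensorflow as tf  # TF 1.x API

    def variable_on_cpu(name, shape):
        # Hypothetical helper: keep shared variables off the GPUs (on the local
        # CPU here; in the distributed case the ps tasks play this role).
        with tf.device("/cpu:0"):
            return tf.get_variable(name, shape)

    x = tf.placeholder(tf.float32, [None, 784])
    x_splits = tf.split(x, 2, axis=0)        # one shard of the batch per GPU

    tower_logits = []
    for i in range(2):
        # Pin the compute-heavy ops of each tower to a specific local GPU;
        # reuse=(i > 0) makes both towers share the same variables.
        with tf.device("/gpu:%d" % i), tf.variable_scope("model", reuse=(i > 0)):
            W = variable_on_cpu("W", [784, 10])
            b = variable_on_cpu("b", [10])
            tower_logits.append(tf.matmul(x_splits[i], W) + b)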

Mattias Arro
  • Thanks for your detailed answer, @mattias! Just to clarify: the tf.(contrib.)learn libraries handle the device assignment for me, but if I have special requirements (e.g. model parallelism, several GPUs, etc.) then I need to assign the devices explicitly myself? Did I get that right? – Tobias Oct 30 '17 at 21:44
  • Yes, that's correct. If the above answered your question, please upvote :) – Mattias Arro Oct 31 '17 at 01:00