All code assumes TensorFlow 1.3 and Python 3.x.
We are working on a GAN algorithm which has an interesting loss function.
Stage 1 - Compute only the completion/generator loss portion of the network. Iterates over the completion portion of the GAN for X iterations.
Stage 2 - Compute only the discriminator loss portion of the network. Iterates over the discriminator portion for Y iterations (without training the Stage 1 / generator portion).
Stage 3 - Compute the full loss on the network. Iterates over both completion and discriminator for Z iterations (training the entire network).
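Concretely, the overall schedule looks roughly like the sketch below. This is only an outline: gen_only_train, discm_only_train, and full_train are the per-stage train ops defined further down, and next_feed_dict() is a hypothetical stand-in for our input pipeline.

    # Rough sketch of the three-stage schedule (single-GPU version).
    # next_feed_dict() is illustrative, not real code from our project.
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())

        # Stage 1: completion/generator only, X iterations.
        for _ in range(X):
            sess.run(gen_only_train, feed_dict=next_feed_dict())

        # Stage 2: discriminator only, Y iterations.
        for _ in range(Y):
            sess.run(discm_only_train, feed_dict=next_feed_dict())

        # Stage 3: full network, Z iterations.
        for _ in range(Z):
            sess.run(full_train, feed_dict=next_feed_dict())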
We have this working on a single GPU. We want to make it work on multiple GPUs, since training times are long.
We have looked at Tensorflow/models/tutorials/Images/cifar10/cifar10_multi_gpu_train.py, which talks about tower loss, averaging the towers together, computing the gradients on the GPUs and then applying them on the CPU. This is a great start. However, since our loss is more complicated, it has complicated everything a bit for us.
The code is decently complicated, but it is roughly similar to this: https://github.com/timsainb/Tensorflow-MultiGPU-VAE-GAN (that repo won't run because it was written around Tensorflow 0.1 and has some oddities I haven't gotten working, but it should give you an idea of what we're doing).
When we compute gradients, it looks something like this (pseudocode to try to highlight the important portions):
for i in range(num_gpus):
    with tf.device('/gpu:%d' % gpus[i]):
        with tf.name_scope('Tower_%d' % gpus[i]) as scope:
            with tf.variable_scope("generator"):
                generator = build_generator()

            with tf.variable_scope("discriminator"):
                with tf.variable_scope("real_discriminator"):
                    real_discriminator = build_discriminator(x)
                with tf.variable_scope("fake_discriminator", reuse=True):
                    fake_discriminator = build_discriminator(generator)

            gen_only_loss, discm_only_loss, full_loss = build_loss(
                generator, real_discriminator, fake_discriminator)

            # Share variables with the towers built after this one.
            tf.get_variable_scope().reuse_variables()

            # Stage 1: gradients for the generator/completion-only loss.
            gen_only_grads = gen_only_opt.compute_gradients(gen_only_loss)
            tower_gen_only_grads.append(gen_only_grads)

            # Stage 2: gradients for the discriminator-only loss,
            # restricted to the discriminator variables.
            discm_only_train_vars = tf.get_collection(
                tf.GraphKeys.TRAINABLE_VARIABLES, "discriminator")
            discm_only_train_vars = discm_only_train_vars + tf.get_collection(
                tf.GraphKeys.TRAINABLE_RESOURCE_VARIABLES, "discriminator")

            discm_only_grads = discm_only_opt.compute_gradients(
                discm_only_loss, var_list=discm_only_train_vars)
            tower_discm_only_grads.append(discm_only_grads)

            # Stage 3: gradients for the full loss over all variables.
            full_grads = full_opt.compute_gradients(full_loss)
            tower_full_grads.append(full_grads)
# average_gradients is the same code from cifar10_multi_gpu_train.py.
# We haven't changed it. It just iterates over the gradients and averages
# them... this is part of the problem...
gen_only_grads = average_gradients(tower_gen_only_grads)
gen_only_train = gen_only_opt.apply_gradients(gen_only_grads,
                                              global_step=global_step)

discm_only_grads = average_gradients(tower_discm_only_grads)
discm_only_train = discm_only_opt.apply_gradients(discm_only_grads,
                                                  global_step=global_step)

full_grads = average_gradients(tower_full_grads)
full_train = full_opt.apply_gradients(full_grads, global_step=global_step)
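For reference, here is roughly what that helper looks like (a paraphrase of the cifar10 tutorial version, not a verbatim copy of our code). Note that it assumes every tower hands it a real gradient tensor for every variable, which matters for Question 2 below.

    import tensorflow as tf

    def average_gradients(tower_grads):
        """Average (gradient, variable) pairs across towers.

        tower_grads is a list with one entry per tower, each entry being
        the list of (gradient, variable) pairs from compute_gradients.
        """
        average_grads = []
        for grad_and_vars in zip(*tower_grads):
            # grad_and_vars is ((grad_gpu0, var), (grad_gpu1, var), ...)
            grads = [tf.expand_dims(g, 0) for g, _ in grad_and_vars]
            grad = tf.reduce_mean(tf.concat(grads, 0), 0)
            # The variable is shared across towers, so take the first one.
            average_grads.append((grad, grad_and_vars[0][1]))
        return average_grads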
If we call only "compute_gradients(full_loss)", the algorithm works properly on multiple GPUs. This is pretty much equivalent to the code in the cifar10_multi_gpu_train.py example. The tricky part comes when we need to restrict the network in Stage 1 or 2.
compute_gradients(full_loss) has a var_list parameter with a default value of None, which means it trains all the variables. How does it know not to train Tower_0 variables when in Tower_1? I ask because, when we deal with compute_gradients(discm_only_loss, var_list=discm_only_train_vars), I need to know how to gather up the correct variables to restrict training to that portion of the network. I found one thread talking about this, but found it to be inaccurate/incomplete: "freeze" some variables/scopes in tensorflow: stop_gradient vs passing variables to minimize.
The reason being that, if you look at the code in compute_gradients, when None is passed in, var_list is filled in with a combination of trainable variables and trainable resource variables. So that's how I've limited it as well. This all works properly if we don't attempt to split across multiple GPUs.
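In other words, when var_list is None, compute_gradients defaults to something roughly like the sketch below (a paraphrase based on the TF 1.3 optimizer source, not the exact code; default_var_list is just an illustrative name):

    # Rough paraphrase of compute_gradients' default var_list handling.
    def default_var_list():
        return (tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES) +
                tf.get_collection(tf.GraphKeys.TRAINABLE_RESOURCE_VARIABLES))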
Question 1: Now that I've split the network by towers, am I responsible for gathering up the current tower's variables as well? Do I need to add a line like this?
discm_only_train_vars = tf.get_collection(
    tf.GraphKeys.TRAINABLE_VARIABLES, "Tower_{}/discriminator".format(i))
discm_only_train_vars = discm_only_train_vars + tf.get_collection(
    tf.GraphKeys.TRAINABLE_RESOURCE_VARIABLES, "Tower_{}/discriminator".format(i))
in order to train the proper variables for that tower (and to ensure I don't miss training any of those variables)?
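For what it's worth, a quick way to see what a scope filter like that actually returns (just an illustrative check, run after the towers are built; the scope string mirrors the guess above and is not verified):

    # Illustrative check: print whatever a scope-filtered collection
    # picks up for one tower.
    for v in tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
                               scope="Tower_0/discriminator"):
        print(v.name)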
Question 2: Probably the same answer as Question 1. Getting "compute_gradients(gen_only_loss)" to work is a bit harder... in the non-towered version, gen_only_loss never touched the discriminator, so it activated only the tensors in the graph that it needed and everything was fine. However, in the towered version, when I call "compute_gradients", it returns gradients for tensors it hasn't activated yet, so some of the entries are [(None, tf.Variable), (None, tf.Variable)]. This causes average_gradients to crash because it can't convert a None value to a Tensor. This makes me think I need to restrict these as well.
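Concretely, the entries look something like this (illustrative snippet only):

    # Inspect the (gradient, variable) pairs from the generator-only loss;
    # the discriminator variables come back with a gradient of None, which
    # is what later breaks average_gradients.
    grads = gen_only_opt.compute_gradients(gen_only_loss)
    for g, v in grads:
        print(v.op.name, g)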
The confusing thing about all of this is that neither the cifar example nor my full_loss example cares about training on specific towers, but I'm guessing that once I specify a var_list, whatever magic compute_gradients was using to know which variables to train on which towers disappears? Do I need to worry about grabbing any other variables?