I'm training with TF r1.8.0 on a 4-GPU machine, and I'm trying to replace my existing training code with tf.data and the high-level TF Estimator. My previous code largely followed the multi-GPU CIFAR10 example code found here: https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10_multi_gpu_train.py
When I replaced the existing (queue-based) input pipeline with tf.data, I sharded the dataset and created one iterator per device (following the advice in the answer to this question: How does one move data to multiple GPU towers using Tensorflow's Dataset API - specifically, I'm using approach #3), and everything worked out just fine.
Now, to take advantage of the MirroredStrategy in tf.contrib.distribute, it looks like I have to switch over to Estimators (unless I'm mistaken?). My question is this: do I still need to shard my dataset based on the number of GPUs I'll be using, or do I write it as though I'm on a single device and trust the Estimator to split each batch across the GPUs? I'm struggling to understand what the Estimator is actually doing under the hood, since the entire training loop is abstracted away.
If this has been documented clearly somewhere or asked before, I apologize in advance! FWIW, my current input pipeline looks like this:
def input_fn(tfrecords_dirpath, num_gpus, batch_size,
             num_epochs, gpu_device, gpu_index):
    tfrecord_filepaths = tf.data.Dataset.list_files('{}/*.tfrecord'.format(tfrecords_dirpath))
    dataset = tf.data.TFRecordDataset(tfrecord_filepaths, num_parallel_reads=int(64 / num_gpus))
    dataset = dataset.shard(num_gpus, gpu_index)

    # use fused operations (shuffle_and_repeat, map_and_batch)
    dataset = dataset.apply(tf.contrib.data.shuffle_and_repeat(10000, num_epochs))
    dataset = dataset.apply(tf.contrib.data.map_and_batch(lambda x: parse_record(x), batch_size))

    # stage batches for processing by loading them pre-emptively on the GPU
    dataset = dataset.apply(tf.contrib.data.prefetch_to_device(gpu_device))

    iterator = dataset.make_one_shot_iterator()
    images_batch, labels_batch = iterator.get_next()
    return images_batch, labels_batch
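For context on what the `dataset.shard(num_gpus, gpu_index)` call above buys me: each tower keeps every `num_gpus`-th element, so the towers see disjoint, non-overlapping slices of the data. A minimal pure-Python sketch of that selection rule (the `shard` helper here is just an illustration, not TF code):

```python
def shard(elements, num_shards, index):
    """Mimic tf.data.Dataset.shard: keep elements whose position
    modulo num_shards equals index (plain-Python illustration)."""
    return [x for pos, x in enumerate(elements) if pos % num_shards == index]

records = list(range(10))
towers = [shard(records, 4, i) for i in range(4)]
# towers[0] -> [0, 4, 8], towers[1] -> [1, 5, 9], etc.;
# every record lands in exactly one tower, none are duplicated
```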
and when I start training, I replicate the model on each GPU and aggregate the tower losses:
# create a separate inference graph in every GPU
gpu_devices = ['/gpu:{}'.format(i) for i in range(num_gpus)]
with tf.variable_scope(tf.get_variable_scope()):
    for i, gpu_device in enumerate(gpu_devices):
        # create a dataset and iterator per GPU
        image_batch, label_batch = input_fn(tfrecords_dirpath, num_gpus, batch_size_per_tower,
                                            num_epochs, gpu_device, i)
        with tf.device(gpu_device):
            with tf.name_scope('{}_{}'.format('tower', i)) as scope:
                # run inference and compute tower losses
                ...
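For reference, the Estimator switch I have in mind would look roughly like this (a sketch assuming TF r1.8's contrib API; `model_fn`, `single_device_input_fn`, and `num_train_steps` are placeholders for my own code), with no explicit `.shard()` calls in the input function:

```python
import tensorflow as tf

# MirroredStrategy handles in-graph replication across the local GPUs
distribution = tf.contrib.distribute.MirroredStrategy(num_gpus=4)
config = tf.estimator.RunConfig(train_distribute=distribution)

estimator = tf.estimator.Estimator(model_fn=model_fn, config=config)

# input_fn written as if for a single device -- no per-tower sharding
estimator.train(input_fn=single_device_input_fn, steps=num_train_steps)
```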
Thanks!