
I'm running training with TF r1.8.0 on a 4-GPU machine, and I'm trying to replace my existing training code with tf.data and the high-level TF Estimator. My previous code largely followed the multi-GPU CIFAR10 example code found here: https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10_multi_gpu_train.py

When I replaced the existing (queue-based) input pipeline with tf.data, I sharded the dataset and created one iterator per device (following the advice in the answer to this question: How does one move data to multiple GPU towers using Tensorflow's Dataset API - specifically, I'm using approach #3), and everything worked out just fine.

Now, in order to take advantage of the MirroredStrategy in tf.contrib.distribute, it looks like I have to switch over to Estimators (unless I'm mistaken?). My question is this: do I still need to shard my dataset based on the number of GPUs I'll be using, or do I write it as though I'm on a single device and trust the Estimator to split each batch across the GPUs? I'm struggling to understand what the Estimator is actually doing under the hood, since the entire training loop is abstracted away.

If this has been documented clearly somewhere or asked before, I apologize in advance! FWIW, my current input pipeline looks like this:

def input_fn(tfrecords_dirpath, num_gpus, batch_size, 
             num_epochs, gpu_device, gpu_index):

    tfrecord_filepaths = tf.data.Dataset.list_files('{}/*.tfrecord'.format(tfrecords_dirpath))
    dataset = tf.data.TFRecordDataset(tfrecord_filepaths, num_parallel_reads=int(64 / num_gpus))

    dataset = dataset.shard(num_gpus, gpu_index)

    # use fused operations (shuffle_and_repeat, map_and_batch)
    dataset = dataset.apply(tf.contrib.data.shuffle_and_repeat(10000, num_epochs))
    dataset = dataset.apply(tf.contrib.data.map_and_batch(lambda x: parse_record(x), batch_size))

    # stage batches for processing by loading them pre-emptively on the GPU
    dataset = dataset.apply(tf.contrib.data.prefetch_to_device(gpu_device))

    iterator = dataset.make_one_shot_iterator()
    images_batch, labels_batch = iterator.get_next()

    return images_batch, labels_batch

and when I start training, I replicate the model in each GPU and aggregate losses:

# create a separate inference graph in every GPU
gpu_devices = ['/gpu:{}'.format(i) for i in range(num_gpus)]
with tf.variable_scope(tf.get_variable_scope()):
    for i, gpu_device in enumerate(gpu_devices):

        # create a dataset and iterator per GPU
        image_batch, label_batch = input_fn(tfrecords_dirpath, num_gpus, batch_size_per_tower, 
                                            num_epochs, gpu_device, i)
        with tf.device(gpu_device):
            with tf.name_scope('{}_{}'.format('tower', i)) as scope:

                # run inference and compute tower losses
                ...
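
(For context, the elided part follows the CIFAR10 example's pattern of computing per-tower gradients and averaging them before a single apply step. The helper below is an illustrative sketch of that aggregation, essentially the tutorial's average_gradients and not my exact code; tower_grads holds one optimizer.compute_gradients() result per tower.)

def average_gradients(tower_grads):
    # tower_grads: list with one entry per tower, each entry being the list of
    # (gradient, variable) pairs returned by optimizer.compute_gradients()
    average_grads = []
    for grad_and_vars in zip(*tower_grads):
        # grad_and_vars is ((grad_gpu0, var), ..., (grad_gpuN, var)) for one variable
        grads = [tf.expand_dims(g, 0) for g, _ in grad_and_vars]
        grad = tf.reduce_mean(tf.concat(grads, axis=0), axis=0)
        average_grads.append((grad, grad_and_vars[0][1]))  # variables are shared across towers
    return average_grads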

Thanks!

1 Answer


On a single machine you don't need to shard. When you use tf.distribute.MirroredStrategy with tf.estimator.train_and_evaluate, the dataset returned by your input_fn should be batched with the per-GPU batch size, and the Estimator will run one step of it on each GPU per iteration. So if the per-GPU batch size is B and the number of GPUs is N, the global batch size will be N*B.
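
A minimal sketch of that wiring, assuming TF r1.8 (where the strategy still lives under tf.contrib.distribute) and reusing the parse_record from your question; my_network, the file pattern and the hyperparameters are placeholders:

import tensorflow as tf

PER_GPU_BATCH_SIZE = 32   # batch size per GPU; global batch size = 32 * num_gpus

def input_fn():
    # Build the dataset exactly as for a single device: no shard(), no per-GPU iterators.
    files = tf.data.Dataset.list_files('/path/to/tfrecords/*.tfrecord')
    dataset = tf.data.TFRecordDataset(files, num_parallel_reads=16)
    dataset = dataset.apply(tf.contrib.data.shuffle_and_repeat(10000))
    dataset = dataset.apply(tf.contrib.data.map_and_batch(parse_record, PER_GPU_BATCH_SIZE))
    return dataset   # return the Dataset itself; the Estimator handles the iterator

def model_fn(features, labels, mode):
    # Single-tower model; MirroredStrategy takes care of the replication.
    logits = my_network(features)   # placeholder for your model builder
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
    train_op = optimizer.minimize(loss, global_step=tf.train.get_or_create_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

strategy = tf.contrib.distribute.MirroredStrategy(num_gpus=4)
run_config = tf.estimator.RunConfig(train_distribute=strategy)
estimator = tf.estimator.Estimator(model_fn=model_fn, config=run_config)

train_spec = tf.estimator.TrainSpec(input_fn=input_fn, max_steps=100000)
eval_spec = tf.estimator.EvalSpec(input_fn=input_fn)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)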

Ali