I am struggling with the following: I am creating a tf.data.Dataset using the from_generator method, and I build it on the CPU because I don't want to overload my GPU memory.
The dataset consists of tuples that contain a 1-D tf.bool mask (tf.Tensor) with fixed length and a 2-D tf.float32 matrix (tf.Tensor) with variable size. The loss function is decorated as follows, so I don't think the variable size is the problem:
@tf.function(experimental_relax_shapes=True)
Ideally, the dataset is kept on the CPU, but then prefetched onto the GPU.
def gen():
    for i, j in zip(mask_list, wmat_list):
        yield i, j

dataset = tf.data.Dataset.from_generator(gen, output_types=(tf.bool, tf.float32))
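For reference, here is a self-contained stand-in for this pipeline. The data below is dummy data: the mask length of 8 and matrix width of 4 are made up, and passing output_shapes with a None dimension is optional and only documents the variable axis.

import numpy as np
import tensorflow as tf

# Dummy placeholders for my real data: fixed-length bool masks and
# variable-height float32 matrices (the sizes are invented for this sketch).
mask_list = [np.random.rand(8) > 0.5 for _ in range(100)]
wmat_list = [np.random.rand(np.random.randint(5, 50), 4).astype(np.float32)
             for _ in range(100)]

def gen():
    for i, j in zip(mask_list, wmat_list):
        yield i, j

dataset = tf.data.Dataset.from_generator(
    gen,
    output_types=(tf.bool, tf.float32),
    # output_shapes is optional; None marks the variable-sized dimension.
    output_shapes=(tf.TensorShape([8]), tf.TensorShape([None, 4])),
)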
The main training loop currently relies on tf.identity to move the data to the GPU, which is inefficient, as shown in the TensorBoard screenshot below: roughly 70% of the time is spent loading the data and moving it to the GPU.
for b, (mask, wmat) in enumerate(dataset):
    with tf.GradientTape() as tape:
        mask = tf.identity(mask)
        wmat = tf.identity(wmat)
        mean_error, loss = self.model.loss(mask, wmat)
    epoch_loss += loss.numpy()
    epoch_mean_error += mean_error.numpy()
I have tried the prefetch_to_device function, but it did not move the data onto the GPU, as verified by printing e.g. mask.device in the training loop.
gpu_transform = tf.data.experimental.prefetch_to_device('/gpu')
dataset.apply(gpu_transform)
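For completeness, this is roughly how I checked the placement (an illustrative snippet, not my exact code; in my run both tensors still reported a CPU device):

# Illustrative check: pull one element and inspect where its tensors live.
for mask, wmat in dataset.take(1):
    print(mask.device)  # in my run this still shows .../device:CPU:0
    print(wmat.device)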
To me this resembles this bug: https://github.com/tensorflow/tensorflow/issues/30929 . However, that issue is marked as solved and is over a year old.
I am running TF 2.3 using the official Docker image.