If the dataset is small enough to fit in GPU memory, is it possible in TensorFlow to allocate it all on the GPU up front and then train without any data transfers between CPU and GPU? It seems to me that this is not possible with tf.data, and that the data transfer is not under the programmer's control.
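For concreteness, here is a minimal sketch of the kind of setup I have in mind (the explicit tf.device placement and the small Dense model are just illustrative assumptions, not my actual code):

```python
import tensorflow as tf

# CIFAR-10 is ~600 MB as float32, so it fits comfortably on most GPUs.
(x_train, y_train), _ = tf.keras.datasets.cifar10.load_data()
x_train = x_train.astype("float32") / 255.0

# Attempt to place the whole dataset on the GPU once, up front.
with tf.device("/GPU:0"):
    x_gpu = tf.constant(x_train)
    y_gpu = tf.constant(y_train)

# Even so, tf.data appears to run its pipeline on the host,
# so I cannot tell whether the batches actually stay on the device.
dataset = tf.data.Dataset.from_tensor_slices((x_gpu, y_gpu)).batch(128)

# Placeholder model, just to make the example self-contained.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(32, 32, 3)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(10),
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
model.fit(dataset, epochs=5)
```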
Analyzing the GPU workload during training on CIFAR-10, utilization only reaches about 75%, but I would expect it to reach 100% given that the dataset fits in GPU memory. Profiling with TensorBoard, I also see a lot of Send operations. (I found a similar question here, but it is quite old and predates tf.data.)