
It's unclear to me what the buffer_size parameter in tf.TFRecordDataset does. Let's say we have the following code:

dataset = dataset.shuffle(buffer_size=10000).repeat().batch(batch_size)

Does this mean that only the first 10k samples will be used and repeated forever, or will I go through the entire dataset? If not, what does it do exactly? And what about this code?

dataset = dataset.repeat().shuffle(buffer_size=10000).batch(batch_size)

I've noticed this post, but it doesn't say anything about buffer_size.

  • There is a standard example in the official document. I think the `buffer_size` is similar to the memory capacity for data prefetching. – mining Feb 15 '18 at 22:31

1 Answer


This answer might be useful to better understand the buffer_size parameter of the shuffle method.

In short, the dataset keeps a buffer of buffer_size elements. Each time an element is requested, it is sampled uniformly at random from this buffer, and the freed slot is refilled with the next element from the input dataset.

So a buffer size of 1 is equivalent to no shuffling at all, while a buffer as large as the whole dataset gives a full, uniform shuffle.
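To make that concrete, here is a small sketch (using the TF 2.x eager API, which differs slightly from the 1.x code in the question) comparing a buffer of 1, a small "local" buffer, and a buffer covering the whole dataset:

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10)

# buffer_size=1: the buffer holds a single element, so every "random" pick is forced;
# the order comes out unchanged, i.e. effectively no shuffling.
print(list(dataset.shuffle(buffer_size=1).as_numpy_iterator()))
# -> [0, 1, 2, ..., 9]

# buffer_size=3: only a local shuffle; an element can move at most a few positions,
# so early elements still tend to appear early.
print(list(dataset.shuffle(buffer_size=3).as_numpy_iterator()))

# buffer_size=10 (the full dataset): every element is in the buffer before sampling,
# so the result is a uniform random permutation of the whole dataset.
print(list(dataset.shuffle(buffer_size=10).as_numpy_iterator()))
```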


To understand the right order between shuffling and repeating the dataset, please look at the official performance guide.

The best practice is usually to shuffle and then repeat: that way each epoch is a complete pass over the dataset in a fresh random order. If you repeat first, elements from different epochs get mixed together in the shuffle buffer, so within one "epoch" of steps some examples may appear twice and others not at all.
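As a rough illustration of that ordering (a sketch only; the TF 2.x API is assumed and the "data.tfrecord" filename is a placeholder):

```python
import tensorflow as tf

batch_size = 32
dataset = tf.data.TFRecordDataset(["data.tfrecord"])  # hypothetical input file

# Recommended order: shuffle within each epoch, then repeat across epochs, then batch.
dataset = (dataset
           .shuffle(buffer_size=10000)   # reshuffles the data each epoch
           .repeat()                     # repeat after shuffling: epoch boundaries are preserved
           .batch(batch_size))

# With repeat() placed *before* shuffle(), as in the second snippet of the question,
# the shuffle buffer spans epoch boundaries and mixes examples from different epochs.
```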

Olivier Moindrot