I've recently switched my modeling framework to use custom Tensorflow Estimators and Datasets, and am quite happy overall with this workflow.
However, I've just noticed an issue with how my dataset_input_fn loads data from tfrecords. My input function is modeled after the example in the Tensorflow documentation. The issue arises when I have more examples than will fit into RAM. If I have 1e6 examples and set my shuffle buffer_size to 1e5, a subset of 1e5 examples is selected once, shuffled, and then iterated on, which means my model is only ever trained on 10% of my overall dataset. The code that sets up this behavior is borrowed exactly from the Tensorflow documentation example:
dataset = dataset.map(parser)  # parse each serialized tf.Example into tensors
dataset = dataset.shuffle(buffer_size=10000)  # shuffle via a 10,000-example buffer
dataset = dataset.batch(32)  # batches of 32 examples
dataset = dataset.repeat(num_epochs)  # repeat for num_epochs
iterator = dataset.make_one_shot_iterator()
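For context, here is a simplified sketch of the surrounding dataset_input_fn; the file pattern, feature names, and parser schema below are placeholders rather than my real pipeline:

import tensorflow as tf

def parser(serialized_example):
    # placeholder schema; my real features differ
    features = tf.parse_single_example(
        serialized_example,
        features={
            'feature': tf.FixedLenFeature([], tf.string),
            'label': tf.FixedLenFeature([], tf.int64),
        })
    feature = tf.decode_raw(features['feature'], tf.float32)
    label = tf.cast(features['label'], tf.int32)
    return {'feature': feature}, label

def dataset_input_fn(num_epochs=1):
    filenames = tf.gfile.Glob('/path/to/*.tfrecord')  # placeholder path
    dataset = tf.data.TFRecordDataset(filenames)
    dataset = dataset.map(parser)
    dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.batch(32)
    dataset = dataset.repeat(num_epochs)
    iterator = dataset.make_one_shot_iterator()
    features, labels = iterator.get_next()
    return features, labels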
My question: is it possible to keep filling the shuffle buffer with new examples from outside the initial 1e5 as I train? Is this kind of functionality supported with a one_shot_iterator, or do I need to use an initializable iterator?
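For reference, my (possibly naive) understanding of the initializable-iterator pattern outside of an Estimator is something like the following, reusing the placeholder filenames, parser, and num_epochs from the sketch above; I'm not sure how it would plug into an input_fn:

dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(parser)
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(32)

iterator = dataset.make_initializable_iterator()
next_batch = iterator.get_next()

with tf.Session() as sess:
    for _ in range(num_epochs):
        # re-running the initializer restarts (and re-shuffles) the dataset each epoch
        sess.run(iterator.initializer)
        while True:
            try:
                sess.run(next_batch)
            except tf.errors.OutOfRangeError:
                break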
Thanks!