I've recently switched my modeling framework to use custom Tensorflow Estimators and Datasets, and am quite happy overall with this workflow.
However, I've just noticed an issue with how my dataset_input_fn loads data from tfrecords. My input function is modeled after the example in the Tensorflow documentation. The issue arises when I have more examples than will fit into RAM. If I have 1e6 examples and set my shuffle buffer_size to 1e5, a subset of 1e5 examples is selected once, shuffled, and then iterated on, which means my model is only ever trained on 10% of my overall dataset. The code that sets up this behavior is borrowed exactly from the Tensorflow documentation example:
dataset = dataset.map(parser)  # parse each serialized tf.Example into tensors
dataset = dataset.shuffle(buffer_size=10000)  # shuffle via a 10,000-example buffer
dataset = dataset.batch(32)  # batches of 32 examples
dataset = dataset.repeat(num_epochs)  # repeat for num_epochs
iterator = dataset.make_one_shot_iterator()
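For context, here is a simplified sketch of the surrounding dataset_input_fn; the file pattern, feature names, and parser schema below are placeholders rather than my real pipeline:

import tensorflow as tf

def parser(serialized_example):
    # placeholder schema; my real features differ
    features = tf.parse_single_example(
        serialized_example,
        features={
            'feature': tf.FixedLenFeature([], tf.string),
            'label': tf.FixedLenFeature([], tf.int64),
        })
    feature = tf.decode_raw(features['feature'], tf.float32)
    label = tf.cast(features['label'], tf.int32)
    return {'feature': feature}, label

def dataset_input_fn(num_epochs=1):
    filenames = tf.gfile.Glob('/path/to/*.tfrecord')  # placeholder path
    dataset = tf.data.TFRecordDataset(filenames)
    dataset = dataset.map(parser)
    dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.batch(32)
    dataset = dataset.repeat(num_epochs)
    iterator = dataset.make_one_shot_iterator()
    features, labels = iterator.get_next()
    return features, labels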
My question: is it possible to keep filling the shuffle buffer with new examples from outside the initial 1e5 as I train? Is this kind of functionality supported with a one_shot_iterator, or do I need to use an initializable iterator?
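For reference, my (possibly naive) understanding of the initializable-iterator pattern outside of an Estimator is something like the following, reusing the placeholder filenames, parser, and num_epochs from the sketch above; I'm not sure how it would plug into an input_fn:

dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(parser)
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(32)

iterator = dataset.make_initializable_iterator()
next_batch = iterator.get_next()

with tf.Session() as sess:
    for _ in range(num_epochs):
        # re-running the initializer restarts (and re-shuffles) the dataset each epoch
        sess.run(iterator.initializer)
        while True:
            try:
                sess.run(next_batch)
            except tf.errors.OutOfRangeError:
                break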
Thanks!