
Edit:

To clarify why this question is different from the suggested duplicates: this question follows up on them, asking what exactly Keras is doing with the techniques they describe. The suggested duplicates pass a Dataset API make_one_shot_iterator() to model.fit; my follow-up is that make_one_shot_iterator() can only go through the dataset once, yet the solutions given specify several epochs.


This is a follow-up to these SO questions:

How to Properly Combine TensorFlow's Dataset API and Keras?

Tensorflow keras with tf dataset input

Using tf.data.Dataset as training input to Keras model NOT working

Where "Starting from Tensorflow 1.9, one can pass tf.data.Dataset object directly into keras.Model.fit() and it would act similar to fit_generator". Each example has a TF dataset one shot iterator fed into Kera's model.fit.

An example is given below:

# Load mnist training data
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
training_set = tfdata_generator(x_train, y_train,is_training=True)

model = # your keras model here              
model.fit(
    training_set.make_one_shot_iterator(),
    steps_per_epoch=len(x_train) // 128,
    epochs=5,
    verbose = 1)
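
For reference, the tfdata_generator helper isn't shown above; in the linked answers it builds a tf.data pipeline along these lines (a hypothetical sketch, assuming the usual shuffle/batch/repeat steps; the buffer and batch sizes are illustrative, not from the original):

import tensorflow as tf

def tfdata_generator(images, labels, is_training, batch_size=128):
    # Build a tf.data pipeline from in-memory numpy arrays.
    dataset = tf.data.Dataset.from_tensor_slices((images, labels))
    if is_training:
        dataset = dataset.shuffle(buffer_size=len(images))  # reshuffle each pass
    dataset = dataset.batch(batch_size)
    dataset = dataset.repeat()  # repeat indefinitely across epochs
    return dataset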

However, according to the TensorFlow Dataset API guide (https://www.tensorflow.org/guide/datasets):

A one-shot iterator is the simplest form of iterator, which only supports iterating once through a dataset

So it's only good for one epoch. However, the code in the SO questions specifies several epochs, with the example above specifying 5.

Is there any explanation for this contradiction? Does Keras somehow know that when the one-shot iterator has gone through the dataset, it can re-initialize and shuffle the data?

SantoshGupta7
  • @Sharky This is a follow-up to the links you included in your comments, on how exactly Keras handles the methods described in those links. Also, the solutions in your 2nd link use iterators from the dataset API as well. – SantoshGupta7 Mar 31 '19 at 19:41
  • See Dat Nguyen's answer. It clearly states the current recommended way of using the dataset API with Keras. – Sharky Mar 31 '19 at 20:13
  • I have read Dat Nguyen's answer; my question is actually a follow-up on that answer, on the current recommended way of using the dataset API with Keras. I have updated the text in my original question to further clarify this. – SantoshGupta7 Mar 31 '19 at 20:18
  • Ok, sorry, didn't mention this. – Sharky Mar 31 '19 at 20:26

1 Answer


You can simply pass the dataset object to model.fit; Keras will handle the iteration. Consider one of the pre-made datasets:

train, test = tf.keras.datasets.cifar10.load_data()
dataset = tf.data.Dataset.from_tensor_slices((train[0], train[1]))

This creates a dataset object from the training data of the cifar10 dataset. In this case a parse function isn't needed. If you create a dataset from paths to images or from a list of numpy arrays, you'll need one.

dataset = tf.data.Dataset.from_tensor_slices((image_path, labels_path)) 

In that case, you'll need a function to load the actual data from the filename. Numpy arrays can be handled the same way, just without tf.read_file:

def parse_func(filename):
    f = tf.read_file(filename)
    image = tf.image.decode_image(f)
    label = ...  # get label from filename
    return image, label
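
If the data is a numpy array already in memory, a parse function can follow the same pattern without the file read; a minimal sketch (the normalization is only an illustrative choice):

def parse_func_numpy(image, label):
    # The array already holds the pixel data, so no tf.read_file is needed.
    image = tf.cast(image, tf.float32) / 255.0  # example preprocessing
    return image, label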

Then you can shuffle, batch, and map any parse function onto this dataset. You can control how many examples are preloaded with the shuffle buffer size. Repeat controls the epoch count and is better left at None (the default), so the dataset repeats indefinitely. You can use either the plain batch function or combine mapping and batching:

dataset = dataset.shuffle(buffer_size).repeat()
dataset = dataset.apply(tf.data.experimental.map_and_batch(
    map_func=parse_func, batch_size=batch_size,
    num_parallel_batches=num_parallel_batches))

Then the dataset object can be passed to model.fit: model.fit(dataset, epochs=epochs, steps_per_epoch=steps_per_epoch). Note that steps_per_epoch is a necessary parameter in this case; it defines when to start a new epoch, so you'll have to know the epoch size in advance.
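
Putting these pieces together, a minimal end-to-end sketch (assuming the TF 1.x-era API; the batch size, shuffle buffer, and toy dense model are illustrative choices, not part of the answer above):

import tensorflow as tf

# Build an indefinitely repeating pipeline from cifar10 training data.
(x_train, y_train), _ = tf.keras.datasets.cifar10.load_data()
batch_size = 128

dataset = tf.data.Dataset.from_tensor_slices(
    (x_train.astype('float32') / 255.0, y_train))
dataset = dataset.shuffle(buffer_size=1000).repeat().batch(batch_size)

# A toy model, just to show the fit call.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(32, 32, 3)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# steps_per_epoch marks where one epoch of the repeating dataset ends.
model.fit(dataset, epochs=5, steps_per_epoch=len(x_train) // batch_size)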

Sharky
  • Thanks! So was there a mistake in the answers given in the three Stack Overflow questions linked, where they call the iterator function on the dataset object passed to `model.fit`? For example, Dat Nguyen's answer has `model.fit( training_set.make_one_shot_iterator() , ....`. Is this a mistake, and `.make_one_shot_iterator()` should not have been called, so only `model.fit( training_set , ....`? – SantoshGupta7 Apr 01 '19 at 19:07
  • It's not a mistake; it's just that the API has changed since then. Now the iterator is not needed. – Sharky Apr 01 '19 at 19:19
  • Say that I specify the batch size in the dataset using the batch function `dataset.batch(batch_size)`, would I still need to specify `steps_per_epoch`? Or do I just need to make sure that whatever batch size I specify matches the resulting `steps_per_epoch`? I'm asking because I'll actually need to use `dataset.padded_batch`, since the number of points per row varies (see the sketch after these comments). Or is Keras able to handle that somehow? – SantoshGupta7 Apr 01 '19 at 19:40
  • Batch size will be controlled by the dataset API, and steps_per_epoch is needed to let it know when to start a new epoch. You can also set repeat to None and specify epochs inside fit. – Sharky Apr 01 '19 at 19:50
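
To illustrate the padded_batch point from the comments above, a small sketch with made-up variable-length data (none of it from the thread):

import tensorflow as tf

# Toy ragged data: rows of different lengths.
sequences = [[1, 2], [3, 4, 5], [6]]
labels = [0, 1, 0]

dataset = tf.data.Dataset.from_generator(
    lambda: zip(sequences, labels),
    output_types=(tf.int32, tf.int32),
    output_shapes=(tf.TensorShape([None]), tf.TensorShape([])))

# Each batch is padded to the length of its longest sequence.
dataset = dataset.padded_batch(
    batch_size=2,
    padded_shapes=(tf.TensorShape([None]), tf.TensorShape([])))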