
I am using this sequence to read image files from disk and feed them into a TF Keras model.

    # Make dataset for training
    dataset_train = tf.data.Dataset.from_tensor_slices((file_ids_training, file_names_training))
    # Load each file's contents via a Python function
    dataset_train = dataset_train.flat_map(lambda file_id, file_name: tf.data.Dataset.from_tensor_slices(
        tuple(tf.py_func(_get_data_for_dataset, [file_id, file_name], [tf.float32, tf.float32]))))
    dataset_train = dataset_train.cache()
    dataset_train = dataset_train.shuffle(buffer_size=train_buffer_size)  # shuffle, then batch
    dataset_train = dataset_train.batch(train_batch_size)
    dataset_train = dataset_train.repeat()
    dataset_train = dataset_train.prefetch(1)
    dataset_train_iterator = dataset_train.make_one_shot_iterator()
    get_train_batch = dataset_train_iterator.get_next()

I have questions about whether this is the optimal ordering. For example, should repeat() come after shuffle() and before batch()? Should cache() come after batch()?

siby
  • Would really appreciate a clarification from @mrry or others. Specifically, I want to know the difference between keeping repeat before and after the .batch method. For example, if I keep .repeat after .batch, does it repeat the shuffled batches or the shuffled data? – siby Aug 15 '18 at 16:44
  • I would also like to know how the order affects the commands, for example what `prefetch` prefetches depending on where it appears. If I do `.batch().prefetch(2)`, does that mean 2 batches are prefetched, or still 2 samples? (These are just examples; I'm seeking a general explanation for all commands in all important orderings.) – Spenhouet Jan 18 '19 at 16:07
  • I personally don't see your question answered in the current accepted answer. As long as it stays as "answered" I don't see someone like @mrry coming in and adding a complete answer. – Spenhouet Jan 18 '19 at 16:10
  • [TF 2.0 Documentation](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/data/Dataset#prefetch): "Note: Like other Dataset methods, prefetch operates on the elements of the input dataset. It has no concept of examples vs. batches. examples.prefetch(2) will prefetch two elements (2 examples), while examples.batch(20).prefetch(2) will prefetch 2 elements (2 batches, of 20 examples each)." See the sketch below. – Sunny Sep 06 '19 at 04:33
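
To make the quoted note concrete, here is a minimal sketch (the names and sizes are illustrative, not from the question):

    examples = tf.data.Dataset.range(100)
    a = examples.prefetch(2)             # buffers 2 elements = 2 individual examples
    b = examples.batch(20).prefetch(2)   # buffers 2 elements = 2 batches of 20 examples each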

3 Answers


The answer here, Output differences when changing order of batch(), shuffle() and repeat(), suggests repeating or shuffling before batching. The order I often use is (1) shuffle, (2) repeat, (3) map, (4) batch, but it can vary based on your preferences. I use shuffle before repeat to avoid blurring epoch boundaries, and map before batch because my mapping function applies to a single example (not to a batch of examples); you can certainly write a map function that is vectorized and expects to see a batch as input.
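
A minimal sketch of that order (parse_fn, the buffer size, and the batch size are placeholder assumptions):

    dataset = tf.data.Dataset.from_tensor_slices(file_paths)
    dataset = dataset.shuffle(buffer_size=10000)  # (1) shuffle before repeat, so epoch boundaries stay crisp
    dataset = dataset.repeat()                    # (2) repeat the shuffled stream
    dataset = dataset.map(parse_fn)               # (3) per-example map
    dataset = dataset.batch(32)                   # (4) batch last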

veritessa
    Can anyone point to an example of a map that is vectorized and takes a batch as input? – maurice Feb 07 '19 at 23:28
  • Not sure if this answers your question @maurice, but tf.data.experimental.unbatch() applies to the dataset as a whole, so it is vectorised in a sense: dataset.apply(tf.data.experimental.unbatch()) – Abbas Apr 16 '20 at 16:14

I'd suggest using the following order:


    dataset = (dataset
        .cache(filename='./data/cache/')
        .shuffle(BUFFER_SIZE)
        .repeat(Epoch)
        .map(func, num_parallel_calls=tf.data.AUTOTUNE)
        .filter(fltr)
        .batch(BATCH_SIZE)
        .prefetch(tf.data.AUTOTUNE))

This way, calling cache first saves the processed data in binary format (done automatically by TF) the first time the pipeline runs, which further speeds up training; the cached data is then shuffled and repeated on every pass. After that, as @shivaraj said, use map and filter before batching the data. Lastly, call prefetch, as the TF documentation suggests, to prepare the next batch in the background while the GPU is working on the previous one.

Note:

Calling cache makes the first pass take a long time, depending on the data size and the memory available, but it speeds up training by at least 4x if you run multiple experiments without changing the dataset's inputs and outputs (labels). Changing where cache is called in the chain also affects how long it takes to create the cache files; I found this order to be the fastest in every respect, and it doesn't raise any warnings.
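
For reference, a minimal sketch of the two cache modes (the names, paths, and parse_fn here are illustrative):

    ds = tf.data.Dataset.from_tensor_slices(file_paths).map(parse_fn)
    ds_memory = ds.cache()                        # keep cached elements in memory (small datasets)
    ds_disk = ds.cache(filename='./data/cache/')  # write cache files to disk (large datasets)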


If you are reading images and preprocessing them through a function, use batch after the map function.

If you use batch before map, the map function does not get individual filenames; instead it gets a rank-1 tensor (a batch of filenames), and reading a file fails with:

ValueError: Shape must be rank 0 but is rank 1 for '{{node ReadFile}} = ReadFile[](args_0)' with input shapes: [?].
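
For context, a per-example parse function (parse_images, as assumed below) might look roughly like this; it expects a scalar filename, which is why it must run before batch():

    def parse_images(path):
        # path is a scalar (rank-0) string tensor when map() runs before batch()
        image = tf.io.read_file(path)  # raises the rank error above if given a batch of paths
        image = tf.io.decode_jpeg(image, channels=3)
        image = tf.image.convert_image_dtype(image, tf.float32)
        return image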

Hence the sequence is:

    dataset = tf.data.Dataset.from_tensor_slices(file_paths)
    dataset = dataset.shuffle(BUFFER_SIZE)
    dataset = dataset.repeat()   # can also be placed after batch()
    dataset = dataset.map(parse_images)
    dataset = dataset.batch(BATCH_SIZE)
    dataset = dataset.prefetch(tf.data.AUTOTUNE)

You can also choose to place repeat after batch; this doesn't affect your execution. (The practical difference is that with repeat before batch, batches can span epoch boundaries, whereas with repeat after batch, each epoch ends with a possibly smaller final batch.)

The buffer size in shuffle decides how much randomness you can introduce: the bigger the buffer, the better the shuffle, but the more RAM you need (usually > 8 GB).
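
As a small illustration (assuming the full list of paths fits in memory):

    # A buffer as large as the dataset gives a uniform (perfect) shuffle,
    # at the cost of holding that many elements in the buffer at once.
    dataset = dataset.shuffle(buffer_size=len(file_paths))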

shivaraj karki