
I'm new to the tf.data API and I'm trying to write code that does the following:

I have to train a NN with A LOT of training data (more than my RAM can handle, anyway). I don't have a dataset; I have to generate the data myself, starting from parameters uniformly distributed in some intervals. The training arrays have the shapes X_train=(number_of_samples, 200) and Y_train=(number_of_samples, 3). The point is that I don't want to generate them all at once: I need millions of samples, and already above number_of_samples=2e6 my RAM dies.

So I looked into writing a generator, then using tf.data.Dataset.from_generator to create my dataset from it, and passing that to model.fit.

However, I have two issues:

1 - All the examples I could find online use generators to extract training data from an already existing set, so their number of samples is already fixed; mine is not. I also can't tell from the examples whether the generator should yield a single sample, a whole batch, or whether it should loop over all the training steps (total samples / batch size). Can someone explain this, please? :)
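
For reference, the only alternative I can picture is a per-sample generator that tf.data then batches itself. This is just a rough sketch, with the actual sample generation left as a placeholder and nw, batch_size and datatot_train defined as in my code below:

import numpy as np
import tensorflow as tf

def datagen_single(n_samples):
    # yield one (input, target) pair at a time; tf.data does the batching
    for ii in range(int(n_samples)):
        x = np.zeros(nw, dtype=np.float32)  # placeholder for one generated input sample
        y = np.zeros(3, dtype=np.float32)   # placeholder for the corresponding target
        yield x, y

dataset = tf.data.Dataset.from_generator(
    datagen_single, args=[datatot_train],
    output_signature=(tf.TensorSpec(shape=(nw,), dtype=tf.float32),
                      tf.TensorSpec(shape=(3,), dtype=tf.float32)))
dataset = dataset.batch(batch_size)

I don't know if this per-sample pattern or the per-batch one below is the intended way to use from_generator.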

As of now, my code reads:

def datagen(batch_size, steps_per_epoch):
    for jj in range(steps_per_epoch):
        X_train = np.zeros(shape=(batch_size, nw))
        Y_train = np.zeros(shape=(batch_size, 3))

        for ii in range(batch_size):
            # CODE GENERATING X_train and Y_train
            pass

        yield trans1.transform(X_train), trans2.transform(Y_train)

batch_size = 1024
nb_epoch = 10
datatot_train=1e6
datatot_val=0.2*datatot_train
steps_per_epoch=int(np.ceil(datatot_train/batch_size))
validation_steps=int(np.ceil(datatot_val/batch_size))

dataset_train = tf.data.Dataset.from_generator(
    datagen, args=[batch_size, steps_per_epoch],
    output_signature=(tf.TensorSpec(shape=(batch_size, nw), dtype=tf.float32),
                      tf.TensorSpec(shape=(batch_size, 3), dtype=tf.float32)))
dataset_val = tf.data.Dataset.from_generator(
    datagen, args=[batch_size, validation_steps],
    output_signature=(tf.TensorSpec(shape=(batch_size, nw), dtype=tf.float32),
                      tf.TensorSpec(shape=(batch_size, 3), dtype=tf.float32)))

history = model.fit(dataset_train.repeat(nb_epoch).prefetch(batch_size),
                    epochs=nb_epoch, steps_per_epoch=steps_per_epoch,
                    verbose=1, validation_data=dataset_val,
                    validation_steps=validation_steps)

Is this correct? Is there a more efficient way? I had to add the .repeat(nb_epoch) because otherwise it kept complaining that my dataset was running out of data.
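
The only other pattern I could think of is to drop the .repeat(nb_epoch) and let the generator loop forever, relying on steps_per_epoch to end each epoch. Again just a sketch, with the actual sample generation omitted as above:

def datagen_forever(batch_size):
    # yields batches indefinitely; model.fit stops each epoch after steps_per_epoch batches
    while True:
        X_train = np.zeros(shape=(int(batch_size), nw))
        Y_train = np.zeros(shape=(int(batch_size), 3))
        # CODE GENERATING X_train and Y_train
        yield trans1.transform(X_train), trans2.transform(Y_train)

dataset_train = tf.data.Dataset.from_generator(
    datagen_forever, args=[batch_size],
    output_signature=(tf.TensorSpec(shape=(batch_size, nw), dtype=tf.float32),
                      tf.TensorSpec(shape=(batch_size, 3), dtype=tf.float32)))

history = model.fit(dataset_train.prefetch(tf.data.AUTOTUNE),
                    epochs=nb_epoch, steps_per_epoch=steps_per_epoch,
                    verbose=1, validation_data=dataset_val,
                    validation_steps=validation_steps)

I'm not sure whether this is actually better than the .repeat() version.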

2 - I have noticed that if for some reason I stop the training forcefully and then run model.fit again, things slow down: on a fresh run the first epoch takes approx. 10 minutes, but on the second run the ETA is 28 minutes, then rises to 1 h, then 2 h, and so on... This suggests to me that I'm doing something wrong, but I'm not entirely sure what!

Thanks to whoever is willing to help!

r_song