
Edit: I know that enlarging the validation data will increase the total time per epoch. But in my case, it increases the total time per epoch by a factor of 3! That's where the problem is.

As mentioned, I'm trying to train a model on Google Colab with Keras/TensorFlow.
Here's the data information:

Train Data          :  shape-(3000, 227, 227, 1)  type--float32
Train Labels        :  shape-(3000, 2)            type--float32
Validation Data     :  shape-(200, 227, 227, 1)   type--float32
Validation Labels   :  shape-(200, 2)             type--float32

I train my model using the following command:

history = model.fit(
    x=self.standardize(self.train_data),
    y=self.train_labels,
    batch_size=1024,
    epochs=base_epochs,
    verbose=2,
    callbacks=cp_callback,
    validation_data=(self.standardize(self.val_data), self.val_labels),
)

With 200 images as the validation set, each epoch takes only 1~2s.

Now I tried using a larger validation set with 3000 images. In this situation, each epoch takes an unbelievable 8~10s! That would mean a forward pass alone is slower than a full training step (forward plus backward), which doesn't make any sense. Does anyone know where the problem is? If more details are required, I'll post more specific code.

PokeLu

1 Answer


When you specify validation_data, the model runs a forward pass over all of the validation data at the end of each epoch. If you have more validation data than before, the validation step will take more time to run.
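
As a rough sanity check (not from the original post; the model below is an illustrative stand-in, since the real model definition was not posted), you can time the validation pass in isolation with model.evaluate and watch how it scales with the number of validation images:

    import time
    import numpy as np
    import tensorflow as tf

    # Illustrative stand-in model; the poster's actual architecture is unknown.
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(8, 3, activation="relu", input_shape=(227, 227, 1)),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(2, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy")

    # Time a forward-only pass over validation sets of both sizes from the question.
    for n in (200, 3000):
        val_x = np.random.rand(n, 227, 227, 1).astype("float32")
        val_y = np.random.rand(n, 2).astype("float32")
        start = time.perf_counter()
        model.evaluate(val_x, val_y, batch_size=1024, verbose=0)
        print(f"{n} validation images: {time.perf_counter() - start:.2f}s")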

The time it takes for model training has not changed, because your model and the size of your training data have not changed. The time it takes to run validation counts toward the time to run each epoch.
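
If the extra validation time per epoch is the pain point, one workaround (my suggestion, not part of this answer as posted) is to validate only every few epochs via fit's validation_freq argument:

    # Run the full validation pass only every 5 epochs instead of after every epoch.
    history = model.fit(
        x=self.standardize(self.train_data),
        y=self.train_labels,
        batch_size=1024,
        epochs=base_epochs,
        verbose=2,
        callbacks=cp_callback,
        validation_data=(self.standardize(self.val_data), self.val_labels),
        validation_freq=5,  # tf.keras Model.fit supports this argument
    )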

jkr
  • From my point of view, backpropagation should be more time-consuming than forward propagation. The training process on 3000 images took less than 2s per epoch, but the prediction process on the validation data took about 6s; that's abnormal. – PokeLu Jul 04 '20 at 18:55
  • There's more to validation than a simple forward pass, though. You need to run inference on all of the data, calculate metrics on the outputs, and then run any callbacks. There's overhead in all of that. – jkr Jul 04 '20 at 22:20
  • Would using tf.data.Dataset solve part of this problem, as mentioned in this answer? https://stackoverflow.com/questions/59264851/google-colab-why-is-cpu-faster-than-tpu – PokeLu Jul 05 '20 at 08:58
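
For reference, a minimal sketch of the tf.data.Dataset approach raised in the last comment (the names train_x, train_y, val_x, val_y stand in for the poster's standardized arrays): batching and prefetching let the input pipeline overlap with computation, which can shave some of the validation overhead, though the forward passes themselves still have to run.

    import tensorflow as tf

    AUTOTUNE = tf.data.AUTOTUNE  # tf.data.experimental.AUTOTUNE on older TF versions

    # Build datasets once; do not pass batch_size to fit() when x is a Dataset.
    train_ds = (
        tf.data.Dataset.from_tensor_slices((train_x, train_y))
        .shuffle(3000)
        .batch(1024)
        .prefetch(AUTOTUNE)
    )
    val_ds = (
        tf.data.Dataset.from_tensor_slices((val_x, val_y))
        .batch(1024)
        .prefetch(AUTOTUNE)
    )

    history = model.fit(train_ds, epochs=base_epochs, validation_data=val_ds)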