
I want to ask how to monitor the validation loss while training Estimators in TensorFlow. I have checked a similar question asked before (validation during training of Estimator), but it did not help much.

If I use an Estimator to build a model, I give an input function to the Estimator.train() function. But there is no way to pass additional validation_x and validation_y data into the training process, so once training starts I can only see the training loss. The training loss is expected to decrease as training runs longer, but that information is not helpful for preventing overfitting. The more valuable signal is the validation loss, which is usually U-shaped as a function of the number of epochs. To prevent overfitting, we want to find the number of epochs at which the validation loss is at its minimum.

So this is my problem: how can I get the validation loss for each epoch while training with Estimators?

Han M
  • So, it looks like you need to manually control this. Define separate input functions for the training and validation datasets. Train for X steps/epochs using train() with the training input, get the validation loss using evaluate() with the validation input, decide whether or not you want to train more, then run train() again or quit. – Mad Wombat Nov 09 '18 at 21:13
  • Hi Mad Wombat, yes. My goal is to get the right number of epochs to prevent overfitting. – Han M Nov 09 '18 at 22:16
  • If you want to control the training on every step, you might want to skip the estimators and implement your own training loop. If you are OK with a bit less granularity, you can implement a simple loop where you call train() for some preset number of steps or epochs (which you can adjust as you go) and then call evaluate() to judge your progress. This is basic Python we are talking about, nothing too complicated. – Mad Wombat Nov 10 '18 at 05:00
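A minimal sketch of the loop suggested in the comments above, assuming hypothetical train_input_fn/val_input_fn callables, an already-built estimator, and an illustrative patience rule for the "decide whether you want to train more" step:

# Sketch only: train_input_fn, val_input_fn, estimator and max_epochs are
# assumed to exist; the patience rule is one simple way to decide when
# the validation loss has stopped improving.
best_val_loss = float("inf")
patience, bad_epochs = 3, 0

for epoch in range(max_epochs):
    estimator.train(input_fn=train_input_fn)             # one pass over the training data
    metrics = estimator.evaluate(input_fn=val_input_fn)  # returns a dict including "loss"
    if metrics["loss"] < best_val_loss:
        best_val_loss, bad_epochs = metrics["loss"], 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                       # validation loss stopped improving
            print("Stopping after epoch", epoch)
            break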

1 Answer


You need to create a validation input_fn and then either alternate between estimator.train() and estimator.evaluate(), or simply use tf.estimator.train_and_evaluate().

import tensorflow as tf

x_train, y_train = ...
x_val, y_val = ...

...

# For example, if the arrays are numpy arrays < 2 GB
def train_input_fn():
    return tf.data.Dataset.from_tensor_slices((x_train, y_train))

def val_input_fn():
    return tf.data.Dataset.from_tensor_slices((x_val, y_val))

...

estimator = ...

for epoch in range(n_epochs):
    estimator.train(input_fn=train_input_fn)    # runs until train_input_fn is exhausted
    estimator.evaluate(input_fn=val_input_fn)   # computes validation loss/metrics

estimator.evaluate() will compute the loss and any other metrics that are defined in your model_fn and will save the events in a new "eval" directory inside your job_dir.
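For the tf.estimator.train_and_evaluate() route, here is a minimal sketch reusing the input functions above; the step count and timing values are illustrative only:

# Sketch only: max_steps and throttle_secs are illustrative values.
train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=10000)
eval_spec = tf.estimator.EvalSpec(
    input_fn=val_input_fn,
    steps=None,         # None = evaluate on the full validation set
    throttle_secs=60)   # evaluate at most once a minute, when a new checkpoint exists
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

Since both approaches write the training and "eval" events under the same model directory, you can point TensorBoard at it to see the two loss curves together.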

Olivier Dehaene
  • Won't this load the graph every time it switches between train and evaluate (which is a time-consuming process)? – Saravanabalagi Ramachandran Dec 23 '19 at 06:36
  • Yes, the for loop is suboptimal as it will load both the train and eval graphs for each epoch. Even if you use `tf.estimator.train_and_evaluate()` the documentation states that: _"(the) evaluation graph (including eval_input_fn) will be re-created for each evaluate call. `estimator.train` will be called only once."_ So you will still load 1 graph per epoch. – Olivier Dehaene Dec 23 '19 at 11:40
  • If you are running estimators in a distributed setting, one solution is to have a dedicated node for evaluation. See the [`RunConfig`](https://www.tensorflow.org/api_docs/python/tf/estimator/RunConfig?version=stable) documentation; a sketch follows these comments. – Olivier Dehaene Dec 23 '19 at 11:51
  • I tried to use `tf.estimator.train_and_evaluate()`, and it seems as if it runs evaluation only once at the end, and not throughout the training. I tried to play with eval spec (like changing the throttle sec to 1 second), but it does not seem to work otherwise. Am I missing something here? @OlivierDehaene – yonatansc97 Nov 01 '20 at 14:28
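A sketch of the dedicated-evaluator setup mentioned in the comments above; the hostnames and cluster layout are placeholders, and the evaluator task is deliberately absent from the cluster spec:

import json
import os

import tensorflow as tf

# Sketch only: hosts are placeholders, and model_fn / train_spec / eval_spec
# are assumed to be defined as in the answer above.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "chief": ["host0:2222"],
        "worker": ["host1:2222", "host2:2222"],
    },
    "task": {"type": "evaluator", "index": 0},
})

config = tf.estimator.RunConfig()  # reads TF_CONFIG from the environment
estimator = tf.estimator.Estimator(model_fn=model_fn, config=config)
# On the evaluator node, train_and_evaluate() runs continuous evaluation only.
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)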