Suppose you are training a custom tf.estimator.Estimator with tf.estimator.train_and_evaluate using a validation dataset in a setup similar to @simlmx's:

classifier = tf.estimator.Estimator(
    model_fn=model_fn,
    model_dir=model_dir,
    params=params)

train_spec = tf.estimator.TrainSpec(
    input_fn = training_data_input_fn,
)

eval_spec = tf.estimator.EvalSpec(
    input_fn = validation_data_input_fn,
)

tf.estimator.train_and_evaluate(
    classifier,
    train_spec,
    eval_spec
)

Often, one uses a validation dataset to cut off training in order to prevent over-fitting, i.e. when the loss continues to improve for the training dataset but not for the validation dataset.

Currently, tf.estimator.EvalSpec allows one to specify after how many `steps` (defaults to 100) to evaluate the model.

How can one (preferably without using tf.contrib functions) tell training to terminate after n evaluation calls (n * steps) in which the evaluation loss does not improve, and then save the "best" model / checkpoint (as determined by the validation dataset) to a unique file name (e.g. best_validation.checkpoint)?

SumNeuron
  • Possible duplicate of [Early stopping with tf.estimator, how?](https://stackoverflow.com/questions/47137061/early-stopping-with-tf-estimator-how) – GPhilo Oct 04 '18 at 08:16
  • @GPhilo similar, but not quite. It is unclear if the early stopping (`tf.contrib.estimator.stop_if_no_decrease_hook`) hook works in the `EvalSpec` – SumNeuron Oct 04 '18 at 08:18
  • I'm not sure I get your comment. The `EvalSpec` only specifies how the evaluation is done. The early-stop hook decides, with a policy, to cut the training after a series of non-improving evaluations. Each of those will be executed according to the EvalSpec you provide, the early-stop hook is agnostic to the specific evaluation specification and only cares about the result of an evaluation cycle – GPhilo Oct 04 '18 at 08:25
  • @GPhilo it is likely that I am wrong, but to my current understanding of [stop_if_no_decrease_hook](https://www.tensorflow.org/api_docs/python/tf/contrib/estimator/stop_if_no_decrease_hook) the argument `max_steps_without_decrease` (int, maximum number of training steps with no decrease in the given metric) uses the `TrainSpec` input function rather than the `EvalSpec` input function? – SumNeuron Oct 04 '18 at 08:28

1 Answer


I understand your confusion now. The documentation for stop_if_no_decrease_hook states (emphasis mine):

max_steps_without_decrease: int, maximum number of **training steps** with no decrease in the given metric.

eval_dir: If set, directory containing summary files with eval metrics. By default, estimator.eval_dir() will be used.

Looking through the code of the hook (version 1.11), though, you find:

def stop_if_no_metric_improvement_fn():
    """Returns `True` if metric does not improve within max steps."""

    eval_results = read_eval_metrics(eval_dir) #<<<<<<<<<<<<<<<<<<<<<<<

    best_val = None
    best_val_step = None
    for step, metrics in eval_results.items(): #<<<<<<<<<<<<<<<<<<<<<<<
      if step < min_steps:
        continue
      val = metrics[metric_name]
      if best_val is None or is_lhs_better(val, best_val):
        best_val = val
        best_val_step = step
      if step - best_val_step >= max_steps_without_improvement: #<<<<<
        tf_logging.info(
            'No %s in metric "%s" for %s steps, which is greater than or equal '
            'to max steps (%s) configured for early stopping.',
            increase_or_decrease, metric_name, step - best_val_step,
            max_steps_without_improvement)
        return True
    return False

What the code does is load the evaluation results (produced according to your EvalSpec parameters) and, for each evaluation record, extract the monitored metric and the global_step (or whichever other custom step you use to count) associated with it.

This is the source of the training steps part of the docs: early stopping is not triggered according to the number of non-improving evaluations, but according to the number of non-improving evaluations within a certain range of training steps (which IMHO is a bit counter-intuitive).

So, to recap: yes, the early-stopping hook uses the evaluation results to decide when it's time to cut the training, but you need to pass in the number of training steps you want to monitor and keep in mind how many evaluations will happen within that number of steps.

Examples with numbers to hopefully clarify more

Let's assume you're training indefinitely long with an evaluation every 1k steps. The specifics of how the evaluation runs are not relevant, as long as it runs every 1k steps and produces a metric we want to monitor.

If you set the hook as hook = tf.contrib.estimator.stop_if_no_decrease_hook(my_estimator, 'my_metric_to_monitor', 10000) the hook will consider the evaluations happening in a range of 10k steps.

Since you're running one eval every 1k steps, this boils down to early stopping if there's a sequence of 10 consecutive evals without any improvement. If you then decide to rerun with evals every 2k steps, the hook will only consider a sequence of 5 consecutive evals without improvement.
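To make this concrete, here is a minimal sketch of how the hook could be wired into the setup from the question (assuming TF 1.x; the metric name 'loss' and the step counts are placeholder values to adapt). Note that in train_and_evaluate an evaluation runs whenever a new checkpoint is written, so save_checkpoints_steps effectively sets the evaluation frequency:

config = tf.estimator.RunConfig(save_checkpoints_steps=1000)  # ~1 eval per 1k steps

classifier = tf.estimator.Estimator(
    model_fn=model_fn,
    model_dir=model_dir,
    config=config,
    params=params)

# Stop if 'loss' (as reported by evaluation) has not decreased within a window
# of 10k *training* steps, i.e. roughly 10 consecutive non-improving evals here.
early_stopping = tf.contrib.estimator.stop_if_no_decrease_hook(
    classifier,
    metric_name='loss',
    max_steps_without_decrease=10000,
    min_steps=1000)

train_spec = tf.estimator.TrainSpec(
    input_fn=training_data_input_fn,
    hooks=[early_stopping])  # the hook is a *training* hook, so it goes in the TrainSpec

eval_spec = tf.estimator.EvalSpec(
    input_fn=validation_data_input_fn,
    throttle_secs=0)  # don't rate-limit evals beyond the checkpoint frequency

tf.estimator.train_and_evaluate(classifier, train_spec, eval_spec)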

Keeping the best model

First of all, an important note: this has nothing to do with early stopping. The issue of keeping a copy of the best model throughout training and the one of stopping the training once performance starts degrading are completely unrelated.

Keeping the best model can be done very easily by defining a tf.estimator.BestExporter in your EvalSpec (snippet taken from the link):

  serving_input_receiver_fn = ... # define your serving_input_receiver_fn
  exporter = tf.estimator.BestExporter(
      name="best_exporter",
      serving_input_receiver_fn=serving_input_receiver_fn,
      exports_to_keep=5) # this will keep the 5 best SavedModel exports

  eval_spec = tf.estimator.EvalSpec(
    input_fn=eval_input_fn,
    steps=100,
    exporters=exporter,
    start_delay_secs=0,
    throttle_secs=5)

If you don't know how to define the serving_input_fn, have a look here.

This allows you to keep the overall best 5 models you obtained, stored as SavedModels (which is the preferred way to store models at the moment).
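As for the serving_input_receiver_fn itself, here is a minimal sketch under the assumption that the model expects a single float feature called 'x' with shape [batch_size, 64] (the name and shape are made up for illustration; match your model_fn's actual features):

def serving_input_receiver_fn():
    # Placeholder that receives the raw input tensors at serving time.
    # 'x' and its shape are assumptions; use your model's real feature spec.
    inputs = {'x': tf.placeholder(dtype=tf.float32, shape=[None, 64], name='x')}
    return tf.estimator.export.ServingInputReceiver(
        features=inputs,          # what your model_fn receives as `features`
        receiver_tensors=inputs)  # what callers of the SavedModel must feed

For simple cases like this, tf.estimator.export.build_raw_serving_input_receiver_fn can build an equivalent function for you from a dict of example tensors.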

GPhilo
  • thank you for the clarification and explanation. How could I, in this hook, export the best model (according to a given metric) to a specific checkpoint? – SumNeuron Oct 04 '18 at 10:58
  • Useful, I appreciate it; however, the docs for `serving_input_fn`, like much of the docs, leave a lot to be desired. What gets passed to the function? What if it is a sequence example? Why isn't there a default for it? From reading those docs and what it is supposed to [return](https://www.tensorflow.org/api_docs/python/tf/estimator/export/ServingInputReceiver), I do not see what I am supposed to write in that function. – SumNeuron Oct 04 '18 at 14:04
  • Also, while it doesn't have anything to do with early stopping, it is part of the OP :) If you care to go into depth about the serving function with a light example (maybe in a Google Colab), that would be cool. – SumNeuron Oct 04 '18 at 14:04
  • "*What gets passed to to the function?*" Anything looking like what you return from your `input_fn` that then is used as "feature" in your model. Essentially, it gets the input that is fed directly to the model's input by replacing your `input_fn` with a placeholder with the appropriate shape. "*what if it is a sequence example?*" Nothing special, the placeholder will have the appropriate shape for that (you define its shape). "*why isn't there a default for it?*" There are some functions to help you, but there can't be a one-fits-all default because the input_shape of your model is unknown. – GPhilo Oct 04 '18 at 14:11
  • To understand this, however, you need to look up how saved models are used when serving for inference (keywords: estimator export model, savedmodel, etc) – GPhilo Oct 04 '18 at 14:12
  • @GPhilo please see https://stackoverflow.com/questions/52874647/tensorflow-v1-10-why-is-an-input-serving-receiver-function-needed-when-checkpoi – SumNeuron Oct 18 '18 at 13:00
  • @GPhilo this doesn't work in a distributed setting? According to the documentation tensorflow.org/versions/r1.15/api_docs/python/tf/estimator/… : "Caveat: Current implementation supports early-stopping both training and evaluation in local mode. In distributed mode, training can be stopped but evaluation (where it's a separate job) will indefinitely wait for new model checkpoints to evaluate, so you will need other means to detect and stop it. Early-stopping evaluation in distributed mode requires changes in train_and_evaluate API and will be addressed in a future revision." – thinkdeep Jan 05 '21 at 02:13
  • Estimators always had issues in distributed settings; that's one of the reasons they moved away from them in favour of Keras. If you have the possibility, I'd recommend you try switching as well. – GPhilo Jan 05 '21 at 06:28