
To resume training after a crash, one must restore not only the model but also all of the objects and parameters that make up the state of a `model.fit(...)` process.

Before I bother to fork the Keras code to implement a `fitting` object that includes, for example, the training data, I'd like to know what the standard method, if any, is for crash recovery, i.e. resuming TensorFlow 2.0 training where it left off.

Or has someone actually filled this obviously gaping hole in the TensorFlow object model?

user3673

1 Answer


The canonical way of checkpointing a tf.keras.Model.fit() process is the ModelCheckpoint callback.

The usage looks something like:

model.fit(..., callbacks=[tf.keras.callbacks.ModelCheckpoint(checkpoint_dir)])

The saved checkpoint, which is generated at the end of every training epoch by default, includes not only the model's architecture and weight values, but also the training state. If you're interested, you can study its source code here. The saved training state includes:

  • the optimizer configuration
  • the weight variable values of the optimizer (for stateful optimizers such as Adam)
  • the loss and metric configuration

Do these cover all the training states you have in mind?
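For concreteness, here is a minimal, self-contained sketch of the setup described above. The model, the random dummy data, and the `training_checkpoint` path are placeholders for illustration only:

```python
import numpy as np
import tensorflow as tf

checkpoint_dir = "training_checkpoint"  # placeholder path

# Placeholder model and data, only to make the example runnable.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(10),
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
x_train = np.random.rand(1000, 32).astype("float32")
y_train = np.random.randint(0, 10, size=(1000,))

# Saves the full model (architecture, weights, optimizer state, loss/metric
# configuration) at the end of every epoch by default.
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(checkpoint_dir)

model.fit(x_train, y_train, epochs=5, callbacks=[checkpoint_cb])
```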

Shanqing Cai
  • My envisioned checkpoint of a `fitting` object would checkpoint the entire state of the fitting process so it could be resumed with nothing more than `fitting.resume()`. I suppose parameters could be passed to the `resume` method to modify the resumption, but that would be tailfins. – user3673 Dec 31 '19 at 00:44
  • Thanks for the reply. Typically, the Python training code is available when you resume from crash. With the `ModelCheckpoint` configured with the `model.fit()` call, the program will take care of restoring the previously-saved training state automatically. However, to your point, the checkpoint is actually self-contained and can be loaded without the original training Python code. In the example above, you can do `model = tf.keras.models.load_model(checkpoint_dir)`, and the reconstituted `model` object is immediately ready for `fit()` calls that remember the saved training state. – Shanqing Cai Dec 31 '19 at 02:57
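To illustrate the second comment, here is a minimal resume sketch. It assumes a checkpoint was previously written to the placeholder path `training_checkpoint` as in the earlier example, and that training data can be reloaded after the crash:

```python
import numpy as np
import tensorflow as tf

# Placeholder data; in practice, reload your real training data here.
x_train = np.random.rand(1000, 32).astype("float32")
y_train = np.random.randint(0, 10, size=(1000,))

# Reconstitute the model from the self-contained checkpoint; the original
# model-building code is not needed.
model = tf.keras.models.load_model("training_checkpoint")

# The restored model remembers its compile configuration and optimizer
# state, so fit() continues training rather than starting from scratch.
# initial_epoch is illustrative, e.g. if the crash happened after epoch 3.
model.fit(x_train, y_train, initial_epoch=3, epochs=5)
```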