
I looked at some notable open-source frameworks with SGD implementations: scikit-learn, Vowpal Wabbit, and TensorFlow.

All of them leave the task of deciding how many iterations to run to the user! scikit-learn requires the user to specify the number explicitly, Vowpal Wabbit assumes 1 epoch (one pass through all examples) by default but allows changing it to any number of epochs, and TensorFlow implements just a single step for a single example, leaving the entire iteration loop to the user.
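
To make the contrast concrete, here is a minimal sketch of the two interface styles in Python. The scikit-learn calls are real API (the epoch-count parameter was `n_iter` in older releases and `max_iter` in newer ones); the synthetic data is my own placeholder:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Placeholder data, just so the snippet runs end to end.
rng = np.random.RandomState(0)
X_train = rng.randn(100, 5)
y_train = (X_train[:, 0] > 0).astype(int)

# scikit-learn style: the caller picks the number of epochs up front
# (tol=None disables the built-in convergence check in newer releases,
# so exactly max_iter passes are made).
clf = SGDClassifier(max_iter=5, tol=None)
clf.fit(X_train, y_train)

# TensorFlow-style usage instead hands the loop to the caller;
# scikit-learn's partial_fit offers a similar single-step flavor:
clf2 = SGDClassifier()
for epoch in range(5):  # the user owns the loop and the stopping decision
    clf2.partial_fit(X_train, y_train, classes=[0, 1])
```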

Why is that? The task of deciding on termination isn't trivial at all. Should we stop when the loss doesn't get any better? When the average loss over the last N iterations stops improving? Should the loss be measured on a validation/hold-out set? Or maybe it's not the loss at all, and we should check whether the optimized weights have stopped changing by much? And should we check for termination after every example, or only once in a while?
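
For concreteness, here is one point in that design space, written as a sketch rather than any framework's actual implementation: stop after `patience` consecutive epochs in which the held-out loss fails to improve, checking once per epoch rather than after every example. The function and parameter names (`grad_fn`, `loss_fn`, `patience`, `tol`) are hypothetical:

```python
import numpy as np

def sgd_with_early_stopping(grad_fn, loss_fn, w0, X_tr, y_tr, X_val, y_val,
                            lr=0.01, max_epochs=1000, patience=3, tol=1e-4):
    """Sketch of one possible termination rule: stop after `patience`
    consecutive epochs where validation loss improves by less than `tol`."""
    w = w0.copy()
    best_loss = np.inf
    bad_epochs = 0
    for epoch in range(max_epochs):
        # One SGD pass over the training examples in random order.
        for i in np.random.permutation(len(X_tr)):
            w -= lr * grad_fn(w, X_tr[i], y_tr[i])
        # Check the hold-out loss once per epoch, not after every example.
        val_loss = loss_fn(w, X_val, y_val)
        if val_loss < best_loss - tol:
            best_loss, bad_epochs = val_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:  # hold-out loss has stalled
                break
    return w
```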

I'd be happy if someone could shed some light on this design decision. Am I missing something that makes it impossible to do internally? The theory in this area is heavy, and I was hoping for some support from the frameworks.

  • In Vowpal Wabbit, training ends by default after three consecutive passes in which the holdout loss has not improved. This number can be changed with `--early_terminate`. – Martin Popel Mar 02 '16 at 16:56
  • For large datasets with proper regularization, you are better off never terminating. So it's really a judgment call: you look at the graph of improvement and decide whether you have enough patience to wait longer. – Yaroslav Bulatov Mar 02 '16 at 17:53
  • @YaroslavBulatov, sometimes the training is part of a bigger learning pipeline and can't be babysat. That's why I was looking for a guideline on how patient I should be if I want to ensure convergence to a solution that is expected to be within x% of the optimal one... – ihadanny Mar 03 '16 at 13:15
  • Yeah, there is no such thing (a guarantee of being within x% of the optimal solution). Also, a problem with a fully automatic pipeline is that various hyper-parameters (such as the learning rate) depend on the input data, so if your input data changes, you may end up diverging and getting random-like accuracy no matter how long you train. A common practical approach is to treat the number of iterations as a hyper-parameter that you manually tune along with the learning rate, and just hard-code that. – Yaroslav Bulatov Mar 03 '16 at 16:15
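
A minimal sketch of the approach Yaroslav describes in the last comment, tuning the epoch count jointly with the learning rate. The scikit-learn calls are real API, but the specific grid values and synthetic data are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder data so the snippet runs end to end.
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = (X[:, 0] + 0.1 * rng.randn(200) > 0).astype(int)

# Treat the number of epochs as a hyper-parameter, tuned together with
# the learning rate; tol=None makes max_iter the actual pass count.
grid = GridSearchCV(
    SGDClassifier(learning_rate="constant", tol=None),
    param_grid={"max_iter": [5, 20, 100], "eta0": [0.001, 0.01, 0.1]},
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)  # hard-code the winners into the pipeline
```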

0 Answers