
Nobody ever seems to run their model for, say, 10.5 epochs. What is the theoretical reason for this?

It is somewhat intuitive to me that if I had a training set of perfectly unique samples, each sample would deserve the same number of passes, so the optimal knee point between undertraining and overtraining should fall on a full-epoch boundary. However, in most cases individual training samples are similar or related to one another in some way, which weakens that argument.

Is there a solid statistics-based reason? If not, has anyone investigated this empirically?
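
For concreteness, a fractional epoch count just means stopping after a fractional share of the per-epoch mini-batch updates; a minimal sketch, with made-up dataset and batch sizes:

```python
import math

# Purely illustrative numbers, only to pin down what "10.5 epochs" would mean.
n_samples = 50_000     # training-set size (assumed)
batch_size = 128       # mini-batch size (assumed)
target_epochs = 10.5

steps_per_epoch = math.ceil(n_samples / batch_size)   # 391 updates per full pass
total_steps = round(target_epochs * steps_per_epoch)  # 4106 weight updates

print(f"{target_epochs} epochs = {total_steps} mini-batch updates")
```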

5Ke
  • You can always track the iteration count and do early stopping. In my view, stopping at 10.5 or 10.2 epochs has no particular meaning for generalization; it isn't really about epochs at all, it's about the weight updates. – Feras Apr 28 '17 at 15:02
  • The theoretical reason is that you would bias the model by doing that. – Dr. Snoopy Apr 28 '17 at 21:24
  • @MatiasValdenegro Can you expand on that? How would the model end up being biased? Can you refer me to a source that says this, or even better: explains it? – 5Ke May 02 '17 at 07:14

1 Answer


I dispute the premise: where I work, we often run for partial epochs, although for large data sets the counts run higher: say, 40.72 epochs.

For small data sets or short training, it's a matter of treating each observation with equal weight, so it's natural to think that one needs to process each the same number of times. As you point out, if the input samples are related, then it's less important to do so.

I would think that one base reason is convenience: integers are easier to interpret and discuss. For many models, there is no knee at optimal training: it's a gentle curve, such that there is almost certainly an integral number of epochs within the "sweet spot" of accuracy. Thus, it's more convenient to report that 10 epochs is a little better than 11, even if the optimal point (found with multiple training runs at tiny differences in iteration count) happens to be 10.2 epochs. Diminishing returns says that if 9-12 epochs all give very similar, good results, we simply note that 10 gives the best performance in the 8-15 epoch range we searched, accept the result, and get on with the rest of life.
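
To make the partial-epoch point concrete, here is a minimal, framework-free sketch (toy linear model, synthetic data, all names of my own invention) in which the stopping criterion is a step budget rather than a whole number of passes over the data; nothing in the update rule cares whether the loop ends on an epoch boundary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data, purely for illustration.
X = rng.normal(size=(1000, 5))
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=1000)

batch_size = 32
steps_per_epoch = int(np.ceil(len(X) / batch_size))
budget_epochs = 10.2                        # a deliberately fractional budget
max_steps = int(budget_epochs * steps_per_epoch)

w = np.zeros(5)
lr = 0.01
step = 0
while step < max_steps:
    perm = rng.permutation(len(X))          # reshuffle for each new pass
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
        w -= lr * grad                      # plain SGD update
        step += 1
        if step >= max_steps:               # stop mid-epoch: some samples have now
            break                           # been seen 11 times, the rest 10 times

print(f"stopped after {step} updates = {step / steps_per_epoch:.2f} epochs")
```

The only practical consequence of stopping at 10.2 rather than 10 or 11 epochs is the one discussed above: a slice of the data has been seen one more time than the rest, which matters less and less as the samples become more redundant.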

Prune