
With Stochastic Gradient Descent, the weights are updated based on the cost computed for a single, randomly chosen training example.

But this single example may pull the weights in its own favour, and because the cost depends only on that example, the cost might mislead us: it isn't actually reducing the overall cost, it is just overfitting that particular example. With the next example, the weights will once again be updated to favour that example.

Won't this lead to overfitting? How do I go about resolving this issue?

Shashi Tunga
Read some theory of SGD (which is not really something for SO). It's all about *expectation* and *variance*. – sascha Nov 09 '17 at 18:29

1 Answer


The training data isn't random: SGD iterates over all of the training points, either singly or in batches. Because the loss is calculated on a single batch (or a single training point), its gradient can be thought of as a random draw from a distribution of gradient vectors in weight space, and it will not exactly match the global gradient of the loss calculated over the entire training set. A single step is absolutely "over-fit" to that batch or training point, but we only take one step in that direction, scaled by the learning rate (which is typically << 1). Then we move on to the next data point (or batch) and calculate a new gradient. There is a "recency" effect (data trained on more recently effectively counts more), but this is moderated by small learning rates. In aggregate, over many iterations, all of the training data are equally weighted.
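To make the "random draw from a distribution of gradients" idea concrete, here is a minimal sketch in plain Python. The toy data (`y = 3x` plus noise), the one-weight model, and the helper names `sample_grad` and `full_grad` are all assumptions made up for illustration, not something from the question. It shows that individual per-example gradients vary a lot, while their average is exactly the full-data gradient.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical toy data: y = 3x plus a little Gaussian noise.
xs = [random.uniform(-1.0, 1.0) for _ in range(200)]
ys = [3.0 * x + random.gauss(0.0, 0.1) for x in xs]

def sample_grad(w, x, y):
    """Gradient of the per-example loss 0.5 * (w*x - y)**2 with respect to w."""
    return (w * x - y) * x

def full_grad(w):
    """Gradient of the mean loss over the whole training set."""
    return sum(sample_grad(w, x, y) for x, y in zip(xs, ys)) / len(xs)

w = 0.0  # arbitrary starting weight
per_sample = [sample_grad(w, x, y) for x, y in zip(xs, ys)]
mean_of_samples = sum(per_sample) / len(per_sample)
spread = max(per_sample) - min(per_sample)

print("full gradient:            ", full_grad(w))
print("mean of sample gradients: ", mean_of_samples)
print("range of sample gradients:", spread)
```

Any one per-example gradient can point well away from the full gradient, which is the "over-fit" single step described above, but the average over the data set is the true descent direction.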

When this is done over all of the data in turn, each individual backprop step takes a small random (but not uncorrelated) step in weight space. Across many training iterations, the network may be able to find its way to very good solutions (there are not a lot of guarantees about global optimality, but neural networks are highly expressive by nature and can often find very good solutions). However, it may take many passes over the same data set to converge to a local basin of attraction.
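The accumulation of many small noisy steps can be sketched as below. This is only an illustration under assumptions of my own: a one-weight linear model, hypothetical toy data, a learning rate of 0.05, 20 epochs, and a reshuffled visiting order each epoch (a fixed order would fit the description above just as well). Each update uses one example only, yet the final weight lands close to the least-squares solution for the whole data set.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical toy data: y = 3x plus a little Gaussian noise.
xs = [random.uniform(-1.0, 1.0) for _ in range(200)]
ys = [3.0 * x + random.gauss(0.0, 0.1) for x in xs]

def mean_loss(w):
    """Mean of 0.5 * (w*x - y)**2 over the whole training set."""
    return sum(0.5 * (w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def sgd(w, lr=0.05, epochs=20):
    """Single-example SGD: every point is visited once per epoch."""
    order = list(range(len(xs)))
    for _ in range(epochs):
        random.shuffle(order)  # assumption: reshuffle each epoch
        for i in order:
            grad = (w * xs[i] - ys[i]) * xs[i]  # gradient from ONE example
            w -= lr * grad                      # small step, scaled by lr
    return w

w_start = 0.0
w_final = sgd(w_start)

# Closed-form least-squares weight for this one-parameter model, for comparison.
w_exact = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

print("SGD weight:          ", w_final)
print("least-squares weight:", w_exact)
print("loss before / after: ", mean_loss(w_start), mean_loss(w_final))
```

With a larger learning rate the final weight jitters more around the least-squares value, which is the "recency" effect: the most recent examples pull harder.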

Over-fitting on training data is absolutely a concern for neural networks, but that's a function of their expressivity rather than of the Stochastic Gradient Descent algorithm. Techniques like dropout and kernel regularizers on the weights can provide some robustness against it, but the only way to

T3am5hark