
I have read this article about autoencoders, introduced by Andrew Ng. In it, he uses a sparsity term as regularization to drop connections, but the formula for the sparsity term is different from the usual regularization term. So I want to know why we don't directly use a regularization term like the one used for NNs or logistic regression: (1 / (2m)) * Theta^2?

1 Answer


First, a note on naming: both the sparsity penalty and the L2 penalty on weights can be (and often are) called regularizers. So the question should really be "why use sparsity-based regularization instead of a simple L2-norm-based one?". There is no simple answer, since the question goes deep into the underlying mathematics and asks which is the better way to make sure the network learns a representation that generalizes well: keeping the parameters more or less inside a fixed sphere (L2 regularization, the one you propose), or making sure that whatever we feed into the network, it produces a relatively simple representation (possibly at the cost of having lots of weights/neurons that are rarely used). Even at this level of abstraction you can see the qualitative difference between the two regularizers, which leads to completely different models.

Will the sparsity term always be better? Probably not; nearly nothing in ML is "always better". But on average it seems like a less heuristic choice for an autoencoder. You want a kind of compression, so you force your net to create a compressed representation which is really... well... compressed (small!). L2 regularization, by contrast, would simply "squash" the representation in terms of norm (since a dot product with weights of small norm will not increase the norm of the input much), but the network can still use a "tiny bit" of each neuron, and thus effectively build a complex representation (using many units), just with small activations.
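To make the difference concrete, here is a minimal NumPy sketch. It assumes the sparsity term in the article is the KL-divergence penalty on mean hidden activations that is commonly used for sparse autoencoders, and that the hidden units are sigmoid. The function names, the target sparsity `rho = 0.05`, and the random data are my own illustrative choices, not something taken from the article.

```python
import numpy as np


def sigmoid(z):
    """Logistic activation, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def l2_penalty(weights, m, lam=1.0):
    """L2 penalty from the question: (lam / (2 * m)) * sum of squared weights.

    It looks only at the parameters, never at the activations.
    """
    return lam / (2.0 * m) * np.sum(weights ** 2)


def kl_sparsity_penalty(hidden, rho=0.05, eps=1e-8):
    """Assumed sparsity term: sum over hidden units of KL(rho || rho_hat_j).

    hidden has shape (n_samples, n_hidden) with values in (0, 1);
    rho_hat_j is the mean activation of unit j over the samples.
    """
    rho_hat = np.clip(hidden.mean(axis=0), eps, 1.0 - eps)
    kl = (rho * np.log(rho / rho_hat)
          + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))
    return float(np.sum(kl))


# Illustrative data: 100 random inputs and a hidden layer with tiny weights.
rng = np.random.default_rng(0)
m, n_in, n_hidden = 100, 20, 10
X = rng.normal(size=(m, n_in))
W_small = 0.01 * rng.normal(size=(n_in, n_hidden))
H = sigmoid(X @ W_small)  # hidden activations, one row per input

l2_value = l2_penalty(W_small, m)
kl_value = kl_sparsity_penalty(H)
print("L2 penalty on weights:      ", l2_value)
print("KL sparsity penalty on code:", kl_value)
```

In this toy setup the weights are tiny, so the L2 term is close to zero, yet every sigmoid unit is partially on for every input, so the sparsity term is large. In other words, an L2-only objective would be perfectly happy with a code that uses all units a little, which is exactly what the sparsity term is there to discourage.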

lejlot