
In Andrew Ng's lecture notes, L-BFGS is used and the autoencoder learns some hidden features. Can I use gradient descent instead and produce the same hidden features? All the other parameters are the same; only the optimization algorithm changes.

When I use L-BFGS, my autoencoder can produce the same hidden features as in the lecture notes, but when I use gradient descent, the features in the hidden layer are gone and the result looks totally random.

To be specific, in order to optimize the cost function, I implement 1) the cost function and 2) the gradient of each weight and bias, and pass them to the scipy optimization toolbox to minimize the cost function. This setup gives me reasonable hidden features.
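Roughly, the setup is like the following simplified sketch (not my exact code; the real cost function from the notes has extra terms such as the sparsity penalty and weight decay, and the sizes and random data below are just placeholders):

```python
# Simplified one-hidden-layer autoencoder: cost + flattened gradient passed to L-BFGS-B.
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unpack(theta, n_in, n_hid):
    """Split the flat parameter vector into W1, b1, W2, b2."""
    i = 0
    W1 = theta[i:i + n_hid * n_in].reshape(n_hid, n_in); i += n_hid * n_in
    b1 = theta[i:i + n_hid];                             i += n_hid
    W2 = theta[i:i + n_in * n_hid].reshape(n_in, n_hid); i += n_in * n_hid
    b2 = theta[i:i + n_in]
    return W1, b1, W2, b2

def cost_and_grad(theta, X, n_in, n_hid):
    """Squared-error autoencoder cost and its gradient (flattened)."""
    W1, b1, W2, b2 = unpack(theta, n_in, n_hid)
    m = X.shape[1]
    a1 = sigmoid(W1 @ X + b1[:, None])      # hidden activations
    a2 = sigmoid(W2 @ a1 + b2[:, None])     # reconstruction
    diff = a2 - X
    cost = 0.5 * np.sum(diff ** 2) / m
    # Backpropagation
    d2 = diff * a2 * (1 - a2)
    d1 = (W2.T @ d2) * a1 * (1 - a1)
    grad = np.concatenate([(d1 @ X.T / m).ravel(), d1.mean(axis=1),
                           (d2 @ a1.T / m).ravel(), d2.mean(axis=1)])
    return cost, grad

rng = np.random.default_rng(0)
n_in, n_hid = 64, 25
X = rng.random((n_in, 1000))                # placeholder data (e.g. image patches)
theta0 = rng.normal(scale=0.01, size=2 * n_in * n_hid + n_in + n_hid)

res = minimize(cost_and_grad, theta0, args=(X, n_in, n_hid),
               jac=True, method='L-BFGS-B', options={'maxiter': 400})
W1_opt = unpack(res.x, n_in, n_hid)[0]      # rows visualize as the hidden features
```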

But when I change to gradient descent, I simply update each weight with "weight - gradient of the weight" and each bias with "bias - gradient of the bias". The resulting hidden features look totally random.
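Concretely, the update loop looks roughly like this (again a simplified sketch, reusing `cost_and_grad` and the variables from the snippet above; with `learning_rate = 1.0` it is exactly the raw "weight - gradient" update I described):

```python
# Plain batch gradient descent on the same cost.
learning_rate = 1.0         # 1.0 reproduces the raw "weight - gradient" update
theta = theta0.copy()
for epoch in range(400):
    cost, grad = cost_and_grad(theta, X, n_in, n_hid)
    theta -= learning_rate * grad
```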

Can somebody help me understand the reason? Thanks.

iTS
    `they use LBFGS and get some hidden features. Can I use gradient descent instead and produce the same hidden features?` - in principle yes, at least if both converge. Gradient descent, however, can be painfully slow for some functions, so you may not end up in a local optimum in a reasonable amount of time. Also, the choice of the step size will be critical if you want to implement the optimization yourself. – cel May 16 '16 at 07:49

1 Answer


Yes, you can use SGD instead; in fact, it is the most popular choice in practice. L-BFGS-B is not a typical method for training neural networks. However:

  • you will have to tweak the hyperparameters of the training method; you cannot just reuse the ones that worked for L-BFGS, as this is a completely different method (ok, not completely, but it uses first-order optimization instead of second-order)
  • you should include momentum in your SGD; it is an extremely easy way to get a kind of second-order approximation, and is known (when carefully tuned) to perform as well as actual second-order methods in practice — see the sketch after this list
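
A minimal sketch of what SGD with momentum could look like here (reusing the hypothetical `cost_and_grad`, `theta0`, `X`, `n_in`, `n_hid` from the question's snippet; the learning rate, momentum coefficient, batch size, and epoch count are made-up values that would need tuning):

```python
import numpy as np

rng = np.random.default_rng(1)
learning_rate, momentum, batch_size = 0.05, 0.9, 64
theta = theta0.copy()
velocity = np.zeros_like(theta)
for epoch in range(100):
    perm = rng.permutation(X.shape[1])                  # shuffle examples each epoch
    for start in range(0, X.shape[1], batch_size):
        batch = X[:, perm[start:start + batch_size]]
        _, grad = cost_and_grad(theta, batch, n_in, n_hid)
        velocity = momentum * velocity - learning_rate * grad   # accumulate velocity
        theta += velocity                                        # momentum step
```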
lejlot
  • If I don't use the same parameters, can I use the same structure of the neural network, e.g. the number of nodes in the hidden layer? Thanks for your reply, it is very helpful. – iTS May 17 '16 at 05:16
  • Yes, the structure of the network is more or less independent of the learning scheme. Of course there are structures for which we have specific optimizers, but L-BFGS-B is not one of these, so you can always change it to SGD+momentum. – lejlot May 18 '16 at 20:14