
I've built a relatively simple artificial neural network in an attempt to model the value function in a Q-learning problem, but to verify that my implementation of the network is correct I am trying to solve the XOR problem.

My network architecture uses two layers, both with tanh activations, a learning rate of 0.001, bias units, and momentum set to 0.9. After each training iteration I print the squared error, and I train until the error converges to ~0.001. This works about 75% of the time, but the other 25% of the time the network converges to an error of ~0.5, which is quite large for this problem.
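For reference, here is a minimal sketch of the setup described above, written with NumPy. Everything not stated in the question (the hidden-layer width, the weight initialization, full-batch updates over the four XOR patterns, the random seed) is an assumption for illustration only:

```python
import numpy as np

# Sketch of the setup described: two layers with tanh, bias units,
# learning rate 0.001, momentum 0.9, squared error.
# ASSUMPTIONS (not from the question): 2 hidden units, uniform init,
# full-batch updates over the four XOR patterns.
rng = np.random.default_rng(0)

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([[0.], [1.], [1.], [0.]])

lr, mom = 0.001, 0.9
W1 = rng.uniform(-1, 1, (2, 2)); b1 = np.zeros((1, 2))
W2 = rng.uniform(-1, 1, (2, 1)); b2 = np.zeros((1, 1))
vW1, vb1 = np.zeros_like(W1), np.zeros_like(b1)
vW2, vb2 = np.zeros_like(W2), np.zeros_like(b2)

for it in range(200000):
    H = np.tanh(X @ W1 + b1)          # hidden layer, tanh activation
    Y = np.tanh(H @ W2 + b2)          # output layer, tanh activation
    err = 0.5 * np.sum((Y - T) ** 2)  # squared error

    dY = (Y - T) * (1 - Y ** 2)       # d(tanh)/dz = 1 - tanh(z)^2
    dH = (dY @ W2.T) * (1 - H ** 2)

    # momentum updates
    vW2 = mom * vW2 - lr * (H.T @ dY); W2 += vW2
    vb2 = mom * vb2 - lr * dY.sum(0, keepdims=True); b2 += vb2
    vW1 = mom * vW1 - lr * (X.T @ dH); W1 += vW1
    vb1 = mom * vb1 - lr * dH.sum(0, keepdims=True); b1 += vb1

    if it % 10000 == 0:
        print(f"{err:.6f}")
```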

Here is a sample printout of the error term:

0.542649
0.530637
0.521523
0.509143
0.504623
0.501864
0.501657
0.500268
0.500057
0.500709
0.501979
0.501456
0.50275
0.507215
0.517656
0.530988
0.535493
0.539808
0.543903

The error oscillates like this indefinitely.

So the question is: is my implementation broken, or is it possible that I am running into a local minimum?

Andnp
  • See: http://stats.stackexchange.com/questions/126994/questions-about-q-learning-using-neural-networks and especially @zergylord's answer to Q3. – BadZen Dec 01 '15 at 17:44
  • Ah yes, I've read through this post quite a bit. I've tried many of the suggestions, for instance using linear output layers, ReLU layers, etc. (even Maxout, although my implementation is still a work in progress [aka broken]). But all of my attempts have led to this same issue where my error oscillates about some large value and never fully converges towards 0. My fear is that I have a bad implementation, but this may just be a symptom of local-minimum convergence; I am just not experienced enough to be able to tell the difference yet. – Andnp Dec 01 '15 at 18:11
  • 1
    I recommend doing two things: 1) verify the derivative calculations at each sample in your backprop with a numerical differentiation routine (obv. for testing and not when training for real) to make sure there are no bugs there, and 2) try training with some known-correct nonlinear optimization routine that does directed line searches (say BFGS) instead of a Q-learning regimen - this class of algorithms with only step "forward" - you should never see a `J_{t+1} >= J_t` if the implementation is correct. – BadZen Dec 01 '15 at 18:17
  • Thanks for the suggestions! I tested the derivatives and also implemented cross-entropy, which showed the same problem but with much lower probability. However, when training my network with a linear routine, the error decreased monotonically (as expected). I'll check to see how it behaves with BFGS. – Andnp Dec 01 '15 at 19:39
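Following up on BadZen's first suggestion, here is a sketch of a central-difference gradient check, assuming a `loss_fn` that evaluates the network's error for a flat parameter vector (both names are hypothetical; adapt to however your implementation stores weights):

```python
import numpy as np

def numerical_grad(loss_fn, params, eps=1e-5):
    """Central-difference estimate of dJ/dparams.

    loss_fn: callable mapping the (flat) parameter array to a scalar loss.
    params:  1-D numpy array of all weights and biases.
    """
    grad = np.zeros_like(params)
    for i in range(params.size):
        old = params[i]
        params[i] = old + eps
        j_plus = loss_fn(params)
        params[i] = old - eps
        j_minus = loss_fn(params)
        params[i] = old  # restore the original value
        grad[i] = (j_plus - j_minus) / (2 * eps)
    return grad

# Compare against your backprop gradient; for a correct implementation
# the relative error should be tiny (on the order of 1e-7):
#   num = numerical_grad(loss_fn, params)
#   rel_err = np.linalg.norm(bp - num) / (np.linalg.norm(bp) + np.linalg.norm(num))
```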
