I am attempting to train a 2-hidden-layer tanh neural network on the MNIST dataset using the ADADELTA algorithm.
Here are the parameters of my setup:
- Tanh activation function
- 2 hidden layers with 784 units each (same as the number of input units)
- Softmax output layer with cross-entropy loss
- Weights are sparsely initialized with a fan-in of ~15 nonzero incoming connections per unit, drawn from a Gaussian with standard deviation 1/sqrt(15) (see the sketch after this list)
- I am using a minibatch size of 10 with 50% dropout.
- I am using the default parameters of ADADELTA (rho=0.95, epsilon=1e-6)
- I have checked my hand-derived gradients against automatic differentiation
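For concreteness, the initialization is roughly equivalent to the following sketch (the helper name `sparse_init` and the choice of which connections are nonzero are illustrative; the fan-in of ~15 and the 1/sqrt(15) standard deviation match my setup):

```python
import numpy as np

def sparse_init(n_in, n_out, fan_in=15, seed=0):
    """Sparse initialization: each output unit gets ~fan_in nonzero
    incoming weights, drawn from a Gaussian with std 1/sqrt(fan_in)."""
    rng = np.random.default_rng(seed)
    W = np.zeros((n_in, n_out))
    for j in range(n_out):
        # pick fan_in random incoming connections for this unit
        idx = rng.choice(n_in, size=fan_in, replace=False)
        W[idx, j] = rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=fan_in)
    return W

W1 = sparse_init(784, 784)  # input -> first hidden layer
W2 = sparse_init(784, 784)  # first -> second hidden layer
```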
When I run ADADELTA, it initially makes gains on the error, and I can see that the first layer is learning to identify the shapes of digits; it does a decent job of classifying them. However, when I run ADADELTA for a long time (30,000 iterations), something is clearly going wrong. The objective function stops improving after a few hundred iterations (and the internal ADADELTA accumulators stop changing), yet the first-layer weights still contain the same sparse noise they were initialized with, even though real features have been learned on top of that noise.
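For reference, the per-parameter update I implemented is the standard one from the ADADELTA paper (with $\rho = 0.95$, $\epsilon = 10^{-6}$ as above):

$$
\begin{aligned}
E[g^2]_t &= \rho\, E[g^2]_{t-1} + (1-\rho)\, g_t^2,\\
\Delta x_t &= -\frac{\sqrt{E[\Delta x^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}}\; g_t,\\
E[\Delta x^2]_t &= \rho\, E[\Delta x^2]_{t-1} + (1-\rho)\, \Delta x_t^2,\\
x_{t+1} &= x_t + \Delta x_t.
\end{aligned}
$$

Note that the step for each weight is scaled by the ratio of its own accumulated RMS values, so once $E[g^2]$ and $E[\Delta x^2]$ settle, a weight whose gradient stays tiny takes correspondingly tiny steps.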
To illustrate what I mean, here is example output from my visualization of the network.
Notice the pixel noise in the first-layer weights: even though they have learned real structure, the same noise they were initialized with is still visible on top of it.
None of the training examples contain discontinuous values like this noise, yet for some reason ADADELTA never reduces these outlier weights to bring them in line with their neighbors.
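To make the observation concrete, here is a toy single-parameter simulation (not my actual training code, and assuming such an outlier weight sees an essentially zero gradient): under the update above, the weight barely moves from its initial value over 30,000 iterations, since nothing in ADADELTA itself pulls it toward its neighbors.

```python
import numpy as np

rho, eps = 0.95, 1e-6
w = 1.0                  # an "outlier" weight left over from initialization
Eg2 = Edx2 = 0.0         # ADADELTA accumulators

for t in range(30000):
    g = 1e-8             # assumption: this weight's gradient is essentially zero
    Eg2 = rho * Eg2 + (1 - rho) * g**2
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g
    Edx2 = rho * Edx2 + (1 - rho) * dx**2
    w += dx

print(w)                 # ~0.9997: essentially unchanged after 30,000 steps
```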
What is going on?