
For the sake of learning the finer details of deep neural networks, I have coded my own library with everything (optimizer, layers, activations, cost function) homemade.

It seems to work fine when benchmarking it on the MNIST dataset using only sigmoid activation functions.

Unfortunately, I run into issues when replacing these with ReLUs.

This is what my learning curve looks like for 50 epochs on a training dataset of ~500 examples:

[Image: learning curve over 50 epochs, collapsing after ~8 epochs]

Everything is fine for the first ~8 epochs, and then the accuracy collapses completely, down to that of a dummy classifier (~0.1). I checked the code of the ReLU and it seems fine. Here are my forward and backward passes:

def fprop(self, inputs):
    # Forward pass: element-wise max(input, 0)
    return np.maximum(inputs, 0.)

def bprop(self, inputs, outputs, grads_wrt_outputs):
    # Backward pass: gradient is 1 where the unit was active, 0 elsewhere.
    # For ReLU, outputs > 0 is equivalent to inputs > 0.
    derivative = (outputs > 0).astype(float)
    return derivative * grads_wrt_outputs
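
One way to double-check this numerically is a finite-difference comparison; a sketch, using standalone copies of the two methods above:

import numpy as np

# Standalone copies of the layer's two methods, for a gradient check.
def fprop(inputs):
    return np.maximum(inputs, 0.)

def bprop(inputs, outputs, grads_wrt_outputs):
    derivative = (outputs > 0).astype(float)
    return derivative * grads_wrt_outputs

rng = np.random.RandomState(0)
x = rng.randn(5)
eps = 1e-6

# Numerical gradient of sum(fprop(x)) w.r.t. each component of x.
num_grad = np.array([
    (fprop(x + eps * np.eye(5)[i]).sum() - fprop(x - eps * np.eye(5)[i]).sum()) / (2 * eps)
    for i in range(5)
])

# Analytical gradient from bprop (the gradient of a sum is a vector of ones).
ana_grad = bprop(x, fprop(x), np.ones(5))

print(np.allclose(num_grad, ana_grad, atol=1e-5))  # expect True (away from x == 0)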

The culprit seems to be the numerical stability of the ReLU. I have tried different learning rates and many parameter initializations, with the same result each time. Tanh and sigmoid work properly. Is this a known issue? Is it a consequence of the discontinuous derivative of the ReLU function?

Learning is a mess
  • Welcome to StackOverflow. Please read and follow the posting guidelines in the help documentation, as suggested when you created this account. [Minimal, complete, verifiable example](http://stackoverflow.com/help/mcve) applies here. We cannot effectively help you until you post your MCVE code and accurately describe the problem. We should be able to paste your posted code into a text file and reproduce the problem you described. – Prune Mar 19 '18 at 21:24
  • Problem isn't here, I'm guessing you run into overflow/underflow somewhere. – cs95 Mar 19 '18 at 21:29
  • @cᴏʟᴅsᴘᴇᴇᴅ: Do you think I should clip the value at the exit of the `relu`? Set some upper bound, such as `~15`. – Learning is a mess Mar 19 '18 at 21:32
  • Did you normalise your input? – cs95 Mar 19 '18 at 21:35
  • Yes, it takes values in `[0,1]`. – Learning is a mess Mar 19 '18 at 21:38
  • @COLDSPEED Would whitening the data be a better choice? – Learning is a mess Mar 19 '18 at 22:56
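
If the clipping idea from these comments were tried, it could look like the following sketch, written in the same fprop/bprop style as the question (the class name and the default cap of 6, as in ReLU6, are illustrative choices, not part of the asker's library; the `~15` from the comment would work the same way):

import numpy as np

class ClippedRelu(object):
    # Hypothetical bounded ReLU, as suggested in the comments above.
    def __init__(self, cap=6.):
        self.cap = cap

    def fprop(self, inputs):
        # Clamp activations into [0, cap] to bound their magnitude.
        return np.clip(inputs, 0., self.cap)

    def bprop(self, inputs, outputs, grads_wrt_outputs):
        # Gradient is 1 only on the linear region (0, cap), 0 elsewhere.
        derivative = ((inputs > 0) & (inputs < self.cap)).astype(float)
        return derivative * grads_wrt_outputs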

1 Answer


Yes, it's quite possible that the ReLUs are to blame. Most of the classic perceptron-based models, including ConvNet (the classic MNIST trainer), depend on both positive and negative weights for their training accuracy. ReLU discards the negative part of the signal, which detracts from the model's capabilities.

ReLU is better suited to convolution layers; there it acts as a filter that says, "If the kernel isn't excited about this part of the input, I don't care how deep the boredom goes; just ignore it." MNIST training depends on counter-correction, allowing nodes to say "No, this isn't good, run the other way!"
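
One common way to keep some of that negative, counter-correcting signal while staying close to ReLU is a leaky ReLU. A sketch in the question's fprop/bprop style (the class name and the 0.01 slope are conventional defaults, not from the asker's library):

import numpy as np

class LeakyRelu(object):
    # Hypothetical leaky ReLU: lets a small fraction of negative signal through.
    def __init__(self, alpha=0.01):
        self.alpha = alpha

    def fprop(self, inputs):
        # Negative inputs are scaled by alpha instead of being zeroed.
        return np.where(inputs > 0, inputs, self.alpha * inputs)

    def bprop(self, inputs, outputs, grads_wrt_outputs):
        # Gradient is 1 for positive inputs and alpha for negative ones,
        # so units can never go completely dead.
        derivative = np.where(inputs > 0, 1., self.alpha)
        return derivative * grads_wrt_outputs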

Prune
  • Hmm, it really depends on the training data. For instance, I've found ReLUs to work great outside of CNNs, just that you need to be careful of overflows. You can handle that if you scale your data accordingly. – cs95 Mar 19 '18 at 21:38
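
The scaling mentioned here could be as simple as standardizing each input feature; a minimal sketch below (whitening, as asked about in the comments above, goes further and also decorrelates the features):

import numpy as np

def standardize(X, eps=1e-8):
    # Zero-mean, unit-variance scaling per feature; eps guards against
    # constant features. Compute mean/std on the training set only and
    # reuse them for validation and test data.
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / (std + eps)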