I'm implementing a neural network and wanted to use ReLU as the activation function of the neurons. Furthermore, I'm training the network with SGD and back-propagation. I'm testing the network on the paradigmatic XOR problem, and so far it classifies new samples correctly if I use the logistic function or the hyperbolic tangent as the activation function.
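For context, the training setup is essentially the following simplified sketch (a small 2-3-1 network with the logistic function, trained one sample at a time; the layer sizes, learning rate and variable names here are illustrative, not my exact code):

import numpy as np

np.random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# XOR data: four samples, one target each
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

# 2-3-1 network: weights and biases
W1 = np.random.randn(2, 3)
b1 = np.zeros((1, 3))
W2 = np.random.randn(3, 1)
b2 = np.zeros((1, 1))

lr = 0.5
for epoch in range(10000):
    for i in np.random.permutation(len(X)):   # stochastic: one sample per update
        x, t = X[i:i+1], y[i:i+1]

        # forward pass
        z1 = x @ W1 + b1
        a1 = sigmoid(z1)
        z2 = a1 @ W2 + b2
        a2 = sigmoid(z2)

        # backward pass for a squared-error loss
        delta2 = (a2 - t) * sigmoid_prime(z2)
        delta1 = (delta2 @ W2.T) * sigmoid_prime(z1)

        # SGD weight update
        W2 -= lr * a1.T @ delta2
        b2 -= lr * delta2
        W1 -= lr * x.T @ delta1
        b1 -= lr * delta1

print(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2))   # typically approaches [0, 1, 1, 0]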
I've been reading about the benefits of using Leaky ReLU as an activation function, and implemented it in Python like this:
def relu(data, epsilon=0.1):
    return np.maximum(epsilon * data, data)
where np is the usual alias for NumPy. The associated derivative is implemented like this:
def relu_prime(data, epsilon=0.1):
    if 1. * np.all(epsilon < data):
        return 1
    return epsilon
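For what it's worth, the forward function itself behaves element-wise as I expect (the definition is repeated here so the snippet runs on its own; the sample values are just illustrative):

import numpy as np

def relu(data, epsilon=0.1):
    return np.maximum(epsilon * data, data)

print(relu(np.array([-2.0, -0.5, 0.0, 0.5, 2.0])))
# element-wise result: [-0.2, -0.05, 0.0, 0.5, 2.0]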
Using this as the activation function, I get incorrect results. For example:
Input = [0, 0] --> Output = [0.43951457]
Input = [0, 1] --> Output = [0.46252925]
Input = [1, 0] --> Output = [0.34939594]
Input = [1, 1] --> Output = [0.37241062]
It can be seen that the outputs differ greatly from the expected XOR ones. So the question is: is there any special consideration when using ReLU as the activation function?
Please, don't hesitate to ask me for more context or code. Thanks in advance.
EDIT: there is a bug in the derivative, as it only returns a single float value instead of a NumPy array. The correct code should be:
def relu_prime(data, epsilon=0.1):
    # 1 where the input exceeds epsilon, epsilon elsewhere (element-wise)
    gradients = 1. * (data > epsilon)
    gradients[gradients == 0] = epsilon
    return gradients
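A quick check (repeating the definition above so the snippet is self-contained, with illustrative inputs) shows that this version returns an element-wise gradient array rather than a single float:

import numpy as np

def relu_prime(data, epsilon=0.1):
    gradients = 1. * (data > epsilon)
    gradients[gradients == 0] = epsilon
    return gradients

print(relu_prime(np.array([-1.0, 0.5, 2.0])))
# per-element gradients: [0.1, 1.0, 1.0]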