2

I have implemented back-propagation for an MLP using the sigmoid activation function.

During the forward phase I store the output from each layer in memory.

After calculating the output error and the output gradient vector, I work backwards through the layers and calculate the hidden error for each layer (using the output of the current layer, the weights of layer +1, and the error of layer +1). I then use the hidden error and the output of layer -1 to calculate that layer's gradient vector. Once back-propagation is complete, I update the weights using the calculated gradient vectors for each layer.
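
For concreteness, this is a simplified sketch of that backward pass (biases omitted; the names, shapes and the plain SGD update are illustrative, not my exact code):

import numpy as np

def backward_pass(weights, outputs, x, target, lr=0.1):
    # weights[l]: weight matrix of layer l, shape (n_l, n_{l-1})
    # outputs[l]: post-activation output of layer l stored during the forward phase
    num_layers = len(weights)
    errors = [None] * num_layers
    grads = [None] * num_layers

    # output layer: error from prediction vs. target, times the sigmoid
    # derivative, which for a stored output o is o * (1 - o)
    o = outputs[-1]
    errors[-1] = (o - target) * o * (1.0 - o)

    # hidden layers in reverse: weights and error of layer l+1, times the
    # derivative evaluated at layer l's own output
    for l in range(num_layers - 2, -1, -1):
        o = outputs[l]
        errors[l] = weights[l + 1].T @ errors[l + 1] * o * (1.0 - o)

    # the gradient of layer l pairs its error with the previous layer's output
    # (or the network input for the first layer); then update the weights
    for l in range(num_layers):
        prev = outputs[l - 1] if l > 0 else x
        grads[l] = np.outer(errors[l], prev)
        weights[l] -= lr * grads[l]
    return weights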

My question is related to the implementation of the relu activation function. I have the following functions for applying activations; the first is the one I used in the initial run, and the other two are for relu.

def sigmoid(self, a):
    # logistic activation: maps the pre-activation a into (0, 1)
    o = 1/(1+np.exp(-a))
    return o

def relu(self, a):
    # rectified linear unit: element-wise max(0, a)
    return np.maximum(0, a)

def reluDerivative(self, x):
    # relu derivative: 1 where x > 0, 0 elsewhere
    return 1. * (x > 0)

To implement the relu activation function, do I need to make any other changes during the forward phase or the back-propagation phase? I read that I might need to calculate the relu derivative during the backward phase and apply it, but I am confused about how this works. I appreciate any advice.

unaied
  • Why don't you use a framework such as PyTorch? – Daisuke Akagawa Mar 24 '21 at 08:46
  • That's a good idea... but for now I am trying to learn the basics using Jupyter :-) – unaied Mar 24 '21 at 08:49
  • For the back-propagation phase, you have to define the derivative function of relu. It's difficult to explain here how a neural network does feed-forward and back-propagation. First of all, implement the derivative function! – Daisuke Akagawa Mar 24 '21 at 08:53
  • I understand how to implement the relu during the forward phase, but how do I apply it during the backward phase? I have the function for calculating the relu derivative but am not sure where to apply it: `def reluDerivative(self, x): return 1. * (x > 0)` – unaied Mar 24 '21 at 08:55
  • so during the backprop phase do I have to re-calculate the output of each layer using the relu derivatives? and then use this to calculate the hidden errors? – unaied Mar 24 '21 at 09:20
  • These things are hard to discuss without having access to the full implementation. For example, whether or not you need to recalculate the activations will depend on how your network is set up. – Paul Brodersen Mar 24 '21 at 11:53
  • I am trying to build an MLP with relu activation. Let's say I have three hidden layers. – unaied Mar 24 '21 at 14:57
  • When applying the sigmoid function I did the following during backpropagation: for the output layer I calculated the output error using the predicted and actual values. I then used the output error and the output from the layer before the output layer to calculate the gradient vector. Then, in reverse, I back-propagated: the hidden error for each layer is calculated using the previously calculated output of the current layer, the weights of the next layer, and the hidden error of the next layer. Using the hidden error and the output from the previous layer, I calculated the gradient vector for this layer. – unaied Mar 24 '21 at 14:58

2 Answers

1

Assuming that your class is currently set up something like this:

import numpy as np


def logistic(z):
    return 1./(1. + np.exp(-z))


class backpropagation(object):

    ...

    def get_activation(self, a):
        return logistic(a)

    def get_delta_activation(self, a):
        # derivative of the logistic function, expressed via its output
        y = logistic(a)
        dy = y * (1. - y)
        return dy

then the new derived class would be

class BPwithRelu(backpropagation):

    def get_activation(self, a):
        return np.maximum(0, a)

    def get_delta_activation(self, a):
        # relu derivative: 1 where the pre-activation is positive, 0 elsewhere
        return (a > 0).astype(float)
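
A quick check of the overridden hooks, assuming the base class can be instantiated without arguments (the real `__init__` may differ):

net = BPwithRelu()
a = np.array([-1.0, 0.5, 2.0])      # pre-activation values of some layer
net.get_activation(a)               # array([0. , 0.5, 2. ])
net.get_delta_activation(a)         # array([0., 1., 1.])

As long as the rest of backpropagation applies get_delta_activation to the stored pre-activations during the backward phase, nothing else needs to change.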


    
Paul Brodersen
1

When doing the back-propagation you will need the intermediate values in order to apply the chain rule. Assuming you only have a sigmoid followed by a relu, the composed function is:

f(x) = relu(sigmoid(x))
relu(x) = max(0, x)
sigmoid(x) = 1/(1+exp(-x))

Differentiating f(x) using the chain rule (Lagrange's notation):

f'(x) = relu'(sigmoid(x)) * sigmoid'(x)

You can see that the gradient of the sigmoid is multiplied by the gradient of the relu. Note also that the relu computes its gradient with respect to the output of the sigmoid, whilst the sigmoid computes its gradient with respect to the input x.
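
As a concrete illustration of that chain rule (a standalone sketch, not taken from the question's code):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def relu_derivative(x):
    return (x > 0).astype(float)

x = np.array([-2.0, 0.5, 3.0])

# forward pass: keep the intermediate value, the chain rule needs it
s = sigmoid(x)      # inner function
f = relu(s)         # outer function

# backward pass: f'(x) = relu'(sigmoid(x)) * sigmoid'(x)
# relu' is evaluated at s (the sigmoid output), sigmoid' at the input x
df_dx = relu_derivative(s) * s * (1.0 - s)

Since a sigmoid output is always positive, relu'(sigmoid(x)) is 1 everywhere in this particular composition; the masking effect of relu only shows up when its input can be negative, as with Wx + b in a hidden layer.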

Kevin
  • Which intermediate values do I need? I am currently saving the weights and the output of each layer in memory. Do I also need to save the value before activation? – unaied Mar 24 '21 at 14:59
  • Yes, the input (or the output of the previous layer) is used in back-propagation. For binary operations (i.e. addition or multiplication) you will have two inputs and that will require you to back-propagate two values. This might be easier to explain with an example of what your network looks like. – Kevin Mar 24 '21 at 17:24
  • I am not sure how to share my code, but basically I am working with the MNIST data. My architecture is set up with 2 hidden layers (10, 30) and one output layer with 10 neurons. I have updated my activation from sigmoid to relu and also modified the backprop phase so the hidden error takes into account the weighted sums computed before applying the relu activation during the forward phase. Accuracy is low and I feel like more updates to the code are needed. – unaied Mar 24 '21 at 18:01
  • Each hidden layer will typically multiply the input with some weight, add the bias and pass this through an activation function, i.e. `f(Wx + b)` where `f` is the activation function, `W` is the weight and `b` is the bias. If you understand how this is a composed function, you are able to calculate the derivative, which can easily be extended to other hidden layers (see the sketch below). There are tons of resources about this online, but it is hard to give any specific answer without more specific details about what you are trying to accomplish. – Kevin Mar 24 '21 at 18:27
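
To make that last comment concrete, here is a minimal sketch of one hidden layer computing f(Wx + b) with f = relu, together with the corresponding backward step (the layer sizes and names are illustrative):

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_derivative(z):
    return (z > 0).astype(float)

rng = np.random.default_rng(0)
W = rng.normal(size=(30, 10))       # weights of one hidden layer
b = np.zeros(30)                    # bias
x = rng.normal(size=10)             # input (or output of the previous layer)

# forward: keep the pre-activation z, the backward pass needs it
z = W @ x + b
h = relu(z)

# backward: given dL/dh flowing back from the next layer, apply the chain rule
dL_dh = rng.normal(size=30)         # placeholder upstream gradient
dL_dz = dL_dh * relu_derivative(z)  # relu derivative is evaluated at the pre-activation z
dL_dW = np.outer(dL_dz, x)          # gradient used to update W
dL_db = dL_dz                       # gradient used to update b
dL_dx = W.T @ dL_dz                 # error passed back to the previous layer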