
The code below is my main training loop:

iter_pos = 0
max_iter = 120
iter_cost = []
parameters = generate_parameters()
while iter_pos < max_iter:
    y_pred = forward_prop(x_train, parameters)
    cost_value = cost(y_hat=y_pred, y=multi_class_y_train)
    iter_cost.append(cost_value)

    delta_para = back_prop(parameters, multi_class_y_train, y_pred)
    parameters = update_parameters(parameters, delta_para)

    print(iter_pos, cost_value)
    iter_pos += 1

Now this is my forward propagation function:

def forward_prop(x_input, parameter):
    # nodes_values and n_layers are globals defined elsewhere;
    # the layer activations are cached here for use in back_prop
    a = x_input
    nodes_values['l1'] = a
    for pos in range(1, n_layers):
        w = parameter[f'w{pos}']
        b = parameter[f'b{pos}']
        z = np.dot(w, a) + b
        a = sigmoid(z)
        nodes_values[f'l{pos + 1}'] = a
    return a
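
(For reference, sigmoid here is the standard element-wise logistic function; my exact helper is not shown above, but it is roughly this sketch:)

import numpy as np

def sigmoid(z):
    # standard element-wise logistic: 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))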

Now comes the main back propagation function; I suspect I have made a mistake here:

def back_prop(parameters, y_true, y_pred):
    # delta at the output layer (lr and m are globals: learning rate and number of samples)
    delta = nodes_values[f'l{n_layers}'] - y_true
    delta_para = {}
    delta_para[f'delW{n_layers - 1}'] = np.dot(delta, nodes_values[f'l{n_layers - 1}'].T) * lr / m
    delta_para[f'delB{n_layers - 1}'] = np.sum(delta, axis=1, keepdims=True) * lr / m
    for pos in range(n_layers - 1, 1, -1):
        a = nodes_values[f'l{pos}']
        x = nodes_values[f'l{pos - 1}']
        delta = np.dot(parameters[f'w{pos}'].T, delta) * (a * (1 - a))
        delta_para[f'delW{pos - 1}'] = np.dot(delta, x.T) * lr / m
        delta_para[f'delB{pos - 1}'] = np.sum(delta, axis=1, keepdims=True) * lr / m
    return delta_para

After getting all the gradients, I update the parameters:

def update_parameters(parameters, delta_para):
    for pos in range(n_layers - 1, 0, -1):
        parameters[f'w{pos}'] -= delta_para[f'delW{pos}']
        parameters[f'b{pos}'] -= delta_para[f'delB{pos}']
    return parameters

These are my main code blocks; if required I can provide my complete code. Please suggest what the issue might be.

pytherhub
  • by "outputs" do you mean the prediction value becomes small where it always predicts 0.01? – rcshon Apr 15 '22 at 06:07
  • by final output I mean there are 26 nodes in the last layer; each node refers to a particular letter – pytherhub Apr 15 '22 at 06:48
  • but I am getting a value close to zero for every node, whereas ideally one node should be close to one; this is certainly not happening – pytherhub Apr 15 '22 at 06:50
  • I have been stuck on this issue for the past 4 days but still have no clue – pytherhub Apr 15 '22 at 06:50
  • so this is a multi-class classification problem? Are the classes mutually-exclusive? You should be using softmax instead of sigmoid at the last layer if the classes are mutually-exclusive – rcshon Apr 15 '22 at 06:54
  • they are mutually exclusive; it is the EMNIST data set for the letters a, b, c, d, ... – pytherhub Apr 15 '22 at 07:06
  • so there is no issue in the implementation of the back prop algorithm? – pytherhub Apr 15 '22 at 07:06
  • @rcshon if needed I can provide my complete code – pytherhub Apr 15 '22 at 07:07
  • I have used a one-hot encoder to encode the labels (1 means a, 2 = b, 7 = g and so on), so 7 becomes (0,0,0,0,0,0,1,0,0......): there are 6 zeroes to the left of the 1 and 19 to the right (see the sketch after these comments) – pytherhub Apr 15 '22 at 07:08
  • Yes, one hot encoding is the standard practice. Your math in the computation of gradients (in your backprop function) looks correct to me. My suspicion is the `sigmoid` at the last output layer -- change it to `softmax` and let us know how it goes – rcshon Apr 15 '22 at 07:16
  • @rcshon, well I have never tried the softmax function, so it would take me some time to understand and implement it; I will definitely try it and provide updates – pytherhub Apr 15 '22 at 07:37
  • In the meantime, if you find any mistake or bug or anything you feel I might have done wrong, please do throw some light on it. I am currently working in Google Colab, so if you need I can add you as a collaborator and you can run the code – pytherhub Apr 15 '22 at 07:40
  • and see for yourself what the mistake might be, as I am totally exhausted and frustrated: everything in the code looks completely fine to me, but the output is coming out wrong even though the cost function is decreasing – pytherhub Apr 15 '22 at 07:41
  • you can use softmax from the scipy library or just implement the simple formula yourself. I don't think anyone here on SO will go to the extent of collaborating on your project so feel free to ask a new question providing more details and showing us your outputs after you have narrowed down which part is the problem. I have left an answer below if it helps – rcshon Apr 15 '22 at 08:01
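
A minimal sketch of the one-hot encoding described in the comments above (the helper name one_hot and the column-per-sample layout are assumptions; the label convention 1 = a, ..., 26 = z is taken from the comment):

import numpy as np

def one_hot(labels, n_classes=26):
    # labels are 1-based: 1 = a, 2 = b, ..., 26 = z (as described in the comments)
    labels = np.asarray(labels)
    encoded = np.zeros((n_classes, labels.size))
    encoded[labels - 1, np.arange(labels.size)] = 1
    return encoded

# the label 7 ('g') becomes a vector with 6 zeroes before the 1 and 19 after it
print(one_hot([7]).ravel())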

1 Answer


As discussed in the comments, your issue is using sigmoid on the final layer instead of softmax for a multi-class, mutually-exclusive classification problem. A quick fix is to import the softmax function from scipy.special and use it in the last layer:

def forward_prop(x_input, parameter):
    a = x_input
    nodes_values['l1'] = a
    for pos in range(1, n_layers):
        w = parameter[f'w{pos}']
        b = parameter[f'b{pos}']
        z = np.dot(w, a) + b

        # Use softmax if this is the last layer
        if pos == n_layers - 1:
            a = softmax(z)

        # Use your choice of activation function otherwise (sigmoid in your case)
        else:
            a = sigmoid(z)

        nodes_values[f'l{pos + 1}'] = a

    return a

You can, of course, define your own softmax, as it's pretty simple:

def softmax(z, axis=0):
    # subtract the max for numerical stability; the result is mathematically unchanged
    exp = np.exp(z - np.max(z, axis=axis, keepdims=True))
    return exp / np.sum(exp, axis=axis, keepdims=True)
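
As a quick sanity check (this usage sketch assumes the 26-classes-by-m-samples column layout used in the question; scipy.special.softmax is the library alternative mentioned above), every column of the output should sum to 1:

import numpy as np
from scipy.special import softmax as scipy_softmax

z = np.random.randn(26, 5)            # fake logits: 26 classes, 5 samples (columns)
probs = softmax(z, axis=0)            # the softmax defined above
print(probs.sum(axis=0))              # each column sums to 1
print(np.allclose(probs, scipy_softmax(z, axis=0)))  # agrees with scipy's implementation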
rcshon