
I'm implementing a fully connected neural network for MNIST (not convolutional!) and I'm running into a problem. After several forward and backward passes, the exponents become abnormally large and Python can no longer compute them. It seems to me that I implemented backward_pass incorrectly. Could you help me with this? Here are the network settings:

import numpy as np
from sklearn.metrics import accuracy_score

w_1 = np.random.uniform(-0.5, 0.5, (128, 784))
b_1 = np.random.uniform(-0.5, 0.5, (128, 1))
w_2 = np.random.uniform(-0.5, 0.5, (10, 128))
b_2 = np.random.uniform(-0.5, 0.5, (10, 1))
X_train shape:  (784, 31500)
y_train shape:  (31500,)
X_test shape:  (784, 10500)
y_test shape:  (10500,)
def sigmoid(x, alpha):
    return 1 / (1 + np.exp(-alpha * x))

def dx_sigmoid(x, alpha):
    exp_neg_x = np.exp(-alpha * x)

    return alpha * exp_neg_x / ((1 + exp_neg_x)**2)

def ReLU(x):
    return np.maximum(0, x)

def dx_ReLU(x):
    return np.where(x > 0, 1, 0)

def one_hot(y):
    one_hot_y = np.zeros((y.size, y.max() + 1))
    one_hot_y[np.arange(y.size), y] = 1
    one_hot_y = one_hot_y.T
    
    return one_hot_y
def forward_pass(X, w_1, b_1, w_2, b_2):
    layer_1 = np.dot(w_1, X) + b_1
    layer_1_act = ReLU(layer_1)

    layer_2 = np.dot(w_2, layer_1_act) + b_2
    layer_2_act = sigmoid(layer_2, 0.01)
    
    return layer_1, layer_1_act, layer_2, layer_2_act
def backward_pass(layer_1, layer_1_act, layer_2, layer_2_act, X, y, w_2):
    one_hot_y = one_hot(y)
    n_samples = one_hot_y.shape[1]

    d_loss_by_layer_2_act = (2 / n_samples) * np.sum(one_hot_y - layer_2_act, axis=1).reshape(-1, 1)

    d_layer_2_act_by_layer_2 = dx_sigmoid(layer_2, 0.01)
    d_loss_by_layer_2 = d_loss_by_layer_2_act * d_layer_2_act_by_layer_2
    d_layer_2_by_w_2 = layer_1_act.T

    d_loss_by_w_2 = np.dot(d_loss_by_layer_2, d_layer_2_by_w_2)
    d_loss_by_b_2 = np.sum(d_loss_by_layer_2, axis=1).reshape(-1, 1)

    d_layer_2_by_layer_1_act = w_2.T
    d_loss_by_layer_1_act = np.dot(d_layer_2_by_layer_1_act, d_loss_by_layer_2)
    d_layer_1_act_by_layer_1 = dx_ReLU(layer_1)
    d_loss_by_layer_1 = d_loss_by_layer_1_act * d_layer_1_act_by_layer_1
    d_layer_1_by_w_1 = X.T

    d_loss_by_w_1 = np.dot(d_loss_by_layer_1, d_layer_1_by_w_1)
    d_loss_by_b_1 = np.sum(d_loss_by_layer_1, axis=1).reshape(-1, 1)

    return d_loss_by_w_1, d_loss_by_b_1, d_loss_by_w_2, d_loss_by_b_2
for epoch in range(epochs):
    layer_1, layer_1_act, layer_2, layer_2_act = forward_pass(X_train, w_1, b_1, w_2, b_2)

    d_loss_by_w_1, d_loss_by_b_1, d_loss_by_w_2, d_loss_by_b_2 = backward_pass(layer_1, layer_1_act,
                                                                               layer_2, layer_2_act,
                                                                               X_train, y_train,
                                                                               w_2)

    w_1 -= learning_rate * d_loss_by_w_1
    b_1 -= learning_rate * d_loss_by_b_1
    w_2 -= learning_rate * d_loss_by_w_2
    b_2 -= learning_rate * d_loss_by_b_2

    _, _, _, predictions = forward_pass(X_train, w_1, b_1, w_2, b_2)
    predictions = predictions.argmax(axis=0)

    accuracy = accuracy_score(predictions, y_train)

    print(f"epoch: {epoch} / accuracy: {accuracy}")

My loss is MSE: (1 / n_samples) * np.sum((one_hot_y - layer_2_act)**2, axis=0)

These are my calculations.

I tried decreasing the learning rate, adding the alpha coefficient to the exponent (e^(-alpha * x)) in the sigmoid, and dividing the entire dataset by 255, and still the program cannot learn because the numbers get too large.


1 Answer


To start, the uniform initialization you are using has a relatively large standard deviation. For a linear layer, the bound should be about 1/sqrt(fan_in), which here is on the order of:

1 / np.sqrt(128)
0.08838834764831843

which means:

w_1 = np.random.uniform(-0.08, 0.08, (128, 784))
...
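
For example, here is a minimal sketch of initializing all four parameters this way. It assumes the layer shapes from the question and scales each layer by the reciprocal square root of its own input size (784 for the first layer, 128 for the second), which is one common convention:

import numpy as np

# Sketch: uniform init with per-layer bound 1/sqrt(fan_in) (an assumed convention)
bound_1 = 1.0 / np.sqrt(784)   # first layer receives 784 inputs
bound_2 = 1.0 / np.sqrt(128)   # second layer receives 128 inputs

w_1 = np.random.uniform(-bound_1, bound_1, (128, 784))
b_1 = np.zeros((128, 1))       # biases can simply start at zero
w_2 = np.random.uniform(-bound_2, bound_2, (10, 128))
b_2 = np.zeros((10, 1))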

I also did not check your forward and backward passes. Assuming they are correct and you still see very large values in your activations, you could also normalize them (for example with an implementation of batch norm or layer norm) to force them to be centered around zero with unit std.
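
As a rough illustration of that idea (not code from the question; the helper names are made up), here is a minimal sketch of standardizing the inputs and applying a layer-norm-style rescaling to a pre-activation matrix:

import numpy as np

def standardize_inputs(X):
    # X has shape (features, samples); scale pixels to [0, 1], then
    # shift/scale to roughly zero mean and unit std
    X = X / 255.0
    return (X - X.mean()) / (X.std() + 1e-8)

def layer_norm(z, eps=1e-8):
    # normalize each column (one sample) of a pre-activation matrix
    # to zero mean and unit std across its features
    mean = z.mean(axis=0, keepdims=True)
    std = z.std(axis=0, keepdims=True)
    return (z - mean) / (std + eps)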

P.S.: I also noticed you are doing multi-class classification, so MSE is not a good choice; use softmax (or log-softmax, which is easier to implement) with a cross-entropy loss. The loss not moving fast enough could also be linked to a poorly chosen learning rate. Also, are your inputs normalized? You could plot the distributions of the layer activations and check that they look reasonable.
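
For reference, a minimal sketch of a numerically stable softmax plus cross-entropy loss (illustrative helper names, assuming the same (classes, samples) layout as in the question). A convenient property of this pairing is that the gradient with respect to the logits reduces to (probs - one_hot_y) / n_samples:

import numpy as np

def softmax(z):
    # z has shape (classes, samples); subtract the column-wise max for stability
    z = z - z.max(axis=0, keepdims=True)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum(axis=0, keepdims=True)

def cross_entropy(probs, one_hot_y):
    # average negative log-likelihood over the batch
    n_samples = one_hot_y.shape[1]
    return -np.sum(one_hot_y * np.log(probs + 1e-12)) / n_samples

# With this loss, the gradient w.r.t. the logits (layer_2) simplifies to:
# d_loss_by_layer_2 = (probs - one_hot_y) / n_samples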

amirhm
  • I did that, but my model now just learns very, very slowly; the loss barely moves. I have seen similar models train many times faster, but I could not understand how they computed their gradients. I derived my gradients and checked them, and I believe they are all correct; please correct me if that is not the case. This is the model's learning process: epoch: 0 / accuracy: 0.125587301587301, epoch: 100 / accuracy: 0.1252063492063, epoch: 200 / accuracy: 0.1249841269841, epoch: 300 / accuracy: 0.1245396825396, epoch: 400 / accuracy: 0.1243809523809, epoch: 500 / accuracy: 0.1244126984126 – Амур Дзагкоев Jan 15 '23 at 21:12
  • Now I noticed you are also doing multi-class classification, so MSE would not be a good choice; use softmax or log-softmax (easier implementation). But the loss not moving fast enough could also be linked to a poor learning rate as well. And by the way, are your inputs normalized? You could also plot the distributions for the layers and see if they look good. And I believe the post also answered your main question, so you could vote it up! – amirhm Jan 15 '23 at 22:12
  • The solution turned out to be simpler: I had accidentally transposed something I shouldn't have. But I will mark your answer as correct, since softmax also helped fix the error. Thank you! – Амур Дзагкоев Jan 16 '23 at 10:12
  • @АмурДзагкоев Glad that helped, anytime. I included that part of the comment in the answer for anyone who reads it without looking at the comments. – amirhm Jan 16 '23 at 16:49