
I am implementing binary addition with a Recurrent Neural Network (RNN) as an exercise. I have run into an issue implementing it in Python, so I am sharing my problem here in the hope of getting ideas to fix it.

As can be seen in my notebook code (Backpropagation through time (BPTT) section), there is a chain rule like the one below for updating the input weight matrix:

[image: the chain rule for the gradient of the input weight matrix]

My problem is this part:

[image: the term of that chain rule involving W_input]

I've tried to implement this part in my Python code and in my notebook code (class input_layer, backward method), but a dimension mismatch raises an error.

In my sample code, W_hidden is 16×16, whereas the delta for pre_hidden is 1×2, so the matrix multiplication fails. If you run the code, you will see the error.
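For reference, here is a minimal sketch of the kind of shape mismatch I mean (the 1×2 and 16×16 shapes are just illustrative, not copied from my notebook):

import numpy as np

W_hidden = np.random.randn(16, 16)        # hidden-to-hidden weight matrix
delta_pre_hidden = np.random.randn(1, 2)  # delta coming out of my backward pass

# Raises ValueError: shapes (1,2) and (16,16) not aligned: 2 (dim 1) != 16 (dim 0)
grad = delta_pre_hidden.dot(W_hidden)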

I have spent a lot of time checking both my chain rule and my code. I believe the chain rule is right, so the only remaining source of the error is my code.

As far as I know, multiplying matrices whose dimensions do not match is impossible. If my chain rule is correct, how can it be implemented in Python? Any ideas?

Thanks in advance.

Ali Soltani

1 Answer


You need to apply dimension balancing to the gradients. Taken from Stanford's cs231n course, it comes down to two simple rules:

Given $Y = XW$ and the upstream gradient $\frac{\partial L}{\partial Y}$, we will have:

$\frac{\partial L}{\partial W} = X^\top \frac{\partial L}{\partial Y}$, $\quad \frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} W^\top$
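As a quick sanity check of these two rules on a single linear map (the shapes are picked arbitrarily; this is just a sketch):

import torch

torch.random.manual_seed(0)

x = torch.randn(1, 8, requires_grad=True)
w = torch.randn(8, 16, requires_grad=True)

y = x.mm(w)
loss = y.sum()
loss.backward()

# Since loss = sum(Y), the upstream gradient dL/dY is all ones
dy = torch.ones(1, 16)

# dL/dW = X^T (dL/dY) -> (8, 1) mm (1, 16) = (8, 16), matching W
w_grad = x.t().mm(dy)
# dL/dX = (dL/dY) W^T -> (1, 16) mm (16, 8) = (1, 8), matching X
x_grad = dy.mm(w.t())

assert torch.allclose(w.grad, w_grad)
assert torch.allclose(x.grad, x_grad)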

Here is the code I used to ensure the gradient calculation is correct. You should be able to update your code accordingly.

import torch

torch.random.manual_seed(0)

# Inputs for two time steps, a target, and the initial hidden state
x_1, x_2 = torch.zeros(size=(1, 8)).normal_(0, 0.01), torch.zeros(size=(1, 8)).normal_(0, 0.01)
y = torch.zeros(size=(1, 8)).normal_(0, 0.01)

h_0 = torch.zeros(size=(1, 16)).normal_(0, 0.01)
# Input-to-hidden (8 x 16), hidden-to-hidden (16 x 16), and hidden-to-output (16 x 8) weights
weight_ih = torch.zeros(size=(8, 16)).normal_(mean=0, std=0.01).requires_grad_(True)
weight_hh = torch.zeros(size=(16, 16)).normal_(mean=0, std=0.01).requires_grad_(True)
weight_ho = torch.zeros(size=(16, 8)).normal_(mean=0, std=0.01).requires_grad_(True)

# Forward pass over two time steps (note: no nonlinearity is applied to h_1)
h_1 = x_1.mm(weight_ih) + h_0.mm(weight_hh)
h_2 = x_2.mm(weight_ih) + h_1.mm(weight_hh)
g_2 = h_2.sigmoid()
j_2 = g_2.mm(weight_ho)
y_predicted = j_2.sigmoid()

# Sum-of-squares loss; backward() fills the .grad fields we compare against below
loss = 0.5 * (y - y_predicted).pow(2).sum()

loss.backward()


# Manual backward pass
delta_1 = -1 * (y - y_predicted) * y_predicted * (1 - y_predicted)  # dL/dj_2
delta_2 = delta_1.mm(weight_ho.t()) * (g_2 * (1 - g_2))             # dL/dh_2
delta_3 = delta_2.mm(weight_hh.t())                                 # dL/dh_1

# 16 x 8: the outer product g_2^T delta_1, computed via (16, 1) * (1, 8) broadcasting
weight_ho_grad = g_2.t() * delta_1

# 16 x 16
weight_hh_grad = h_1.t() * delta_2 + (h_0.t() * delta_3)

# 8 x 16
weight_ih_grad = x_2.t() * delta_2 + x_1.t() * delta_3

# Compare the manual gradients against the ones computed by autograd
atol = 1e-10
assert torch.allclose(weight_ho.grad, weight_ho_grad, atol=atol)
assert torch.allclose(weight_hh.grad, weight_hh_grad, atol=atol)
assert torch.allclose(weight_ih.grad, weight_ih_grad, atol=atol)
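Once the manual gradients match autograd, the weight update itself is ordinary gradient descent; a minimal sketch (the learning rate below is an arbitrary placeholder, tune it for your problem):

lr = 0.1  # hypothetical learning rate

with torch.no_grad():
    weight_ih -= lr * weight_ih_grad
    weight_hh -= lr * weight_hh_grad
    weight_ho -= lr * weight_ho_grad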
Mohammad Arvan