
I decided to implement the backpropagation through time (BPTT) algorithm in order to train an RNN without LSTM cells and without biases. I am using the cross-entropy loss function, a tanh activation function at the hidden layer, and a softmax activation function at the output layer. The RNN consists of one input, one hidden and one output layer.
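
Roughly, the forward pass looks like this (a simplified sketch, not my exact code; it uses the same names w_hx, w_hh, w_oh, h_time_steps, o_time_steps that are explained below, and X[t] is an index, i.e. a one-hot input):

def forward(self, X):
    time_steps = len(X)
    self.h_time_steps = np.zeros((time_steps, self.num_hidden))
    self.o_time_steps = np.zeros((time_steps, self.num_output))
    h_prev = np.zeros(self.num_hidden)
    for t in range(time_steps):
        # one-hot input: only column X[t] of w_hx contributes, and there are no biases
        a = self.w_hx[:, X[t]] + np.matmul(self.w_hh, h_prev)
        h = np.tanh(a)
        # softmax at the output layer
        z = np.matmul(self.w_oh, h)
        o = np.exp(z - np.max(z))
        o /= np.sum(o)
        self.h_time_steps[t] = h
        self.o_time_steps[t] = o
        h_prev = h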

I wrote the code and it works: the loss decreases over epochs (I took only around 20 examples per epoch, just for testing). Moreover, I compared my derivatives with the BPTT code written in this blog and they match perfectly. But the main problem is that my implementation of BPTT is very slow.

For reference, below is my BPTT implementation:

def backward(self, X, Y):
        time_steps = len(X)
        self.dLdw_hx = np.zeros(self.w_hx.shape, dtype=float)
        self.dLdw_oh = np.zeros(self.w_oh.shape, dtype=float)
        self.dLdw_hh = np.zeros(self.w_hh.shape, dtype=float)
        # taking base cases to be zero

        dhprevdw_hx = np.zeros((self.num_hidden*self.num_input, self.num_hidden), dtype=float)
        dhprevdw_hh = np.zeros((self.num_hidden*self.num_hidden, self.num_hidden), dtype=float)

        for t in range(time_steps):
            # gradient of softmax + cross-entropy loss at the output: y_hat - y (one-hot target)
            y_hat_y = np.array(self.o_time_steps[t].reshape((self.num_output, 1)))
            y_hat_y[Y[t]] -= 1.0
            dldw_oh = np.matmul(y_hat_y, self.h_time_steps[t].reshape((1, self.num_hidden)))

            # derivative of tanh at this timestep: 1 - h_t^2
            tanh_diff = np.ones(self.num_hidden) - np.power(self.h_time_steps[t], 2)

            # carry the previous timestep's derivatives of h w.r.t. w_hx through w_hh,
            # and compute dl_t/dh_t (reused for both dldw_hx and dldw_hh)
            dhdw_hx = np.matmul(dhprevdw_hx, self.w_hh.T)
            temp = np.matmul(self.w_oh.T, y_hat_y)

            # direct contribution of w_hx: row i*num_input + X[t] corresponds to w_hx[i, X[t]]
            for i in range(self.num_hidden):
                row_start = i*self.num_input
                dhdw_hx[row_start + X[t], i] += 1.0
            dhdw_hx *= tanh_diff
            dldw_hx = np.matmul(dhdw_hx, temp).reshape(self.w_hx.shape)
            dhprevdw_hx = dhdw_hx

            # same recursion for w_hh; here the direct contribution of w_hh[i, :] is h_{t-1}
            dhdw_hh = np.matmul(dhprevdw_hh, self.w_hh.T)
            for i in range(self.num_hidden):
                row_start = i*self.num_hidden
                row_end = i*self.num_hidden + self.num_hidden
                dhdw_hh[row_start:row_end, i] += self.h_time_steps[t-1]
            dhdw_hh *= tanh_diff
            dldw_hh = np.matmul(dhdw_hh, temp).reshape(self.w_hh.shape)
            dhprevdw_hh = dhdw_hh

            self.dLdw_oh += dldw_oh
            self.dLdw_hx += dldw_hx
            self.dLdw_hh += dldw_hh

  • X = input sequence (a vector), Y = target sequence (a vector). For example: X = [2,6,7,8], Y = [6,7,8,1]
  • L = total loss after all time steps
  • w_hx = input to hidden weights
  • w_oh = hidden to output weights
  • w_hh = hidden to hidden weights
  • dadb = derivative of 'a' w.r.t 'b'
  • dhprevdw_hx = derivative of previous hidden nodes w.r.t w_hx
  • dhprevdw_hh = derivative of previous hidden nodes w.r.t w_hh
  • o_time_steps = matrix of dim time_steps*output_nodes; o_time_steps[t] gives us the output vector at time step 't'
  • h_time_steps = matrix of dim time_steps*hidden_nodes; h_time_steps[t] gives us the hidden vector at time step 't'
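
For concreteness, these are the shapes involved (tiny example sizes, just for illustration):

import numpy as np

num_input, num_hidden, num_output = 8, 4, 8    # example sizes
w_hx = np.zeros((num_hidden, num_input))       # input -> hidden weights
w_hh = np.zeros((num_hidden, num_hidden))      # hidden -> hidden weights
w_oh = np.zeros((num_output, num_hidden))      # hidden -> output weights
# row i*num_input + k of dhprevdw_hx holds the derivatives of the hidden
# vector w.r.t. w_hx[i, k]; the columns index the hidden units
dhprevdw_hx = np.zeros((num_hidden * num_input, num_hidden))
dhprevdw_hh = np.zeros((num_hidden * num_hidden, num_hidden))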

In the outer for loop we iterate over the individual timesteps and calculate dldw_hx, dldw_oh and dldw_hh, where l is the loss at that timestep.

Finally, we add each timestep's derivatives into the appropriate matrices dLdw_hx, dLdw_oh and dLdw_hh, and we use these values to update the weights of the neural network.
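
The update step itself is then just a gradient-descent step on these accumulated matrices, along the lines of (a simplified sketch; the learning-rate handling in my real code may differ):

def update_weights(self, learning_rate):
    # gradient-descent step on the gradients accumulated by backward()
    self.w_hx -= learning_rate * self.dLdw_hx
    self.w_oh -= learning_rate * self.dLdw_oh
    self.w_hh -= learning_rate * self.dLdw_hh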

Let wx be the weight between the first input node and the first hidden node. If we just consider wx, then we need to find dl_tdwx.

We know that in order to find dl_tdwx we need to find dh_tdwx (h_t is the hidden state vector at time step 't').

Here h_t comprises h1_t, h2_t, ..., hn_t, where n is the number of hidden nodes. In order to find dh1_tdwx, we need to find dh1_t-1dwx, dh2_t-1dwx, ..., dhn_t-1dwx, since h1_t depends on the values of h1_t-1, h2_t-1, ..., hn_t-1.

So we need to store dh1_t-1dwx, dh2_t-1dwx, ..., dhn_t-1dwx in order to find dh1_tdwx; similarly for dh2_tdwx, ..., dhn_tdwx, and likewise for all the other weights between the input and hidden layer.
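
In equation form, this is the recursion the code above computes (since the input is one-hot, the direct term is non-zero only when X[t] is the first input node):

$$
\frac{\partial h_{j,t}}{\partial w_x}
  = \bigl(1 - h_{j,t}^{2}\bigr)
    \Bigl([\,j = 1\,]\, x_{1,t}
      \;+\; \sum_{k=1}^{n} w_{hh}[j,k]\,\frac{\partial h_{k,t-1}}{\partial w_x}\Bigr)
$$

so computing the derivatives at time t needs all n derivatives from time t-1.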

So, in order to store those previous-timestep derivatives of the hidden layer w.r.t. w_hx, I have created the matrix dhprevdw_hx of dimension (hidden_nodes*input_nodes) x hidden_nodes, which has space complexity O(n^3). I believe this is what makes the algorithm slow.

The same reasoning applies to dhprevdw_hh.
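
To get a feel for the scale, a quick back-of-the-envelope calculation (the sizes below are made-up example numbers, not my actual dimensions):

# example sizes only, to illustrate the scale
num_input, num_hidden = 8000, 100

# entries in the two per-timestep derivative matrices
entries_hx = (num_hidden * num_input) * num_hidden    # 80,000,000
entries_hh = (num_hidden * num_hidden) * num_hidden   #  1,000,000
print((entries_hx + entries_hh) * 8 / 1e6, "MB of float64")   # ~648 MB

# and every timestep multiplies the (num_hidden*num_input, num_hidden) matrix
# by w_hh.T: about num_hidden*num_input*num_hidden*num_hidden multiply-adds per step
print(num_hidden * num_input * num_hidden * num_hidden / 1e9, "G multiply-adds")  # 8.0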

It would be great if someone could point out the problem with my implementation.

I think the code, or the way I am calculating the derivatives, can be improved.
