
I've written the following backpropagation routine for a neural network, using the code here as an example. The issue I'm facing confuses me and has pushed my debugging skills to their limit.

The problem I am facing is rather simple: as the neural network trains, its weights are being trained to zero with no gain in accuracy.

I have attempted to fix it many times, verifying that:

  • the training sets are correct
  • the target vectors are correct
  • the forward step is recording information correctly
  • the backward step deltas are recording properly
  • the signs on the deltas are correct
  • the weights are indeed being adjusted
  • the deltas of the input layer are all zero
  • there are no other errors or overflow warnings

Some information:

  • The training inputs are an 8x8 grid of [0,16) values representing an intensity; this grid represents a numeral digit (converted to a column vector)
  • The target vector is an output that is 1 in the position corresponding to the correct number
  • The original weights and biases are drawn from a Gaussian distribution
  • The activations are a standard sigmoid

I'm not sure where to go from here. I've verified that all things I know to check are operating correctly, and it's still not working, so I'm asking here. The following is the code I'm using to backpropagate:

import numpy as np

def backprop(train_set, wts, bias, eta):
    learning_coef = eta / len(train_set[0])

    for next_set in train_set:
        # These record the sum of the cost gradients in the batch
        sum_del_w = [np.zeros(w.shape) for w in wts]
        sum_del_b = [np.zeros(b.shape) for b in bias]

        for test, sol in next_set:
            del_w = [np.zeros(wt.shape) for wt in wts]
            del_b = [np.zeros(bt.shape) for bt in bias]
            # These two helper functions take training set data and make them useful
            next_input = conv_to_col(test)
            outp = create_tgt_vec(sol)

            # Feedforward step
            pre_sig = []; post_sig = []
            for w, b in zip(wts, bias):
                next_input = np.dot(w, next_input) + b
                pre_sig.append(next_input)
                next_input = sigmoid(next_input)
                post_sig.append(next_input)

            # Backpropagation gradient
            delta = cost_deriv(post_sig[-1], outp) * sigmoid_deriv(pre_sig[-1])
            del_b[-1] = delta
            del_w[-1] = np.dot(delta, post_sig[-2].transpose())

            for i in range(2, len(wts)):
                pre_sig_vec = pre_sig[-i]
                sig_deriv = sigmoid_deriv(pre_sig_vec)
                delta = np.dot(wts[-i+1].transpose(), delta) * sig_deriv
                del_b[-i] = delta
                del_w[-i] = np.dot(delta, post_sig[-i-1].transpose())

            sum_del_w = [dw + sdw for dw, sdw in zip(del_w, sum_del_w)]
            sum_del_b = [db + sdb for db, sdb in zip(del_b, sum_del_b)]

        # Modify weights based on current batch
        wts = [wt - learning_coef * dw for wt, dw in zip(wts, sum_del_w)]
        bias = [bt - learning_coef * db for bt, db in zip(bias, sum_del_b)]

    return wts, bias
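For reference, since they aren't shown above, the helper functions behave roughly like this (these bodies are my reconstruction of what the originals do, not the original code):

```python
import numpy as np

def sigmoid(z):
    # standard logistic activation
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    # derivative of the sigmoid w.r.t. its pre-activation input
    s = sigmoid(z)
    return s * (1.0 - s)

def conv_to_col(grid):
    # flatten the 8x8 intensity grid into a 64x1 column vector
    return np.reshape(grid, (-1, 1)).astype(float)

def create_tgt_vec(sol):
    # 10x1 one-hot vector with a 1 in the position of the correct digit
    tgt = np.zeros((10, 1))
    tgt[sol] = 1.0
    return tgt
```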

At Shep's suggestion, I checked what happens when training a network of shape `[2, 1, 1]` to always output 1, and indeed, the network trains properly in that case. My best guess at this point is that the gradient adjusts too strongly for the 0s and too weakly for the 1s, resulting in a net decrease despite an increase at each step, but I'm not sure.
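One more check I could add to the list above is a finite-difference gradient check: compare the analytic backprop gradient against numeric central differences on a toy single-layer net. This is a generic debugging sketch (the names `forward_cost` and `numeric_grad` are mine, not from the original code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_cost(w, b, x, y):
    # single-layer sigmoid net with quadratic cost, enough to sanity-check gradients
    a = sigmoid(np.dot(w, x) + b)
    return 0.5 * np.sum((a - y) ** 2)

def numeric_grad(w, b, x, y, eps=1e-6):
    # central finite differences over every weight entry
    grad = np.zeros_like(w)
    for idx in np.ndindex(w.shape):
        w_hi = w.copy(); w_hi[idx] += eps
        w_lo = w.copy(); w_lo[idx] -= eps
        grad[idx] = (forward_cost(w_hi, b, x, y) - forward_cost(w_lo, b, x, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
w = rng.normal(size=(1, 2)); b = rng.normal(size=(1, 1))
x = rng.normal(size=(2, 1)); y = np.ones((1, 1))

# analytic backprop gradient for the same one-layer net
z = np.dot(w, x) + b
a = sigmoid(z)
delta = (a - y) * a * (1 - a)
analytic = np.dot(delta, x.T)

print(np.max(np.abs(analytic - numeric_grad(w, b, x, y))))  # tiny if backprop is right
```

If the analytic and numeric gradients agree on the toy net but not on the full one, the bug is in how the layers are indexed rather than in the calculus.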

  • since you haven't hard-coded the topology of the net, here's a suggestion: try training with zero or one hidden layers (each with one node), one input, and one output, to see if that does what you'd expect. – Shep May 27 '15 at 19:51
  • @Shep Interestingly, when running on a nnet with shape `[2, 1, 1]`, the weights are actually trained correctly. Training it to output 1, the weights actually _do_ increase, and the output converges at 1. Interesting test. –  May 27 '15 at 20:09
  • 1) By "the weights are being trained to zero", you mean that the weights start out as random numbers, but seem to converge to zero? 2) How large is your actual network? 3) Try scaling your inputs to the range [0, 1]. – cfh May 27 '15 at 21:30
  • @cfh Yeah, weights end up dropping to zero despite being initially random. The network I'm trying to train has the shape `[64, 15, 25, 10]`. The result doesn't change when the inputs are scaled to `[0,1]`. –  May 27 '15 at 21:43
  • How do you use the learning rate? – matousc Jun 17 '15 at 12:27
  • How you initialized weights of NN ? – Yuriy Zaletskyy Sep 10 '15 at 08:13
  • @YuraZaletskyy The weights were randomized using a normal Gaussian distribution. –  Sep 10 '15 at 08:14
  • Does weights have different signs and relatively small ( < 1) values ? – Yuriy Zaletskyy Sep 10 '15 at 08:19
  • @YuraZaletskyy The weights are all positive values, as are the inputs. The values are frequently less than one, because they follow a Gaussian distribution; however, that isn't necessarily the case. –  Sep 10 '15 at 08:20
  • can you please present sample of input -> output vector which you feed to NN? – Yuriy Zaletskyy Sep 10 '15 at 08:38
  • @YuraZaletskyy Truth be told, this question is a bit obsolete. It's since been resolved through a rewriting of the design spec and subsequent rewriting of the training method, and I no longer have access to the original code. I'm working from memory. –  Sep 10 '15 at 08:39
  • Oh, that's sad, I liked to see deeper – Yuriy Zaletskyy Sep 10 '15 at 08:40
  • @YuraZaletskyy Yeah, I understand. Part of me wishes I still had access to this code now that someone is interested in helping solve the problem ;) Thanks anyway, though! –  Sep 10 '15 at 08:41
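A quick sketch of cfh's scaling suggestion from the comments, using hypothetical data just to show the transform:

```python
import numpy as np

# the inputs are [0, 16) intensities; dividing by 16 rescales them to [0, 1)
# so sigmoid pre-activations don't start out deeply saturated
grid = np.random.default_rng(1).integers(0, 16, size=(8, 8))
scaled = grid / 16.0
print(scaled.min() >= 0.0 and scaled.max() < 1.0)  # True
```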

1 Answer


I suppose your problem is in the choice of initial weights and in the choice of weight-initialization algorithm. Jeff Heaton, the author of Encog, claims that Gaussian initialization usually performs worse than other initialization methods. There are also other published comparisons of weight-initialization algorithm performance. From my own experience, I recommend initializing your weights with values of different signs. Even in cases where all my outputs were positive, weights of mixed signs performed better than weights of the same sign.
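For example, sampling from a zero-mean Gaussian scaled by 1/sqrt(fan_in) (a Xavier-style heuristic; the `init_weights` name here is my own) gives weights of both signs for a network of your shape:

```python
import numpy as np

def init_weights(sizes, rng=None):
    # zero-mean Gaussian: roughly half the weights come out negative, and the
    # 1/sqrt(fan_in) scale keeps sigmoid pre-activations away from saturation
    rng = rng or np.random.default_rng()
    wts = [rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
    bias = [rng.normal(0.0, 1.0, size=(n_out, 1)) for n_out in sizes[1:]]
    return wts, bias

wts, bias = init_weights([64, 15, 25, 10])
print([w.shape for w in wts])  # [(15, 64), (25, 15), (10, 25)]
```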

Yuriy Zaletskyy