
I'm using an MLP with Keras, optimized with SGD. I want to tune the learning rate, but it seems to have no effect whatsoever on training. I tried small learning rates (0.01) as well as very large ones (up to 1e28), and the effects are barely noticeable. Shouldn't my loss explode when using a very large learning rate?

I'm using a fully-connected NN with 3 hidden layers and sigmoid activation functions. The loss is a variant of binary cross-entropy. The goal is to predict credit default. The training set contains 500000 examples, with approx. 2% defaults. The test set contains 200000 rows.


import keras

# Weighted binary cross-entropy: p penalizes errors on the positive (default) class more heavily.
def loss_custom_w(p):
    def loss_custom(y, yhat):
        y_l, y_lhat = keras.backend.flatten(y), keras.backend.flatten(yhat)
        eps = keras.backend.epsilon()
        y_lhat = keras.backend.clip(y_lhat, eps, 1 - eps)  # avoid log(0)

        return -keras.backend.mean(p * y_l * keras.backend.log(y_lhat)
                                   + (1 - y_l) * keras.backend.log(1 - y_lhat))
    return loss_custom

model = keras.Sequential([
    keras.layers.Dense(n_input),
    keras.layers.Dense(500, activation='sigmoid'),
    keras.layers.Dense(400, activation='sigmoid'),
    keras.layers.Dense(170, activation='sigmoid'),
    keras.layers.Dense(120, activation='sigmoid'),
    keras.layers.Dense(1, activation='sigmoid'),
])
sgd = keras.optimizers.SGD(lr=1e20)
model.compile(optimizer=sgd, loss=loss_custom_w(8))
model.fit(x_train, y_train, epochs=10, batch_size=1000)

Update:

  • I've tried changing the activation functions to avoid vanishing gradients, but it did not work.

  • The problem does not come from the loss function (I tried other losses too).

  • Actually the network seems to work well, and so does the custom loss. When I change the value of p, it does what it's expected to do. I just can't figure out why the learning rate has no effect. The classifier also gives satisfying results.

  • The network manages to predict labels from both classes. It predicts the 1 class better when I use a large penalty value (as expected).

R B
  • Can you provide more details on the data you are using and some code samples? – jawsem Mar 28 '20 at 15:43
  • Yes. I'm using a fully-connected NN with 3 hidden layers and sigmoid activation functions. The loss is a variant of binary cross-entropy. The goal is to predict credit default. The training set contains 500000 examples, with approx. 2% defaults. The test set contains 200000 rows. – R B Mar 28 '20 at 15:49
  • Thanks for the additional detail. Can you provide the actual code you wrote in Keras, or at least generally what it looks like? And if possible, can you add that detail to the original question? I will admit I am not too familiar with Keras and was more curious about the answer to this question than answering it myself. If you provide more detail, someone else should be able to answer it. – jawsem Mar 28 '20 at 15:56
  • There are four hidden layers using 'sigmoid' activation, plus the input and output layers, which may lead to vanishing gradients; that means the gradient cannot reach the earlier layers. 1) Decrease your hidden layers to two or three. 2) Change your activation function to 'ReLU'. I hope it can help you. – Matin Shokri Mar 28 '20 at 16:45
  • Thanks. I changed my activation functions to ReLU to avoid vanishing gradients, but it didn't work (the learning rate still has no effect). I also tried other activation functions, and still no effect. – R B Mar 28 '20 at 17:53
  • Not sure what your `return loss_custom` actually returns, since you don't seem to define `loss_custom` anywhere in your function. In any case, try with the standard binary cross-entropy loss first (it doesn't look very different from your custom one, except for `p`) and change all `sigmoid` to `relu` (except for the final layer); if it works, you will know that the issue is in your custom loss. – desertnaut Mar 29 '20 at 03:12
  • Does your model predict labels of both classes? 2% against 98% is very unbalanced. Maybe your loss does not change because for all learning rates the produced outcome is the same. – Ach113 Mar 30 '20 at 10:24
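Following desertnaut's suggestion above, a minimal sanity check might look like the sketch below. It assumes the same `x_train`, `y_train` and `n_input` as in the question; `check_model` is just an illustrative name, and the built-in 'binary_crossentropy' loss and `relu` activations replace the custom loss and the `sigmoid` hidden layers.

# Sanity check: same architecture, but stock loss and relu hidden layers,
# purely to see whether the learning rate starts to have an effect.
check_model = keras.Sequential([
    keras.layers.Dense(n_input),
    keras.layers.Dense(500, activation='relu'),
    keras.layers.Dense(400, activation='relu'),
    keras.layers.Dense(170, activation='relu'),
    keras.layers.Dense(120, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid'),
])
check_model.compile(optimizer=keras.optimizers.SGD(lr=0.01), loss='binary_crossentropy')
check_model.fit(x_train, y_train, epochs=2, batch_size=1000)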

1 Answer


Finally I got it: I did not specify the input shape in my model (I left the `input_shape` keyword argument of the first layer at its default, `None`). When I specified it, it suddenly worked. I do not really understand why specifying the input shape is so important.
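For reference, a minimal sketch of the fix (assuming, as in the question, that `n_input` is the number of input features; only the first layer changes, gaining an explicit `input_shape`):

# Same model as in the question, but with the input shape declared on the
# first layer, so the network is built against the real feature dimension.
model = keras.Sequential([
    keras.layers.Dense(n_input, input_shape=(n_input,)),
    keras.layers.Dense(500, activation='sigmoid'),
    keras.layers.Dense(400, activation='sigmoid'),
    keras.layers.Dense(170, activation='sigmoid'),
    keras.layers.Dense(120, activation='sigmoid'),
    keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer=keras.optimizers.SGD(lr=0.01), loss=loss_custom_w(8))

With the input shape specified, different learning rates now produce visibly different training behaviour, as described above.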

R B