
I am currently trying to use a neural network to make regression predictions.

However, I don't know the best way to handle this, as I have read that there are two different ways to do regression predictions with a NN.

1) Some websites/articles suggest adding a final layer which is linear. http://deeplearning4j.org/linear-regression.html

My final layers would then look like this, I think:

layer1 = tanh(layer0*weight1 + bias1)

layer2 = identity(layer1*weight2 + bias2)
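
For concreteness, here is a minimal sketch of this architecture (the framework, input size, and hidden width are just illustrative assumptions on my part, not part of the articles I read):

    import torch
    import torch.nn as nn

    # Option (1): tanh hidden layer followed by a plain linear (identity) output.
    model = nn.Sequential(
        nn.Linear(10, 32),   # layer0 -> layer1 (input size 10 is an assumption)
        nn.Tanh(),           # layer1 = tanh(layer0*weight1 + bias1)
        nn.Linear(32, 1),    # layer2 = identity(layer1*weight2 + bias2)
    )

    criterion = nn.MSELoss()                                   # usual regression loss
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    x = torch.randn(64, 10)   # dummy batch, only to show the shapes
    y = torch.randn(64, 1)
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()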

I also noticed that when I use this solution, the predictions usually collapse to the batch mean. This happens whether I use tanh or sigmoid as the penultimate layer.

2) Other websites/articles suggest scaling the output to a [-1,1] or [0,1] range and using tanh or sigmoid as the final activation.
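
A minimal sketch of this second option (assuming scikit-learn's MinMaxScaler and the same toy shapes as above; these choices are mine, not from the articles):

    import numpy as np
    import torch
    import torch.nn as nn
    from sklearn.preprocessing import MinMaxScaler

    y_raw = np.random.uniform(0.0, 500.0, size=(64, 1))   # unscaled targets
    scaler = MinMaxScaler(feature_range=(0, 1))            # [0,1] range pairs with sigmoid
    y_scaled = torch.tensor(scaler.fit_transform(y_raw), dtype=torch.float32)

    model = nn.Sequential(
        nn.Linear(10, 32),
        nn.Tanh(),
        nn.Linear(32, 1),
        nn.Sigmoid(),      # output constrained to [0,1], matching the scaled targets
    )

    x = torch.randn(64, 10)
    pred_scaled = model(x)

    # Map the predictions back to the original target range.
    pred = scaler.inverse_transform(pred_scaled.detach().numpy())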

Are these two solutions acceptable? Which one should be preferred?

Thanks, Paul

Paul Rolin

1 Answer


I would prefer the second case: normalize the targets, use a sigmoid as the output activation, and then scale the normalized predictions back to their actual range.

The reason is that in the first case, to output large values (and the actual target values are large in most cases), the weights mapping the penultimate layer to the output layer have to become large. For faster convergence the learning rate then has to be made larger, but a larger learning rate can also make the earlier layers diverge. Working with normalized target values keeps the weights small, so they learn quickly.

In short, the first method learns slowly, or may diverge if a larger learning rate is used; the second method is comparatively safer and learns quickly.
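
As a rough numeric illustration of the weight-magnitude point (the numbers below are made up purely for the example): with a tanh penultimate layer the activations are bounded in [-1, 1], so reaching a target near 500 forces the output weights to be of the same order of magnitude, whereas with targets scaled to [0, 1] weights of order one are enough.

    import numpy as np

    h = np.array([0.3, -0.7, 0.5])            # example tanh activations, all in [-1, 1]

    # Unscaled target ~500: the output weights must be in the hundreds.
    w_raw = np.array([200.0, -150.0, 400.0])
    print(h @ w_raw)                           # 60 + 105 + 200 = 365, same order as 500

    # Target scaled to [0, 1] (say 0.8): weights of order one suffice.
    w_scaled = np.array([0.4, -0.3, 0.8])
    print(h @ w_scaled)                        # 0.12 + 0.21 + 0.40 = 0.73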

Nagabhushan Baddi
  • Sorry for my late reply, but thanks for your answer. I tried scaling the output values and then descaling them, using the range [0,1] for sigmoid and [-1,1] for tanh. The difference in accuracy is not huge, but the model never seems to diverge with scaled outputs, whereas it often diverges when I use non-scaled values. – Paul Rolin Jul 01 '16 at 18:27
  • Instead of this, what if we use a ReLU layer at the end? Will that make any difference? – faizan Jul 12 '18 at 14:34