
I am writing a basic neural network in Java and am currently implementing the activation functions (so far just the sigmoid function). I am trying to use doubles (as opposed to BigDecimal) in the hope that training will take a reasonable amount of time. However, I've noticed that the function doesn't work with larger inputs. Currently my function is:

public static double sigmoid(double t) {
    return (1 / (1 + Math.pow(Math.E, -t)));
}

This function returns reasonably precise values all the way down to t = -100, but when t >= 37 the function returns exactly 1.0. In a typical neural network where the input is normalized, is this fine? Will a neuron ever get inputs summing over ~37? If the size of the sum of inputs fed into the activation function varies from NN to NN, what are some of the factors that affect it? Also, is there any way to make this function more precise? Is there an alternative that is more precise and/or faster?
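For illustration, here is a minimal, self-contained check of where the saturation starts (the class name is just for this example). e^-37 is roughly 8.5e-17, which is below half of a double's machine epsilon (about 1.1e-16), so 1 + e^-t rounds to exactly 1.0 once t reaches 37:

// Illustrative check of where the double-precision sigmoid saturates.
public class SigmoidSaturation {

    public static double sigmoid(double t) {
        return 1 / (1 + Math.pow(Math.E, -t));
    }

    public static void main(String[] args) {
        for (int t = 34; t <= 38; t++) {
            System.out.println("t = " + t
                    + ", e^-t = " + Math.pow(Math.E, -t)
                    + ", sigmoid(t) = " + sigmoid(t));
        }
        // From t = 37 onwards, e^-t is too small to change 1.0,
        // so sigmoid(t) prints exactly 1.0.
    }
}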

Dylan Siegler

2 Answers


The surprising answer is that double actually gives you more precision than you need. This blog article by Pete Warden claims that even 8 bits are enough precision. And this is not just an academic idea: NVIDIA's new Pascal chips emphasize their single-precision performance above everything else, because that is what matters for deep learning training.

You should be normalizing your input neuron values. If extreme values still happen, it is fine to set them to -1 or +1. In fact, this answer shows doing that explicitly. (Other answers on that question are also interesting, e.g. the suggestion to just pre-calculate 100 or so values and not use Math.exp() or Math.pow() at all!)
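A rough sketch of that precomputed-table idea, assuming a table of 1000 entries over [-8, 8] with hard clamping outside that range (the table size, range, and class name here are arbitrary choices for illustration, not taken from the linked answer):

// Sketch of a precomputed sigmoid lookup table.
// Table size and input range are arbitrary choices for this example.
public class SigmoidTable {

    private static final int SIZE = 1000;
    private static final double RANGE = 8.0;   // table covers [-RANGE, +RANGE]
    private static final double[] TABLE = new double[SIZE + 1];

    static {
        for (int i = 0; i <= SIZE; i++) {
            double t = -RANGE + 2 * RANGE * i / SIZE;
            TABLE[i] = 1.0 / (1.0 + Math.exp(-t));
        }
    }

    public static double sigmoid(double t) {
        // Outside the table range the sigmoid is effectively 0 or 1 anyway.
        if (t <= -RANGE) return 0.0;
        if (t >= RANGE) return 1.0;
        int index = (int) Math.round((t + RANGE) / (2 * RANGE) * SIZE);
        return TABLE[index];
    }
}

The lookup trades a little accuracy (one table step is about 1/1000 of the covered input range) for avoiding Math.exp() or Math.pow() on every call.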

Darren Cook

Yes, in a normalized network double is fine to use. But this depends on your input: if your input layer is bigger, the sum of inputs will of course be bigger.

I have encountered the same problem in C++: once t becomes big, the runtime effectively drops the e^-t term and returns plain 1.0, since it ends up calculating just the 1/1 part. I tried dividing the already normalized input by 1000 to 1000000, and it worked sometimes, but sometimes it did not, since I was using randomized input for the first epoch and my input layer was a 784x784 matrix. Nevertheless, if your input layer is small and your input is normalized, this will help you.
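As a rough sketch of what I mean by normalizing, something along these lines (plain min-max scaling to [0, 1]; the method name is just for this example):

// Min-max normalization of an input vector to [0, 1] before it is fed
// into the network. The method name is illustrative.
public static double[] normalize(double[] input) {
    double min = Double.POSITIVE_INFINITY;
    double max = Double.NEGATIVE_INFINITY;
    for (double v : input) {
        min = Math.min(min, v);
        max = Math.max(max, v);
    }
    double range = max - min;
    double[] result = new double[input.length];
    for (int i = 0; i < input.length; i++) {
        // If every value is identical, map everything to 0.5.
        result[i] = (range == 0) ? 0.5 : (input[i] - min) / range;
    }
    return result;
}

With inputs scaled like this (and reasonably small initial weights), the weighted sums reaching the sigmoid are much less likely to hit the saturation region.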