It sounds like you are performing a regression task since you describe your final output as, "the untransformed actual value (y) (which can be any number as it is not been subjected to the Relu activation function)."
In that case, you will not use an activation function on the final output layer of the network because, just as you point out, the prediction is not meant to be constrained to any particular activated region of the real numbers: it is allowed to be any real number. The model uses the gradient of the loss function to adjust the parameters in the earlier layers so that this unconstrained final output becomes accurate.
For an example, see the Basic Regression TensorFlow Keras tutorial. You can see from the model layer definitions:
def build_model():
  model = keras.Sequential([
    layers.Dense(64, activation=tf.nn.relu, input_shape=[len(train_dataset.keys())]),
    layers.Dense(64, activation=tf.nn.relu),
    layers.Dense(1)
  ])

  optimizer = tf.train.RMSPropOptimizer(0.001)

  model.compile(loss='mse',
                optimizer=optimizer,
                metrics=['mae', 'mse'])
  return model
It uses a mean-squared-error loss, and the final layer is just a plain Dense(1), with no activation.
In cases where the output is a classification prediction (binary, multi-class, or multi-label), you will still apply an activation to the final layer, and it will transform the raw values into relative scores that indicate the model's prediction for each category.
So, for example, if you wanted to predict a label in a 4-category task, your output layer would be something like Dense(4, activation=tf.nn.softmax), where the softmax activation converts the raw values of those 4 neurons into relative scores.
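To make that conversion concrete, here is a minimal NumPy sketch of what softmax does to the raw outputs ("logits") of those 4 neurons. The particular logit values are made up for illustration:

```python
import numpy as np

# Hypothetical raw outputs from the 4 neurons of a Dense(4) layer.
logits = np.array([2.0, 1.0, 0.1, -1.0])

def softmax(z):
    # Subtracting the max before exponentiating improves numerical
    # stability and does not change the result mathematically.
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = softmax(logits)
print(scores)           # four positive values, largest where the logit is largest
print(scores.sum())     # 1.0 -- the scores form a probability distribution
print(scores.argmax())  # index of the predicted category
```

The ordering of the scores always matches the ordering of the logits, so the highest-scoring neuron is also the one with the largest raw value.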
It's typical to take the highest-scoring output neuron as the predicted category label. Categorical loss functions such as cross-entropy, however, use the relative scores of all the neurons, assigning loss in proportion to how far the prediction was from being correct. This is unlike a 0-1 loss, which gives maximum loss for any incorrect prediction regardless of how close or far it was from being correct.
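A small numeric sketch makes the difference visible. The score vectors below are made-up softmax outputs for a 4-category example whose true label is index 2; cross-entropy for a single example is just the negative log of the score assigned to the true category:

```python
import numpy as np

def cross_entropy(probs, true_index):
    # Categorical cross-entropy for one example: -log(p_true)
    return -np.log(probs[true_index])

# True category is index 2. Three hypothetical softmax outputs:
confident_right = np.array([0.05, 0.05, 0.85, 0.05])  # correct, confident
nearly_right    = np.array([0.45, 0.05, 0.40, 0.10])  # wrong argmax, but close
confident_wrong = np.array([0.85, 0.05, 0.05, 0.05])  # wrong and confident

print(cross_entropy(confident_right, 2))  # small loss
print(cross_entropy(nearly_right, 2))     # moderate loss
print(cross_entropy(confident_wrong, 2))  # large loss
```

Under a 0-1 loss, `nearly_right` and `confident_wrong` would both score the maximum loss of 1 (both pick the wrong argmax), whereas cross-entropy penalizes the confidently wrong prediction far more heavily.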