It sounds like you are performing a regression task since you describe your final output as, "the untransformed actual value (y) (which can be any number as it is not been subjected to the Relu activation function)."
In that case, you will not use an activation function on the final output layer of the network because, just as you point out, the prediction is not meant to be constrained to any particular activated region of the real numbers: it is allowed to be any real number. The model uses the gradient of the loss function to adjust the parameters in the earlier layers so that this unconstrained final output becomes accurate.
For an example, see the Basic Regression TensorFlow Keras tutorial. You can see from the model layer definitions:
def build_model():
  model = keras.Sequential([
    layers.Dense(64, activation=tf.nn.relu, input_shape=[len(train_dataset.keys())]),
    layers.Dense(64, activation=tf.nn.relu),
    layers.Dense(1)
  ])

  optimizer = tf.train.RMSPropOptimizer(0.001)

  model.compile(loss='mse',
                optimizer=optimizer,
                metrics=['mae', 'mse'])
  return model
It uses a mean-squared-error loss, and the final layer is just a plain Dense(1), with no activation.
In cases where the output is a classification prediction (binary, multi-class, or multi-label), you will still apply an activation to the final layer, and it will transform the raw values into relative scores that indicate the model's prediction for each category.
So, for example, if you wanted to predict a label in a 4-category task, your output layer would be something like Dense(4, activation=tf.nn.softmax), where the softmax activation converts the raw values of those 4 neurons into relative scores.
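To make that conversion concrete, here is a minimal NumPy sketch of what softmax does to the raw outputs ("logits") of those 4 neurons. The particular logit values are made up for illustration:

```python
import numpy as np

# Hypothetical raw outputs from the 4 neurons of a Dense(4) layer.
logits = np.array([2.0, 1.0, 0.1, -1.0])

def softmax(z):
    # Subtracting the max before exponentiating improves numerical
    # stability and does not change the result mathematically.
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = softmax(logits)
print(scores)           # four positive values, largest where the logit is largest
print(scores.sum())     # 1.0 -- the scores form a probability distribution
print(scores.argmax())  # index of the predicted category
```

The ordering of the scores always matches the ordering of the logits, so the highest-scoring neuron is also the one with the largest raw value.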
It's typical to take the highest-scoring output neuron as the predicted category label. Categorical loss functions such as cross-entropy, however, use the relative scores of all the neurons, assigning loss in proportion to how far the prediction was from being correct. This is unlike a 0-1 loss, which gives maximum loss for any incorrect prediction regardless of how close or far it was from being correct.
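A small numeric sketch makes the difference visible. The score vectors below are made-up softmax outputs for a 4-category example whose true label is index 2; cross-entropy for a single example is just the negative log of the score assigned to the true category:

```python
import numpy as np

def cross_entropy(probs, true_index):
    # Categorical cross-entropy for one example: -log(p_true)
    return -np.log(probs[true_index])

# True category is index 2. Three hypothetical softmax outputs:
confident_right = np.array([0.05, 0.05, 0.85, 0.05])  # correct, confident
nearly_right    = np.array([0.45, 0.05, 0.40, 0.10])  # wrong argmax, but close
confident_wrong = np.array([0.85, 0.05, 0.05, 0.05])  # wrong and confident

print(cross_entropy(confident_right, 2))  # small loss
print(cross_entropy(nearly_right, 2))     # moderate loss
print(cross_entropy(confident_wrong, 2))  # large loss
```

Under a 0-1 loss, `nearly_right` and `confident_wrong` would both score the maximum loss of 1 (both pick the wrong argmax), whereas cross-entropy penalizes the confidently wrong prediction far more heavily.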