I am fairly new to loss functions and I have a multi-label classification problem with 800 independent binary outputs (800 neurons at the output that do not affect each other; the target for each is either 0 or 1). Looking at the documentation here: https://www.tensorflow.org/api_docs/python/tf/nn/sigmoid_cross_entropy_with_logits
It seems that the network is expected to produce "logits", i.e. raw outputs with a linear activation, and the sigmoid needed for the binary classification is applied inside the loss function.
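To make the question concrete, here is a minimal sketch of what I understand the documented approach to be (TF 2.x; the shapes and variable names are just placeholders I made up):

```python
import tensorflow as tf

# Placeholder sizes for illustration only.
batch_size, num_features, num_labels = 32, 128, 800

features = tf.random.normal([batch_size, num_features])
labels = tf.cast(tf.random.uniform([batch_size, num_labels]) > 0.5, tf.float32)

# Final layer has NO activation -- it produces raw logits.
dense = tf.keras.layers.Dense(num_labels)  # activation=None by default
logits = dense(features)

# The sigmoid is applied *inside* the loss function.
loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)
loss = tf.reduce_mean(loss)
```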
Looking at the loss function for the softmax activation (tf.nn.softmax_cross_entropy_with_logits), a similar approach is used. I am wondering why the activation function is not applied directly to the network outputs, so that the loss function would receive probabilities. Why does the loss function instead take the linear outputs (logits) and apply the activation internally?
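For comparison, this is the alternative I was expecting, with the sigmoid as the last activation of the network and the cross-entropy computed by hand on the resulting probabilities (again a hypothetical sketch, not something from the docs):

```python
import tensorflow as tf

batch_size, num_labels = 32, 800
labels = tf.cast(tf.random.uniform([batch_size, num_labels]) > 0.5, tf.float32)
logits = tf.random.normal([batch_size, num_labels])  # stand-in for network output

# Activation applied in the network rather than in the loss.
probs = tf.sigmoid(logits)

# Manual binary cross-entropy on probabilities.
eps = 1e-7  # clip to avoid log(0)
manual_loss = -tf.reduce_mean(
    labels * tf.math.log(probs + eps)
    + (1.0 - labels) * tf.math.log(1.0 - probs + eps)
)
```

Is there a reason the first version (logits passed to the loss) is preferred over this one?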