Your question is about the fundamentals of neural networks, so I strongly suggest you start here (Michael Nielsen's book).
It is a Python-oriented book with graphical, textual, and mathematical explanations - great for beginners. I am confident you will find it useful. Look at chapters 2 and 3 to address your problems.
Addressing your question about sigmoids: it is possible to use them for multiclass predictions, but it is not recommended. Consider the following facts.
Sigmoids are activation functions of the form $\sigma(z) = \frac{1}{1 + e^{-z}}$, where $z$ is the dot product of the previous hidden layer's activations (or the inputs) with a row of the weight matrix, plus a bias (reminder: $z = w_i \cdot x + b$, where $w_i$ is the $i$-th row of the weight matrix). Each such activation is therefore independent of the other rows of the matrix.
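To make that independence concrete, here is a minimal NumPy sketch (the layer sizes and random weights are made up for illustration): each output activation is computed from its own row $w_i$ only, with no coupling between the output neurons.

```
import numpy as np

def sigmoid(z):
    """Element-wise sigmoid: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes: 4 inputs, 3 output neurons (one per class).
x = np.array([0.5, -1.2, 0.3, 0.8])   # previous-layer activations
W = np.random.randn(3, 4)             # one row per output neuron
b = np.random.randn(3)                # one bias per output neuron

z = W @ x + b       # z_i = w_i . x + b_i, each row used independently
a = sigmoid(z)      # each activation depends only on its own row w_i
print(a)            # three values in (0, 1), not coupled to each other
```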
Classification tasks are about categories. Without any prior knowledge (and even with it), most of the time the categories have no order-value interpretation; predicting `apple` instead of `orange` is no worse than predicting `banana` instead of `nuts`. Therefore, a one-hot encoding of the categories usually performs better than predicting a category number with a single activation function.
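As a quick illustration (the label set here is just the hypothetical one from above), a one-hot target simply marks the correct class, so no class is numerically "closer" to any other:

```
import numpy as np

# Hypothetical label set; the index assignment is arbitrary, not an ordering.
categories = ["apple", "orange", "banana", "nuts"]

def one_hot(label, categories):
    """Return a vector with a 1 at the label's index and 0 elsewhere."""
    vec = np.zeros(len(categories))
    vec[categories.index(label)] = 1.0
    return vec

print(one_hot("banana", categories))   # [0. 0. 1. 0.]
```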
To recap: we want an output layer whose number of neurons equals the number of categories, and sigmoids are independent of each other given the previous layer's values. We would also like to predict the most probable category, which implies that we want the activations of the output layer to have the meaning of a probability distribution. But sigmoids are not guaranteed to sum to 1, while the softmax activation is.
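A small sketch of that difference (the pre-activation values are made up): the sigmoid outputs of an output layer do not form a distribution, while the softmax outputs of the same pre-activations do.

```
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))   # shift by the max for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])   # example pre-activations of the output layer
print(sigmoid(z).sum())         # ~2.14, not a probability distribution
print(softmax(z).sum())         # 1.0, a proper distribution over the classes
print(softmax(z))               # ~[0.66, 0.24, 0.10]
```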
Using an L2 loss function is also problematic, due to the vanishing-gradient issue. In short, the derivative of the loss with respect to $z$ is $(\sigma(z) - y)\,\sigma'(z)$ (the error times the derivative of the activation), which makes this quantity small, even more so when the sigmoid is close to saturation. You can choose cross-entropy instead, or a log-loss.
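A small numeric sketch of that point (the values are illustrative): with an L2 loss the gradient with respect to $z$ carries the $\sigma'(z)$ factor, while with cross-entropy that factor cancels, so a saturated-but-wrong output neuron still receives a large gradient.

```
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A badly wrong, saturated neuron: target is 1 but the pre-activation is very negative.
z, y = -6.0, 1.0
a = sigmoid(z)                    # ~0.0025, deep in the flat region of the sigmoid

grad_l2 = (a - y) * a * (1 - a)   # L2 loss: error times sigma'(z) -> ~ -0.0025
grad_ce = (a - y)                 # cross-entropy: sigma'(z) cancels -> ~ -1.0

print(grad_l2, grad_ce)           # the L2 gradient has almost vanished, cross-entropy's has not
```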
EDIT:
Corrected the phrasing about ordering the categories. To clarify, classification is a general term for many tasks related to what we use today as categorical predictions over definite, finite sets of values. As of today, using softmax in deep models to predict these categories in a general "dog/cat/horse" classifier, together with one-hot encoding and cross-entropy, is very common practice. It is reasonable to use that setup when the assumptions above hold. However, there are (many) cases where it doesn't apply, for instance when trying to balance the data. For some tasks, e.g. semantic segmentation, the categories (or their embeddings) can have a meaningful ordering or distance between them. So please choose the tools for your application wisely, understanding what they are doing mathematically and what their implications are.