Your question is about the fundamentals of neural networks, so I strongly suggest you start here (Michael Nielsen's book).
It is a Python-oriented book with graphical, textual, and mathematical explanations - great for beginners. I am confident you will find it useful. Look at chapters 2 and 3 to address your problems.
Addressing your question about sigmoids: it is possible to use them for multiclass predictions, but it is not recommended. Consider the following facts.
Sigmoids are activation functions of the form $\sigma(z) = \frac{1}{1 + e^{-z}}$, where $z$ is the dot product of the previous hidden layer's activations (or the inputs) with a row of the weight matrix, plus a bias (reminder: $z = w_i \cdot x + b$, where $w_i$ is the $i$-th row of the weight matrix). Each such activation is therefore independent of the other rows of the matrix.
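To make that independence concrete, here is a minimal NumPy sketch (the layer sizes and random weights are made up for illustration): each output activation is computed from its own row $w_i$ only, with no coupling between the output neurons.

```
import numpy as np

def sigmoid(z):
    """Element-wise sigmoid: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes: 4 inputs, 3 output neurons (one per class).
x = np.array([0.5, -1.2, 0.3, 0.8])   # previous-layer activations
W = np.random.randn(3, 4)             # one row per output neuron
b = np.random.randn(3)                # one bias per output neuron

z = W @ x + b       # z_i = w_i . x + b_i, each row used independently
a = sigmoid(z)      # each activation depends only on its own row w_i
print(a)            # three values in (0, 1), not coupled to each other
```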
Classification tasks are about categories. Without any prior knowledge (and even with it), most of the time the categories have no order-value interpretation; predicting `apple` instead of `orange` is no worse than predicting `banana` instead of `nuts`. Therefore, a one-hot encoding of the categories usually performs better than predicting a category number with a single activation function.
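As a quick illustration (the label set here is just the hypothetical one from above), a one-hot target simply marks the correct class, so no class is numerically "closer" to any other:

```
import numpy as np

# Hypothetical label set; the index assignment is arbitrary, not an ordering.
categories = ["apple", "orange", "banana", "nuts"]

def one_hot(label, categories):
    """Return a vector with a 1 at the label's index and 0 elsewhere."""
    vec = np.zeros(len(categories))
    vec[categories.index(label)] = 1.0
    return vec

print(one_hot("banana", categories))   # [0. 0. 1. 0.]
```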
To recap: we want an output layer whose number of neurons equals the number of categories, and sigmoids are independent of each other given the previous layer's values. We would also like to predict the most probable category, which implies that we want the activations of the output layer to have the meaning of a probability distribution. But sigmoids are not guaranteed to sum to 1, while the softmax activation is.
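A small sketch of that difference (the pre-activation values are made up): the sigmoid outputs of an output layer do not form a distribution, while the softmax outputs of the same pre-activations do.

```
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))   # shift by the max for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])   # example pre-activations of the output layer
print(sigmoid(z).sum())         # ~2.14, not a probability distribution
print(softmax(z).sum())         # 1.0, a proper distribution over the classes
print(softmax(z))               # ~[0.66, 0.24, 0.10]
```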
Using an L2 loss function is also problematic, due to the vanishing-gradient issue. In short, the derivative of the loss with respect to $z$ is $(\sigma(z) - y)\,\sigma'(z)$ (the error times the derivative of the activation), which makes this quantity small, even more so when the sigmoid is close to saturation. You can choose cross-entropy instead, or a log-loss.
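A small numeric sketch of that point (the values are illustrative): with an L2 loss the gradient with respect to $z$ carries the $\sigma'(z)$ factor, while with cross-entropy that factor cancels, so a saturated-but-wrong output neuron still receives a large gradient.

```
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A badly wrong, saturated neuron: target is 1 but the pre-activation is very negative.
z, y = -6.0, 1.0
a = sigmoid(z)                    # ~0.0025, deep in the flat region of the sigmoid

grad_l2 = (a - y) * a * (1 - a)   # L2 loss: error times sigma'(z) -> ~ -0.0025
grad_ce = (a - y)                 # cross-entropy: sigma'(z) cancels -> ~ -1.0

print(grad_l2, grad_ce)           # the L2 gradient has almost vanished, cross-entropy's has not
```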
EDIT:
Corrected the phrasing about ordering the categories. To clarify, classification is a general term for many tasks related to what we use today as categorical predictions over definite, finite sets of values. As of today, using softmax in deep models to predict these categories in a general "dog/cat/horse" classifier, together with one-hot encoding and cross-entropy, is very common practice. It is reasonable to use that setup when the assumptions above hold. However, there are (many) cases where it doesn't apply, for instance when trying to balance the data. For some tasks, e.g. semantic segmentation, the categories (or their embeddings) can have a meaningful ordering or distance between them. So please choose the tools for your application wisely, understanding what they are doing mathematically and what their implications are.