
I'm trying to implement the softmax function for a neural network written in NumPy. Let h_i be the softmax value of a given signal i.

softmax function: h_i = exp(z_i) / Σ_k exp(z_k), where z is the input signal

I've struggled to implement the softmax activation function's partial derivative.

the softmax partial derivative: ∂h_i/∂z_j = h_i * (1 - h_i) when i = j, and -h_i * h_j when i ≠ j

I'm currently stuck on an issue where all the partial derivatives approach 0 as training progresses. I've cross-referenced my math with this excellent answer, but my math does not seem to work out.

import numpy as np
def softmax_function( signal, derivative=False ):
    # Calculate activation signal
    e_x = np.exp( signal )
    signal = e_x / np.sum( e_x, axis = 1, keepdims = True )

    if derivative:
        # Return the partial derivation of the activation function
        return np.multiply( signal, 1 - signal ) + sum(
            # handle the off-diagonal values
            - signal * np.roll( signal, i, axis = 1 )
            for i in range(1, signal.shape[1] )
        )
    else:
        # Return the activation signal
        return signal
#end activation function

The signal parameter contains the input signal sent into the activation function and has the shape (n_samples, n_features).

# sample signal (3 samples, 3 features)
signal = [[0.3394572666491664, 0.3089068053925853, 0.3516359279582483], [0.33932706934615525, 0.3094755563319447, 0.3511973743219001], [0.3394407172182317, 0.30889042266755573, 0.35166886011421256]]
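
As an aside (and probably unrelated to the vanishing derivatives), exponentiating raw signals can overflow for large inputs, much like the overflow the sigmoid snippet below guards against with np.clip. A common hedge is to subtract the row-wise maximum before exponentiating, which leaves the softmax output unchanged; a minimal sketch (the helper name is mine):

import numpy as np

def stable_softmax(signal):
    # Subtracting the per-row max does not change the result (softmax is
    # invariant to adding a constant to each row), but keeps np.exp in range.
    shifted = np.asarray(signal) - np.max(signal, axis=1, keepdims=True)
    e_x = np.exp(shifted)
    return e_x / np.sum(e_x, axis=1, keepdims=True)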

The following code snippet is a fully working activation function and is only included as a reference and as proof (mostly for myself) that the conceptual idea actually works.

from scipy.special import expit
import numpy as np
def sigmoid_function( signal, derivative=False ):
    # Prevent overflow.
    signal = np.clip( signal, -500, 500 )

    # Calculate activation signal
    signal = expit( signal )

    if derivative:
        # Return the partial derivation of the activation function
        return np.multiply(signal, 1 - signal)
    else:
        # Return the activation signal
        return signal
#end activation function

Edit

  • The problem persists even with simple single-layer networks. The softmax (and its derivative) is applied at the final layer.

2 Answers


This is an answer on how to calculate the derivative of the softmax function in a more vectorized NumPy fashion. However, the fact that the partial derivatives approach zero might not be a math issue; it might just be a matter of the learning rate or the well-known dying-weight problem with complex deep neural networks. Layers like ReLU help prevent the latter issue.


First, I've used the following signal (just duplicating your last entry) to make it 4 samples x 3 features, so it is easier to see what is going on with the dimensions.

>>> signal = np.asarray([[0.3394572666491664, 0.3089068053925853, 0.3516359279582483], [0.33932706934615525, 0.3094755563319447, 0.3511973743219001], [0.3394407172182317, 0.30889042266755573, 0.35166886011421256], [0.3394407172182317, 0.30889042266755573, 0.35166886011421256]])
>>> signal.shape
(4, 3)

Next, you want to compute the Jacobian matrix of your softmax function. According to the cited page it is defined as -h_i * h_j for the off-diagonal entries (the majority of the matrix for n_features > 2), so let's start there. In NumPy, you can efficiently calculate that Jacobian matrix using broadcasting:

>>> J = - signal[..., None] * signal[:, None, :]
>>> J.shape
(4, 3, 3)

The first signal[..., None] (equivalent to signal[:, :, None]) reshapes the signal to (4, 3, 1), while the second signal[:, None, :] reshapes the signal to (4, 1, 3). Then, the * just multiplies both matrices element-wise. NumPy's internal broadcasting repeats both matrices to form the n_features x n_features matrix for every sample.
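
If the broadcasting is hard to picture, the same per-sample outer product can be spelled out with np.einsum; a small self-contained check (the example array here is arbitrary):

import numpy as np

# arbitrary (4, 3) softmax-like output, one row per sample
s = np.asarray([[0.2, 0.3, 0.5],
                [0.1, 0.6, 0.3],
                [0.4, 0.4, 0.2],
                [0.25, 0.25, 0.5]])

outer = s[:, :, None] * s[:, None, :]                  # (4, 3, 3) via broadcasting
outer_einsum = np.einsum('mi,mj->mij', s, s)           # same outer product, index by index
print(outer.shape, np.allclose(outer, outer_einsum))   # (4, 3, 3) True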

Then, we need to fix the diagonal elements:

>>> iy, ix = np.diag_indices_from(J[0])
>>> J[:, iy, ix] = signal * (1. - signal)

The above lines extract the diagonal indices for an n_features x n_features matrix. It is equivalent to doing iy = np.arange(n_features); ix = np.arange(n_features). Then, they replace the diagonal entries with your definition h_i * (1 - h_i).
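
For reference, np.diag_indices_from applied to one 3x3 slice simply returns the paired row/column index arrays, so the assignment above writes into every sample's diagonal in one go:

import numpy as np

iy, ix = np.diag_indices_from(np.empty((3, 3)))
print(iy, ix)  # [0 1 2] [0 1 2]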

Last, according to the linked source, you need to sum across rows for each of the samples. That can be done as:

>>> J = J.sum(axis=1)
>>> J.shape
(4, 3)

Find below a summarized version:

if derivative:
    J = - signal[..., None] * signal[:, None, :] # off-diagonal Jacobian
    iy, ix = np.diag_indices_from(J[0])
    J[:, iy, ix] = signal * (1. - signal) # diagonal
    return J.sum(axis=1) # sum across-rows for each sample
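
If you want to sanity-check the stacked Jacobian itself (before the row sum), a central finite-difference comparison is a cheap way to do it. A rough sketch, assuming a row-wise softmax; the helper names here are mine:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z, axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def softmax_jacobian(s):
    # s: softmax output, shape (n_samples, n_features)
    J = -s[..., None] * s[:, None, :]
    iy, ix = np.diag_indices_from(J[0])
    J[:, iy, ix] = s * (1. - s)
    return J                                   # (n_samples, n_features, n_features)

def numerical_jacobian(z, eps=1e-6):
    m, n = z.shape
    J = np.zeros((m, n, n))
    for k in range(n):                         # perturb one input feature at a time
        dz = np.zeros_like(z)
        dz[:, k] = eps
        J[:, :, k] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)
    return J

z = np.random.randn(4, 3)
print(np.allclose(softmax_jacobian(softmax(z)), numerical_jacobian(z), atol=1e-6))  # True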

Comparison of the derivatives:

>>> signal = [[0.3394572666491664, 0.3089068053925853, 0.3516359279582483], [0.33932706934615525, 0.3094755563319447, 0.3511973743219001], [0.3394407172182317, 0.30889042266755573, 0.35166886011421256], [0.3394407172182317, 0.30889042266755573, 0.35166886011421256]]
>>> e_x = np.exp( signal )
>>> signal = e_x / np.sum( e_x, axis = 1, keepdims = True )

Yours:

>>> np.multiply( signal, 1 - signal ) + sum(
        # handle the off-diagonal values
        - signal * np.roll( signal, i, axis = 1 )
        for i in range(1, signal.shape[1] )
    )
array([[  2.77555756e-17,  -2.77555756e-17,   0.00000000e+00],
       [ -2.77555756e-17,  -2.77555756e-17,  -2.77555756e-17],
       [  2.77555756e-17,   0.00000000e+00,   2.77555756e-17],
       [  2.77555756e-17,   0.00000000e+00,   2.77555756e-17]])

Mine:

>>> J = - signal[..., None] * signal[:, None, :]
>>> iy, ix = np.diag_indices_from(J[0])
>>> J[:, iy, ix] = signal * (1. - signal)
>>> J.sum(axis=1)
array([[  4.16333634e-17,  -1.38777878e-17,   0.00000000e+00],
       [ -2.77555756e-17,  -2.77555756e-17,  -2.77555756e-17],
       [  2.77555756e-17,   1.38777878e-17,   2.77555756e-17],
       [  2.77555756e-17,   1.38777878e-17,   2.77555756e-17]])
  • First off, kudos for the super intuitive answer! When timing the code: yours seems to be the most efficient at high sample sizes, while the snippet I provided is more efficient if the dataset has a high number of features. Regardless, the calculated derivatives are "always" in the range of 10^-17 -- in other words, around 0. – jorgenkg Mar 29 '16 at 10:59
  • @jorgenkg it seems, however, that we are not getting the same derivatives. Note that your results for the last sample (the last 2, they are the same) vanish with respect to the 2nd feature, while mine don't (see the edit). Try that version to see if it has better numerical stability and doesn't completely vanish your derivatives. – Imanol Luengo Mar 29 '16 at 11:15
  • You are definitely correct, NumPy shows much better precision with your implementation - the precision issue in my code might arise from the built-in `sum()` call. I would have upvoted you twice – jorgenkg Mar 29 '16 at 11:39
  • @jorgenkg try replacing sum with `np.sum` as it will use numpy and you might not lose precision that way. – Imanol Luengo Mar 29 '16 at 11:45
  • @jorgenkg Uhmm.. tried it. For some weird reason I cannot get the same precision with your approach, although I have verified that both methods perform exactly the same operations. – Imanol Luengo Mar 29 '16 at 11:51
  • Yes, my best guess is that the internal multiplications & additions in the matrix operations have a higher accuracy than the more abstract summation calls. However, I've unfortunately not figured out what the problem is – jorgenkg Mar 29 '16 at 17:17
  • I am a bit confused. Does this mean https://datascience.stackexchange.com/questions/29735/how-to-apply-the-gradient-of-softmax-in-backprop is wrong? – harveyslash Jul 06 '18 at 05:12
  • Also, it is hard for me to believe that the softmax gradient values are so tiny. Softmax is generally used with cross entropy, and the values of the cross entropy loss are usually < 10, which means the gradients of the layer before softmax will be ~4.16333634e-16, which is tiny. The network will barely learn. Where am I going wrong? – harveyslash Jul 06 '18 at 07:06

@Imanol Luengo's solution is wrong the moment he takes sums across rows.

@Harveyslash also makes a good point: he noticed extremely low "gradients" in his solution; with gradients that small the NN won't learn, or will learn in the wrong direction.

We have 4 samples x 3 inputs

The thing is that softmax is not a scalar function that takes just 1 input, but 3 in this case. Remember that all softmax outputs in one sample must add up to one, and therefore you can't compute the value of one without knowing the others? This implies that the gradient must be a square matrix, because we have to account for the partial derivative of every output with respect to every input.

TLDR: The output of the gradient of softmax(sample) in this case must be a 3x3 matrix.
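
To make that 3x3 structure concrete, here is one sample's Jacobian written out as an explicit double loop (purely illustrative, not how you would compute it in practice). It also shows why collapsing the Jacobian with a plain sum lands near machine epsilon:

import numpy as np

h = np.array([0.34, 0.31, 0.35])            # softmax output of one sample (sums to 1)
J = np.empty((3, 3))
for i in range(3):
    for j in range(3):
        # J[i, j] = d h_i / d z_j
        J[i, j] = h[i] * (1. - h[i]) if i == j else -h[i] * h[j]

# Because the outputs sum to 1, every column of J sums to (analytically) exactly 0,
# which is why summing the Jacobian gives values around 1e-17.
print(J.sum(axis=0))                        # ~[0. 0. 0.]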

This is correct:

J = - signal[..., None] * signal[:, None, :] # off-diagonal Jacobian
iy, ix = np.diag_indices_from(J[0])
J[:, iy, ix] = signal * (1. - signal) # diagonal

Up until this point Imanol uses fast vector operations to compute the Jacobian of the softmax function at 4 points, resulting in a 3x3 matrix stacked 4 times: 4 x 3 x 3.


Now I think what the OP really wants is dJdZ, the first step in ANN backprop:

dJdZ(4x3) = dJdy(4x3) * gradSoftmax[layer signal(4x3)](?,?)

The problem is that we usually (with sigmoid, ReLU, ... any scalar activation function) can compute the gradient as a stacked vector and then multiply it element-wise with dJdy, but here we have a stacked matrix. How can we marry the two concepts?

The vector can be seen as the non-zero elements of a diagonal matrix -> all this time we have been getting away with easy element-wise multiplication just because our activation function was scalar! For our softmax it's not that simple, and therefore we have to use matrix multiplication.
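
That claim is easy to verify numerically: for a scalar activation the Jacobian is diagonal, and multiplying dJdy by it as a matrix gives exactly the usual element-wise product (the numbers below are arbitrary):

import numpy as np

dJdy = np.array([0.2, -0.5, 0.3])            # upstream gradient for one sample
g = np.array([0.19, 0.24, 0.15])             # element-wise activation derivative, e.g. sigmoid'(z)

elementwise = dJdy * g                       # the usual shortcut
via_matrix = dJdy @ np.diag(g)               # same thing through the (diagonal) Jacobian
print(np.allclose(elementwise, via_matrix))  # True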

dJdZ(4x3) = dJdy(4 stacked 1x3) * anygradient[layer signal(4x3)](4 stacked 3x3)

Now we multiply each 1x3 dJdy vector by the corresponding 3x3 gradient, for each of the 4 samples. Plain broadcasting won't do this, so we need to specify along which dimensions we multiply (use np.einsum). End result:

#For reference: 'mnr,mrk->mk' contracts 4x1x3 with 4x3x3 -> 4x3, a vector-matrix product per sample
dJdZ = np.einsum('mnr,mrk->mk', dJdy[:,None,:], gradient)
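
As a quick check, the einsum above should agree with an explicit per-sample vector-matrix product; a short verification sketch with arbitrary arrays:

import numpy as np

dJdy = np.random.randn(4, 3)                 # upstream gradient, one row per sample
gradient = np.random.randn(4, 3, 3)          # stacked Jacobians, one 3x3 per sample

dJdZ = np.einsum('mnr,mrk->mk', dJdy[:, None, :], gradient)

# explicit per-sample vector-matrix product for comparison
looped = np.stack([dJdy[m] @ gradient[m] for m in range(4)])
print(np.allclose(dJdZ, looped))             # True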