
I have a softmax layer (only the activation itself, without the linear part that multiplies inputs by weights), and I want to implement the backward pass for it.

I have found many tutorials/answers on SO that deal with it, but they all seem to use X as a (1, n_inputs) vector. I want to use it as an (n_samples, n_inputs) array and still have a correct vectorized implementation of the forward/backward pass.

I have written the following forward pass, normalizing the output for each row/sample (is it correct?):

import numpy as np

X = np.asarray([
    [0.0, 0.0],
    [0.0, 1.0],
    [1.0, 0.0],
    [1.0, 1.0]], dtype=np.float32)

def prop(self, X):
    # Exponentiate, then divide each row by its sum so every row sums to 1
    s = np.exp(X)
    s = s.T / np.sum(s, axis=1)
    return s.T
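
For reference, the same row-wise normalization can also be written without the transposes by using keepdims; the subtraction of the row maximum is only for numerical stability and does not change the output (this variant is just an equivalent sketch, not the code I am asking about):

def prop_keepdims(X):
    # Subtract the per-row max before exponentiating (stability only; it cancels in the normalization)
    e = np.exp(X - np.max(X, axis=1, keepdims=True))
    return e / np.sum(e, axis=1, keepdims=True)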

It gives me the final result of forward propagation (including other layers) as:

Y = np.asarray([
       [0.5       , 0.5       ],
       [0.87070241, 0.12929759],
       [0.97738616, 0.02261384],
       [0.99200957, 0.00799043]], dtype=np.float32)

So, this is the output of the softmax, if it is correct. Now, how should I write the backward pass?

I have derived the derivative of the softmax to be:

1) if i=j: p_i*(1 - p_j),

2) if i!=j: -p_i*p_j,

where p_i = exp(x_i) / sum_k exp(x_k) is the i-th softmax output for a given sample.
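
Collecting the two cases, the per-sample Jacobian can be written compactly (my own restatement of the rules above):

dp_i/dx_j = p_i * (delta_ij - p_j),  i.e.  J = diag(p) - p * p^T  per sample, an (n_inputs, n_inputs) matrix.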

I've tried to compute the derivative as:

ds = np.diag(Y.flatten()) - np.outer(Y, Y) 

But it results in an 8x8 matrix (the 4 samples and 2 inputs get flattened into a single length-8 vector), which does not make sense for the following backpropagation... What is the correct way to write it?


2 Answers


I've been dealing with the same problem and finally figured out a way to vectorize a batch implementation of the softmax Jacobian. I came up with it myself so I am not sure if it's the optimal way to do it. Here's my idea:

import numpy as np
from scipy.special import softmax

def Jsoftmax(X):
    sh = X.shape
    sm = softmax(X, axis=1)
    DM = sm.reshape(sh[0], -1, 1) * np.eye(sh[1])                       # stack of diagonal matrices diag(p)
    OP = np.matmul(sm.reshape(sh[0], -1, 1), sm.reshape(sh[0], 1, -1))  # stack of outer products p p^T
    Jsm = DM - OP
    return Jsm

It produces a (n_samples, n_inputs, n_inputs)-shaped array, which I think can be used in backpropagation with the np.matmul function to properly premultiply by your dJ_dA array.
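
For example, a minimal sketch of how that Jacobian stack could be applied in the backward pass (reusing the Jsoftmax above; dJ_dA is assumed to have shape (n_samples, n_inputs), and the function name is just illustrative):

def softmax_backward(X, dJ_dA):
    J = Jsoftmax(X)                                # (n_samples, n_inputs, n_inputs)
    # Batched matrix-vector product: dJ_dX[i] = J[i] @ dJ_dA[i]
    # (the softmax Jacobian is symmetric, so the orientation does not matter)
    return np.matmul(J, dJ_dA[..., None])[..., 0]  # (n_samples, n_inputs)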

It should be noted that softmax is almost exclusively used as the last layer, commonly with a cross-entropy loss as the objective function. In that case, the derivative of the objective function with respect to the softmax inputs can be found more efficiently as (S - Y)/m, where m is the number of examples in the batch, Y are your batch's labels, and S are your softmax outputs. This is explained in the following link.
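
A minimal sketch of that shortcut, assuming Y holds one-hot labels of shape (n_samples, n_classes) and the loss is the mean cross-entropy over the batch (the function name is illustrative):

import numpy as np
from scipy.special import softmax

def softmax_xent_grad(X, Y):
    S = softmax(X, axis=1)  # softmax outputs, shape (m, k)
    m = X.shape[0]          # batch size
    return (S - Y) / m      # gradient of the mean cross-entropy w.r.t. the softmax inputs X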

Alyona Yavorska

I found this question quite useful when I was writing my softmax function: Softmax derivative in NumPy approaches 0 (implementation). Hope it helps.

Denzel