
I have a softmax layer (only the activation itself, without the linear part that multiplies inputs by weights), and I want to implement the backward pass for it.

I have found many tutorials/answers on SO that deal with it, but they all seem to use X as a (1, n_inputs) vector. I want to use it as an (n_samples, n_inputs) array and still have a correct vectorized implementation of the forward/backward pass.

I have written the following forward pass, normalizing the output for each row/sample (is it correct?):

import numpy as np

X = np.asarray([
    [0.0, 0.0],
    [0.0, 1.0],
    [1.0, 0.0],
    [1.0, 1.0]], dtype=np.float32)

def prop(self, X):
    # Exponentiate, then divide each row by its sum so every row sums to 1
    s = np.exp(X)
    s = s.T / np.sum(s, axis=1)
    return s.T
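
For reference, the same row-wise normalization can also be written without the transposes by using keepdims; the subtraction of the row maximum is only for numerical stability and does not change the output (this variant is just an equivalent sketch, not the code I am asking about):

def prop_keepdims(X):
    # Subtract the per-row max before exponentiating (stability only; it cancels in the normalization)
    e = np.exp(X - np.max(X, axis=1, keepdims=True))
    return e / np.sum(e, axis=1, keepdims=True)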

It gives me the final result of forward propagation (including other layers) as:

Y = np.asarray([
       [0.5       , 0.5       ],
       [0.87070241, 0.12929759],
       [0.97738616, 0.02261384],
       [0.99200957, 0.00799043]], dtype=np.float32)

So, this is the output of the softmax, if it is correct. Now, how should I write the backward pass?

I have derived the derivative of the softmax to be:

1) if i=j: p_i*(1 - p_j),

2) if i!=j: -p_i*p_j,

where p_i = exp(x_i) / sum_k exp(x_k) is the i-th softmax output for a given sample.
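
Collecting the two cases, the per-sample Jacobian can be written compactly (my own restatement of the rules above):

dp_i/dx_j = p_i * (delta_ij - p_j),  i.e.  J = diag(p) - p * p^T  per sample, an (n_inputs, n_inputs) matrix.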

I've tried to compute the derivative as:

ds = np.diag(Y.flatten()) - np.outer(Y, Y) 

But it results in an 8x8 matrix (the 4 samples and 2 inputs get flattened into a single length-8 vector), which does not make sense for the following backpropagation... What is the correct way to write it?


2 Answers


I've been dealing with the same problem and finally figured out a way to vectorize a batch implementation of the softmax Jacobian. I came up with it myself so I am not sure if it's the optimal way to do it. Here's my idea:

import numpy as np
from scipy.special import softmax

def Jsoftmax(X):
    sh = X.shape
    sm = softmax(X, axis=1)
    DM = sm.reshape(sh[0], -1, 1) * np.eye(sh[1])                       # stack of diagonal matrices diag(p)
    OP = np.matmul(sm.reshape(sh[0], -1, 1), sm.reshape(sh[0], 1, -1))  # stack of outer products p p^T
    Jsm = DM - OP
    return Jsm

It produces a (n_samples, n_inputs, n_inputs)-shaped array, which I think can be used in backpropagation with the np.matmul function to properly premultiply by your dJ_dA array.
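
For example, a minimal sketch of how that Jacobian stack could be applied in the backward pass (reusing the Jsoftmax above; dJ_dA is assumed to have shape (n_samples, n_inputs), and the function name is just illustrative):

def softmax_backward(X, dJ_dA):
    J = Jsoftmax(X)                                # (n_samples, n_inputs, n_inputs)
    # Batched matrix-vector product: dJ_dX[i] = J[i] @ dJ_dA[i]
    # (the softmax Jacobian is symmetric, so the orientation does not matter)
    return np.matmul(J, dJ_dA[..., None])[..., 0]  # (n_samples, n_inputs)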

It should be noted that softmax is almost exclusively used as the last layer, commonly with a cross-entropy loss as the objective function. In that case, the derivative of the objective function with respect to the softmax inputs can be found more efficiently as (S - Y)/m, where m is the number of examples in the batch, Y are your batch's labels, and S are your softmax outputs. This is explained in the following link.
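
A minimal sketch of that shortcut, assuming Y holds one-hot labels of shape (n_samples, n_classes) and the loss is the mean cross-entropy over the batch (the function name is illustrative):

import numpy as np
from scipy.special import softmax

def softmax_xent_grad(X, Y):
    S = softmax(X, axis=1)  # softmax outputs, shape (m, k)
    m = X.shape[0]          # batch size
    return (S - Y) / m      # gradient of the mean cross-entropy w.r.t. the softmax inputs X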

Alyona Yavorska

I found this question quite useful when I was writing my softmax function: Softmax derivative in NumPy approaches 0 (implementation). Hope it helps.

Denzel