I'm trying to implement the softmax function for a neural network written in NumPy. Let h be the softmax value of a given signal i.
I've struggled to implement the softmax activation function's partial derivative.
I'm currently stuck at an issue where all the partial derivatives approach 0 as the training progresses. I've cross-referenced my math with this excellent answer, but my math does not seem to work out.
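As I understand it from that answer, what I should be computing is the softmax Jacobian: dh_i/dx_j = h_i * (delta_ij - h_j), i.e. h_i * (1 - h_i) on the diagonal and -h_i * h_j off the diagonal. As a reference for myself (this helper is not part of my network code, just my reading of the formula), the full Jacobian for a single sample would look something like:

import numpy as np

def softmax_jacobian( h ):
    # h: softmax output for one sample, shape (n_features,)
    # J[i, j] = h[i] * (delta_ij - h[j])  =  diag(h) - outer(h, h)
    return np.diag( h ) - np.outer( h, h )

My current attempt, which is where the problem shows up, is: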
import numpy as np

def softmax_function( signal, derivative=False ):
    # Calculate activation signal
    e_x = np.exp( signal )
    signal = e_x / np.sum( e_x, axis = 1, keepdims = True )

    if derivative:
        # Return the partial derivative of the activation function
        return np.multiply( signal, 1 - signal ) + sum(
            # handle the off-diagonal values
            - signal * np.roll( signal, i, axis = 1 )
            for i in xrange(1, signal.shape[1] )
        )
    else:
        # Return the activation signal
        return signal
#end activation function
The signal parameter contains the input signal sent into the activation function and has the shape (n_samples, n_features).
# sample signal (3 samples, 3 features)
signal = [[0.3394572666491664,  0.3089068053925853,  0.3516359279582483],
          [0.33932706934615525, 0.3094755563319447,  0.3511973743219001],
          [0.3394407172182317,  0.30889042266755573, 0.35166886011421256]]
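To make the problem concrete: running the derivative branch on this sample (after converting it to an array) returns values that are numerically zero across the board, which is exactly the vanishing behaviour I described above.

sample = np.array( signal )
partials = softmax_function( sample, derivative=True )
print partials
# every entry is ~0 (on the order of floating-point round-off)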
The following code snippet is a fully working activation function and is only included as a reference and proof (mostly for myself) that the conceptual idea actually works.
from scipy.special import expit
import numpy as np

def sigmoid_function( signal, derivative=False ):
    # Prevent overflow.
    signal = np.clip( signal, -500, 500 )

    # Calculate activation signal
    signal = expit( signal )

    if derivative:
        # Return the partial derivative of the activation function
        return np.multiply(signal, 1 - signal)
    else:
        # Return the activation signal
        return signal
#end activation function
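As an extra sanity check (a quick finite-difference comparison I'm adding here for completeness, not part of the network itself), the analytical sigmoid derivative matches a numerical estimate:

x = np.array( [[0.5, -1.2, 3.0]] )
eps = 1e-6
numerical  = ( sigmoid_function( x + eps ) - sigmoid_function( x - eps ) ) / ( 2 * eps )
analytical = sigmoid_function( x, derivative=True )
print np.allclose( numerical, analytical )
# prints True: the diagonal-only formula really is the sigmoid's derivative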
Edit
- The problem intuitively persists with simple single-layer networks. The softmax (and its derivative) is applied at the final layer (a rough sketch of how this fits together is below).
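Roughly how the pieces fit together in that single-layer case (the variable names, random weights and squared-error delta below are only illustrative, not my exact training code):

X = np.array( signal )                       # (n_samples, n_features) input batch
W = np.random.randn( X.shape[1], 3 ) * 0.01  # illustrative weight matrix
target = np.eye( 3 )                         # dummy one-hot targets, one per sample

net_input = np.dot( X, W )                   # pre-activation of the final layer
output = softmax_function( net_input )       # forward pass through the softmax

# backward pass: the error is multiplied elementwise by the derivative,
# so when the derivative is ~0 the weight gradient collapses to ~0 as well
delta = ( output - target ) * softmax_function( net_input, derivative=True )
gradient = np.dot( X.T, delta )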