
I've built a logistic regression model for binary classification on the Iris dataset (using just two of the labels). The model achieves good performance on all metrics and also passes the gradient check as described by Andrew Ng. But when I change the output activation from sigmoid to softmax and adapt the model for multi-class classification, the performance metrics are still good, yet the model fails the gradient check.

The same pattern holds for a deep neural network: my NumPy implementation passes the gradient check for binary classification but fails for multi-class.

Logistic Regression (Binary):

I chose a row-major layout for my features (number of rows = samples, number of columns = features) rather than a column-major one, just to make the code intuitive to understand and debug.

Dimensions: X = (100, 4); Weights = (5, 1), including the bias row; y = (100, 1)

Algorithm Implementation Code (binary):

import numpy as np

from sklearn.datasets import load_iris, load_digits
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import log_loss
from keras.losses import CategoricalCrossentropy
from scipy.special import softmax


def sigmoid(x):
    # Logistic sigmoid: exp(x) / (1 + exp(x)), equivalent to 1 / (1 + exp(-x))
    return np.exp(x) / (1 + np.exp(x))




dataset = load_iris()
lb = LabelBinarizer()  # not used for binary classification

X = dataset.data
y = dataset.target

# Keep only the first two classes (first 100 samples) and shuffle features and labels together
data = np.concatenate((X[:100], y[:100].reshape(-1, 1)), axis=1)
np.random.shuffle(data)

X_train = data[:, :-1]
X_b = np.c_[np.ones((X_train.shape[0], 1)), X_train]  # prepend a bias column of ones

y_train = data[:, -1].reshape(-1, 1)

num_unique_labels = len(np.unique(y_train))

# Xavier-style initialization; shape (5, 1) for binary classification
Weights = np.random.randn(X_train.shape[1] + 1, num_unique_labels - 1) * np.sqrt(1. / (X_train.shape[1] + 1))

m = X_b.shape[0]

# Forward pass and analytic gradient of the mean log loss
yhat = sigmoid(np.dot(X_b, Weights))
loss = log_loss(y_train, yhat)

error = yhat - y_train
gradient = (1. / m) * (X_b.T.dot(error))

Gradient Checking (binary):

grad = gradient.reshape(-1, 1)
Weights_delta = Weights.reshape(-1, 1)
num_params = Weights_delta.shape[0]

JP = np.zeros((num_params, 1))
JM = np.zeros((num_params, 1))
J_app = np.zeros((num_params, 1))

ep = 1e-7

for i in range(num_params):

    # Perturb one parameter at a time and evaluate the loss on both sides
    Weights_add = np.copy(Weights_delta)
    Weights_add[i] = Weights_add[i] + ep
    Z_add = sigmoid(np.dot(X_b, Weights_add.reshape(X_train.shape[1] + 1, num_unique_labels - 1)))
    JP[i] = log_loss(y_train, Z_add)

    Weights_sub = np.copy(Weights_delta)
    Weights_sub[i] = Weights_sub[i] - ep
    Z_sub = sigmoid(np.dot(X_b, Weights_sub.reshape(X_train.shape[1] + 1, num_unique_labels - 1)))
    JM[i] = log_loss(y_train, Z_sub)

    # Two-sided (central) difference approximation of the gradient
    J_app[i] = (JP[i] - JM[i]) / (2 * ep)

# Relative difference between analytic and approximate gradients
num = np.linalg.norm(grad - J_app)
denom = np.linalg.norm(grad) + np.linalg.norm(J_app)
num / denom

This results in a value of num/denom = 8.244172628899919e-10, which confirms that the gradient calculation is correct. For the multi-class version, I've used the same gradient calculation as above but changed the output activation to softmax (also taken from scipy), applying it with axis=1 so the probabilities are normalized across classes for each sample, since mine is a row-major implementation.
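As a quick note on why the same gradient expression carries over: with one-hot targets and softmax outputs under categorical cross-entropy, the per-sample gradient of the loss with respect to the pre-activations reduces to the same error term as in the sigmoid/log-loss case, so the averaged weight gradient keeps the same form:

$$\frac{\partial \mathrm{CE}}{\partial z} = \hat{y} - y \quad\Rightarrow\quad \nabla_W L = \frac{1}{m} X_b^{\top} (\hat{Y} - Y)$$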

Algorithm Implementation Code (multi-class):

*Dimensions: X = (150, 4); Weights = (5, 3), including the bias row; y = (150, 3)*

import numpy as np

from sklearn.datasets import load_iris, load_digits
from sklearn.preprocessing import LabelBinarizer
from keras.losses import CategoricalCrossentropy
from scipy.special import softmax

CCE = CategoricalCrossentropy()


dataset = load_iris()
lb = LabelBinarizer()


X = dataset.data
y = dataset.target

lb.fit(y)

data = np.concatenate((X,y.reshape(-1,1)), axis = 1)
np.random.shuffle(data)

X_train = data[:, :-1]
X_b = np.c_[np.ones((X_train.shape[0] , 1)), X_train]


y_train = lb.transform(data[:, -1]).reshape(-1,3)


num_unique_labels = len( np.unique(y) )


Weights = np.random.randn(X_train.shape[1]+1, num_unique_labels) * np.sqrt(1./ (X_train.shape[1]+1)  )




m = X_b.shape[0]

# Forward pass: softmax across classes (axis=1, row-major) and mean categorical cross-entropy
yhat = softmax(np.dot(X_b, Weights), axis=1)
cce_loss = CCE(y_train, yhat).numpy()

# Same gradient form as in the binary case
error = yhat - y_train
gradient = (1. / m) * (X_b.T.dot(error))

Gradient Checking (multi-class):

grad = gradient.reshape(-1, 1)
Weights_delta = Weights.reshape(-1, 1)
num_params = Weights_delta.shape[0]

JP = np.zeros((num_params, 1))
JM = np.zeros((num_params, 1))
J_app = np.zeros((num_params, 1))

ep = 1e-7

for i in range(num_params):

    # Perturb one parameter at a time and evaluate the loss on both sides
    Weights_add = np.copy(Weights_delta)
    Weights_add[i] = Weights_add[i] + ep
    Z_add = softmax(np.dot(X_b, Weights_add.reshape(X_train.shape[1] + 1, num_unique_labels)), axis=1)
    JP[i] = CCE(y_train, Z_add).numpy()

    Weights_sub = np.copy(Weights_delta)
    Weights_sub[i] = Weights_sub[i] - ep
    Z_sub = softmax(np.dot(X_b, Weights_sub.reshape(X_train.shape[1] + 1, num_unique_labels)), axis=1)
    JM[i] = CCE(y_train, Z_sub).numpy()

    # Two-sided (central) difference approximation of the gradient
    J_app[i] = (JP[i] - JM[i]) / (2 * ep)

# Relative difference between analytic and approximate gradients
num = np.linalg.norm(grad - J_app)
denom = np.linalg.norm(grad) + np.linalg.norm(J_app)
num / denom

This resulted in a value of 0.3345, which is clearly an unacceptable difference. It now makes me wonder whether I can trust my gradient checking code for the binary case in the first place. I've also tested this logistic regression code (with the same gradient calculation) on the digits dataset, and the performance was again very good (>95% accuracy, precision, and recall). What's really fascinating to me is that even though the model performs well, it fails the gradient check. The same goes for the neural network I mentioned earlier (passes for binary, fails for multi-class).

I even tried the code that Andrew Ng offers as part of his Coursera course; even that code passes for binary and fails for multi-class. I can't figure out where my code has bugs, and if it does have minor bugs, how could it pass the check in the binary case in the first place?

I looked at these SO questions, but I feel they had a different issue than mine:

1. Gradient checking in backpropogation
2. Checking the gradients when doing ...
3. problem with ann back-propagation ..

Here's what I'm looking for:

1. Suggestions/corrections on whether my gradient calculation and gradient checking code for binary prediction are accurate.

2. Suggestions/general directions on where I could be going wrong with the multi-class implementation.

What will you get: (:P)

The gratitude of a 20-something tech guy who believes every documentation page is poorly written :)

Update: Corrected some typos and added more lines of code as suggested by Alex. I also realized that my approximate gradient values (J_app) for multi-class prediction are pretty high (on the order of 1e+2); because I multiply my analytic gradients (gradient) by the factor (1./m), those come out around 1e-1 to 1e-2.

This obvious difference in the range of the approximate gradients versus my analytic gradients explains why I got a final value as large as 0.3345. But what I haven't been able to figure out is how to go about fixing this seemingly obvious bug.
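A minimal sketch of one way to check whether the mismatch comes from the loss computation itself: swap the Keras loss inside the gradient-checking loop for a hand-rolled double-precision cross-entropy (the helper cce_numpy below is only illustrative and not part of the original script):

def cce_numpy(y_true, y_pred, eps=1e-12):
    # Mean categorical cross-entropy computed in float64; clip predictions to avoid log(0)
    y_pred = np.clip(np.asarray(y_pred, dtype=np.float64), eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

# Drop-in replacement inside the loop above, e.g.:
# JP[i] = cce_numpy(y_train, Z_add)
# JM[i] = cce_numpy(y_train, Z_sub)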

Amith Adiraju
  • You should post your actual code here, not the version with "# almost same thing here", "# softmax from scipy". It is clear that e.g. Weights in your multi_class case should not be of (4, 1) shape, but (4, 3) if you have 3 classes. I would say that you made a mistake in that part, but then your code also has `Weights_add.reshape(..., 3)`, so probably in your actual code you do have Weights of shape (4, 3). So if you want someone to help you - edit the question to have your actual code and that will greatly increase chances that someone will help you. – Alexander Pivovarov Jun 16 '20 at 05:24
  • @AlexanderPivovarov I appreciate your feedback. I agree I had a lot of typos in my first version of the code, and the information I gave through my code wasn't obvious, which made the issue hard to debug. So I've made those changes now. I'd have appreciated it even more had you also given insights or even general directions on what could have gone wrong, rather than mere blaming and complaining. – Amith Adiraju Jun 16 '20 at 19:30
  • I actually did try my best to help you. I loaded your code and made all the changes needed to run the snippets you gave in the original question. And after doing all of that I still couldn't really see what your computation was actually doing (because you didn't show the important parts) - the very computation you asked for help debugging. So I went and asked you to add more details to the question. But of course, from your point of view I'm just complaining. – Alexander Pivovarov Jun 16 '20 at 19:43
  • @AlexanderPivovarov, thank you for adding the missing pieces of my code and trying to figure out where I went wrong. As I said, I should've paid more attention while creating the question and added more relevant information. Appreciate your time :) ! – Amith Adiraju Jun 16 '20 at 20:41

1 Answer


All your computations seem to be correct. The reason the gradient check is failing is that CategoricalCrossentropy from Keras runs in single precision by default. Because of that, you are not getting enough precision in the loss differences caused by the small weight perturbations. Add the following lines at the beginning of your script and you will get num/denom usually around 1e-9:

import keras
keras.backend.set_floatx('float64')
Alexander Pivovarov
  • It worked like a charm! :) Thank you so much! I had spent many sleepless nights thinking about how my life could improve if I couldn't solve this issue (LOL). Didn't realize this detail about the Keras backend precision. Thanks again :D – Amith Adiraju Jun 17 '20 at 03:20
  • Happy to help. Initially I was mostly suspecting something was wrong with the gradient computation. Only when you provided a fully working reproducible script was I able to get to the root cause. – Alexander Pivovarov Jun 17 '20 at 03:29
  • Alex, just curious about one interesting aspect of this code, I might have found something strange, see if this interests you. If I remove the Xavier initialization variant from my weights, i.e. just using `Weights = np.random.randn(X_train.shape[1]+1, num_unique_labels)`, and run the code without any other changes, I keep getting a result on the order of 1e-1. But if I put the Xavier initialization back, I get a result on the order of 1e-9. Curious to know what your thoughts are about this. Nothing wrong with your solution, just trying to see if there's something more to it. – Amith Adiraju Jun 17 '20 at 22:30
  • I personally feel weight initialization shouldn't matter to whether the gradient calculation is on the right track or not. But in this specific case it changes the result a great deal. So I'm trying to see why this could be the case. You've already helped so much, apologies if I'm pushing you too often; I just felt this code behavior is worth sharing. – Amith Adiraju Jun 17 '20 at 22:34
  • At some point even double precision won't be enough for such gradient approximations. It all depends on the computation graph you have. In this particular case it seems that whenever you have a very small prediction for the correct class, small changes in the weights are not enough to move the final results. You can still resolve this by creating `CCE = CategoricalCrossentropy(from_logits=True)` and providing CCE with logits (pre-softmax values) for the loss computation. This increases the precision of the computation and you will see the gradients matching again (see the sketch after this thread). – Alexander Pivovarov Jun 17 '20 at 23:07
  • Initialization matters here because with small absolute weights you have a much lower chance of hitting the issue of a very small prediction for the correct class. Initialization with larger absolute weights therefore often (but not always, depending on the random seed) gives you the wrong result. By the way, I would absolutely recommend always using the `from_logits` version of the CCE instead of actually softmaxing values. – Alexander Pivovarov Jun 17 '20 at 23:08
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/216175/discussion-between-amith-adiraju-and-alexander-pivovarov). – Amith Adiraju Jun 17 '20 at 23:41
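For completeness, a minimal sketch of the `from_logits` approach suggested in the comments above, reusing the variable names from the question's multi-class script (this is an illustration of the suggestion, not code from the original post):

from keras.losses import CategoricalCrossentropy

CCE_logits = CategoricalCrossentropy(from_logits=True)

logits = np.dot(X_b, Weights)               # pre-softmax scores, shape (m, 3)
loss = CCE_logits(y_train, logits).numpy()  # softmax is applied internally, with better numerical precision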