
I am currently working on a two-class classification problem in scikit-learn, using a neural network classifier (MLPClassifier) with the solver adam and the activation relu. To explore whether my classifier suffers from high bias or high variance, I plotted the learning curve with scikit-learn's built-in function:

https://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html

I am using GroupKFold cross-validation with 8 splits. However, I found that my learning curve depends strongly on the batch size of my classifier:

https://i.stack.imgur.com/FBLrY.jpg

Is it supposed to be like this? I thought learning curves show the accuracy score as a function of the portion of training data used, independent of any batches/epochs. Can I actually use this built-in function for batch methods? If yes, which batch size should I choose (full batch, i.e. batch size = number of training examples, a batch size of 1, or something in between), and what diagnosis do I get from this? Or how do you usually diagnose bias/variance problems of a neural network classifier?
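
For concreteness, here is a minimal sketch of the kind of setup I mean (the data, group labels, and batch sizes below are placeholders, not my real experiment):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import GroupKFold, learning_curve

    rng = np.random.RandomState(0)
    X = rng.randn(400, 10)                 # placeholder features
    y = rng.randint(0, 2, size=400)        # two classes
    groups = rng.randint(0, 20, size=400)  # placeholder group labels

    cv = GroupKFold(n_splits=8)
    # sklearn clips batch_size to the current training-set size,
    # so len(X) behaves like full batch on every subset
    for batch_size in (32, 200, len(X)):
        clf = MLPClassifier(solver="adam", activation="relu",
                            batch_size=batch_size, max_iter=500)
        sizes, train_scores, val_scores = learning_curve(
            clf, X, y, groups=groups, cv=cv,
            train_sizes=np.linspace(0.1, 1.0, 5))
        plt.plot(sizes, train_scores.mean(axis=1),
                 label="train, bs=%s" % batch_size)
        plt.plot(sizes, val_scores.mean(axis=1), "--",
                 label="CV, bs=%s" % batch_size)
    plt.xlabel("number of training examples")
    plt.ylabel("accuracy")
    plt.legend()
    plt.show()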

Help would be really appreciated!

S.Maria
  • Maybe also for this example: what would my diagnosis be here? To me it looks like high bias, since both the training and the cross-validation score are low. However, looking at the batch size of 200, it seems that if I had taken batch size = number of training examples, the training score would be high and it would look like overfitting. – S.Maria Mar 26 '19 at 08:17

1 Answer


Yes, the learning curve depends on the batch size.

The optimal batch size depends on the type of data and the total volume of the data.
In the ideal case a batch size of 1 would be best, but in practice, with big volumes of data, this approach is not feasible.
I think you have to find it through experimentation, because you can't easily calculate the optimal value.

Moreover, when you change the batch size you might want to change the learning rate as well, so you keep control over the process.
But indeed, having a tool to find the optimal (memory- and time-wise) batch size would be quite interesting.
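
One way to run that experiment, sketched below with placeholder data and arbitrary candidate values (assuming an MLPClassifier like the one in the question), is to search batch size and learning rate jointly with cross-validation, since the two interact:

    import numpy as np
    from sklearn.model_selection import GridSearchCV, GroupKFold
    from sklearn.neural_network import MLPClassifier

    rng = np.random.RandomState(0)
    X, y = rng.randn(400, 10), rng.randint(0, 2, 400)  # placeholder data
    groups = rng.randint(0, 20, 400)                   # placeholder groups

    param_grid = {                          # candidate values are arbitrary examples
        "batch_size": [16, 64, 200],
        "learning_rate_init": [1e-4, 1e-3, 1e-2],
    }
    search = GridSearchCV(
        MLPClassifier(solver="adam", activation="relu", max_iter=500),
        param_grid, cv=GroupKFold(n_splits=8))
    search.fit(X, y, groups=groups)         # groups are forwarded to the CV splitter
    print(search.best_params_)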


What is Stochastic Gradient Descent?

Stochastic gradient descent, often abbreviated SGD, is a variation of the gradient descent algorithm that calculates the error and updates the model for each example in the training dataset.

The update of the model for each training example means that stochastic gradient descent is often called an online machine learning algorithm.
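
To illustrate, here is a toy NumPy sketch of SGD for logistic regression (illustrative only, not scikit-learn's actual implementation): the weights are updated immediately after every single example.

    import numpy as np

    def sgd_epoch(w, X, y, lr=0.01):
        """One epoch of stochastic gradient descent on log-loss."""
        for xi, yi in zip(X, y):               # one example at a time
            p = 1.0 / (1.0 + np.exp(-xi @ w))  # predicted probability
            w = w - lr * (p - yi) * xi         # update after each example
        return w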

What is Batch Gradient Descent?

Batch gradient descent is a variation of the gradient descent algorithm that calculates the error for each example in the training dataset, but only updates the model after all training examples have been evaluated.

One cycle through the entire training dataset is called a training epoch. Therefore, it is often said that batch gradient descent performs model updates at the end of each training epoch.
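
For contrast, the same toy model with batch gradient descent: the gradient is averaged over the whole training set and the weights are updated once per epoch.

    def batch_gd_epoch(w, X, y, lr=0.01):
        """One epoch of batch gradient descent on log-loss."""
        p = 1.0 / (1.0 + np.exp(-X @ w))  # predictions for all examples
        grad = X.T @ (p - y) / len(y)     # gradient averaged over the dataset
        return w - lr * grad              # single update at the end of the epoch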

What is Mini-Batch Gradient Descent?

Mini-batch gradient descent is a variation of the gradient descent algorithm that splits the training dataset into small batches that are used to calculate model error and update model coefficients.

Implementations may choose to sum the gradient over the mini-batch or to take its average, which further reduces the variance of the gradient.

Mini-batch gradient descent seeks to find a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent. It is the most common implementation of gradient descent used in the field of deep learning.
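
Completing the toy sketch above: mini-batch gradient descent shuffles the data, slices it into batches, averages the gradient within each batch, and updates once per batch.

    def minibatch_gd_epoch(w, X, y, lr=0.01, batch_size=32):
        """One epoch of mini-batch gradient descent on log-loss."""
        idx = np.random.permutation(len(y))         # shuffle each epoch
        for start in range(0, len(y), batch_size):
            b = idx[start:start + batch_size]
            p = 1.0 / (1.0 + np.exp(-X[b] @ w))
            grad = X[b].T @ (p - y[b]) / len(b)     # averaged over the mini-batch
            w = w - lr * grad                       # one update per mini-batch
        return w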


Source: https://machinelearningmastery.com/gentle-introduction-mini-batch-gradient-descent-configure-batch-size/

Antoan Milkov
  • Another question for my understanding of mini-batch: is it correct that the weights are updated after processing one batch? But then why does the learning curve change after every single training example? There seems to be a misconception on my side; I hope somebody can clarify this. – S.Maria Mar 26 '19 at 11:48
  • I think I just got it: after each batch the model is updated. However, the learning curve evaluates the model for every training-set size. The more training examples, the higher the training score, within the confines of the last update of the last batch. But then the learning curve is useless for full batch, since there will never be an update within my learning curve? Is this correct? – S.Maria Mar 26 '19 at 11:58
  • And is a learning curve useful at all for neural networks, since you train over several epochs anyway? How do you assess an NN model, especially in scikit-learn? – S.Maria Mar 26 '19 at 12:29