
The training dataset I'm using is a grayscale image that was flattened so that each pixel represents an individual sample. After the Multilayer Perceptron (MLP) classifier is trained on that image, a second image will be classified pixel by pixel.
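For reference, the flattening described above can be done with a NumPy reshape (the `image` array here is a random placeholder for the real data):

```python
import numpy as np

# A 512x512 grayscale image: each pixel becomes one sample with a single feature.
image = np.random.rand(512, 512)   # placeholder for the actual image
X = image.reshape(-1, 1)           # shape (262144, 1): one row per pixel
print(X.shape)
```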

The problem I have is that the MLP performs better when it receives the whole training dataset at once (fit()) than when it is trained in chunks (partial_fit()). I'm keeping the default Scikit-learn parameters in both cases.

I'm asking this question because when the training dataset reaches millions of samples, I will have to use partial_fit() to train the MLP in chunks.

import numpy as np
from sklearn.neural_network import MLPClassifier

def batcherator(data, target, chunksize):
    for i in range(0, len(data), chunksize):
        yield data[i:i+chunksize], target[i:i+chunksize]

def classify():
    classifier = MLPClassifier(verbose=True)

    # classifier.fit(training_data, training_target)

    gen = batcherator(training_data, training_target, 1000)
    for chunk_data, chunk_target in gen:
        classifier.partial_fit(chunk_data, chunk_target,
                               classes=np.array([0, 1]))

    predictions = classifier.predict(test_data)

My question is: which parameters of the MLP classifier should I adjust to make its results more acceptable when it's trained on chunks of data?

I've tried increasing the number of neurons in the hidden layer using hidden_layer_sizes, but I didn't see any improvement. Nor did changing the hidden layer's activation function from the default relu to logistic (via the activation parameter) help.

Below are the images I'm working with (all of them are 512x512) with links to the Google Fusion tables where they were exported as CSV from the NumPy arrays (to keep the pixel values as floats instead of ints):

Training_data:

The white areas are masked out: Google Fusion Table (training_data)

Class0:


Class1:


Training_target:

Google Fusion Table (training_target)

Test_data:


Google Fusion Table (test_data)

Prediction (with partial_fit):


Google Fusion Table (predictions)

Hakim
    You are comparing a linear classifier (less affected by parameter tuning) and a non-linear classifier (more affected) with default settings on some unknown dataset, without showing preprocessing and so on, asking why one dominates the other without showing any metric for this domination. Furthermore: as one is linear and the other non-linear, you tried to improve the non-linear one, which was worse, by making it *more nonlinear*. Most of this does not make much sense and as it's not reproducible it's hard to help. Except for maybe saying: remove the hidden layer / increase regularization. – sascha Dec 06 '17 at 01:39
    Also: SGD is allowed to do 5 epochs. MLP does 1 (with quite big minibatches). – sascha Dec 06 '17 at 01:45
  • @sascha I've just updated the question as my problem changed a bit. Let me know if any aspect of the question needs more detail. – Hakim Dec 07 '17 at 19:55

2 Answers


TL;DR: make several passes over your data with a small learning rate and a different order of observations each time, and your partial_fit will perform as well as fit.

The problem with partial_fit over many chunks is that by the time your model completes the last chunk, it may have forgotten the first one: the changes to the model weights made by the early batches get completely overwritten by the late batches.

This problem, however, can be solved easily enough with a combination of:

  1. A low learning rate. If the model learns slowly, it also forgets slowly, and the early batches will not be overwritten by the late ones. The default learning rate in MLPClassifier is 0.001, but you can change it by factors of 3 or 10 and see what happens.
  2. Multiple epochs. If the learning rate is low, then one pass over all the training samples might not be enough for the model to converge, so you can make several passes over the training data, and the result will most probably improve. An intuitive strategy is to increase the number of passes by the same factor by which you decrease the learning rate.
  3. Shuffling observations. If images of dogs come before images of cats in your data, the model will end up remembering more about cats than about dogs. If, however, you shuffle the observations in the batch generator, this is not a problem. The safest strategy is to reshuffle the data anew before each epoch.
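Putting the three points together, a training loop could look like the sketch below. It reuses the batcherator and chunk size from the question; the toy data, epoch count, and learning rate are illustrative stand-ins, not values from the question.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def batcherator(data, target, chunksize):
    for i in range(0, len(data), chunksize):
        yield data[i:i+chunksize], target[i:i+chunksize]

# Toy stand-ins for training_data / training_target
rng = np.random.RandomState(0)
training_data = rng.rand(5000, 1)
training_target = (training_data[:, 0] > 0.5).astype(int)

classifier = MLPClassifier(learning_rate_init=0.001)   # point 1: try lowering this

n_epochs = 10                                          # point 2: multiple epochs
for epoch in range(n_epochs):
    order = rng.permutation(len(training_data))        # point 3: reshuffle each epoch
    shuffled_data = training_data[order]
    shuffled_target = training_target[order]
    for chunk_data, chunk_target in batcherator(shuffled_data, shuffled_target, 1000):
        classifier.partial_fit(chunk_data, chunk_target,
                               classes=np.array([0, 1]))

predictions = classifier.predict(training_data)
```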
David Dale
  • You were right about the `shuffling` being necessary. When I add it I get results with `partial_fit()` similar to those I got with `fit()`. Another factor that slightly improved the accuracy of the classification is the [balancing](http://contrib.scikit-learn.org/imbalanced-learn/stable/generated/imblearn.over_sampling.SMOTE.html) of the dataset (as I have one training class which is quite large compared to the other one). I didn't try to adjust the learning rate and the epochs yet though. – Hakim Dec 12 '17 at 11:17
  • @Hakim I noticed that `partial_fit` runs for 1 epoch by default. So, are you running it for more epochs than that? If yes, how? – Pe Dro Aug 27 '20 at 08:15
  • just run it multiple times in a `for` loop – David Dale Aug 27 '20 at 09:08

Rather than manually providing a learning rate, you can use the adaptive learning rate functionality provided by sklearn.

from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss="hinge", penalty="l2", alpha=0.0001, max_iter=3000,
                      tol=None, shuffle=True, verbose=0,
                      learning_rate='adaptive', eta0=0.01, early_stopping=False)

This is described in the scikit-learn docs as:

‘adaptive’: eta = eta0, as long as the training keeps decreasing. Each time n_iter_no_change consecutive epochs fail to decrease the training loss by tol or fail to increase validation score by tol if early_stopping is True, the current learning rate is divided by 5.
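As a minimal sketch, here is the classifier above trained on toy binary data standing in for the pixel samples from the question (the data here is illustrative, not the questioner's):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Toy 1-D binary data: label is whether the pixel value exceeds 0.5
rng = np.random.RandomState(0)
X = rng.rand(2000, 1)
y = (X[:, 0] > 0.5).astype(int)

model = SGDClassifier(loss="hinge", penalty="l2", alpha=0.0001, max_iter=3000,
                      tol=None, shuffle=True, verbose=0,
                      learning_rate='adaptive', eta0=0.01, early_stopping=False)
model.fit(X, y)   # the adaptive schedule divides eta by 5 when the loss plateaus
print(model.score(X, y))
```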

Pe Dro