
The task is binary classification with a neural network. The data is stored in a dictionary whose keys are composite names and whose values are vectors; the label (0 or 1) is the third element of each value. The first and second elements are the two parts of the composite name, which are used later to extract the corresponding features.
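For reference, a minimal sketch of the assumed dictionary layout (the keys and values here are invented for illustration):

data = {
    "alpha_beta":  ["alpha", "beta", 0],    # [first part, second part, label]
    "alpha_gamma": ["alpha", "gamma", 1],
    "delta_beta":  ["delta", "beta", 0],
}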

In both cases (the two splitting approaches shown below), the dictionary is first transformed into two arrays so that the majority class (which makes up 66% of the data) can be undersampled to balance the labels:

import numpy as np
from imblearn.under_sampling import RandomUnderSampler

# Extract the composite names (keys) and the labels (third element of each value)
data_for_sampling = np.asarray(list(data.keys()))
labels_for_sampling = [element[2] for element in data.values()]

# Undersample the majority class so that the labels are balanced
sampler = RandomUnderSampler(sampling_strategy='majority')
data_sampled, label_sampled = sampler.fit_resample(data_for_sampling.reshape(-1, 1), labels_for_sampling)

Then the resampled arrays of names and labels are used to create train and test sets, either via the KFold method:

from sklearn.model_selection import KFold

kfolder = KFold(n_splits=10, shuffle=True)
kfolder.get_n_splits(data_sampled)

# Each iteration yields one train/test split of the resampled names
for train_index, test_index in kfolder.split(data_sampled):
    data_train, data_test = data_sampled[train_index], data_sampled[test_index]

or via the train_test_split method:

from sklearn.model_selection import train_test_split
data_train, data_test, label_train, label_test = train_test_split(data_sampled, label_sampled, test_size=0.1, shuffle=True)

Finally, the names in data_train and data_test are used to look up the corresponding entries (by key) in the original dictionary, which is then used to gather the features of those entries. As far as I can tell, a single split from the 10-fold scheme should give a similar train/test distribution to the 90/10 train_test_split, and that seems to hold during training: both training sets reach ~0.82 accuracy after a single epoch, run separately with model.fit(). However, when the corresponding models are evaluated with model.evaluate() on the test sets after that epoch, the set from train_test_split gives a score of ~0.86, while the set from KFold gives ~0.72. I have tested this repeatedly to rule out an unlucky random seed (which is not fixed), but the results stay the same. The sets also have correctly balanced label distributions and appear to be well shuffled.
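For context, here is a minimal sketch of the kind of lookup described above; build_set() and get_features() are hypothetical names, not part of the original code:

# Hypothetical sketch: re-extract entries and features from the original dictionary by key.
# get_features() stands in for however the two name parts are turned into model inputs.
def build_set(keys, data):
    features, labels = [], []
    for key in keys.ravel():    # keys come back as an (n, 1) column from fit_resample
        part_a, part_b, label = data[key][:3]
        features.append(get_features(part_a, part_b))
        labels.append(label)
    return np.asarray(features), np.asarray(labels)

train_x, train_y = build_set(data_train, data)
test_x, test_y = build_set(data_test, data)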

oliver.c

2 Answers


As it turns out, the problem originates from a combination of sources:

While shuffle = True in train_test_split() shuffles the provided data first and then splits it into the desired parts, shuffle = True in KFold only randomizes which samples end up in which fold; the data within each fold remains in its original order.

The documentation points this out, as discussed in this issue: https://github.com/scikit-learn/scikit-learn/issues/16068
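A quick way to see the difference (a small illustrative check, not code from the original post):

import numpy as np
from sklearn.model_selection import KFold, train_test_split

x = np.arange(20)

# KFold with shuffle=True randomizes which samples land in which fold,
# but the returned index arrays are always in ascending order.
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(x):
    print(test_idx)    # sorted indices -> the data within the fold keeps its original order

# train_test_split with shuffle=True returns the rows themselves in shuffled order.
print(train_test_split(x, test_size=0.25, shuffle=True, random_state=0)[1])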

Now, during learning, my custom train function applies shuffle again to the training data, just to be sure, but it does not shuffle the test data. Moreover, model.evaluate() defaults to batch_size = 32 if no parameter is given, which, combined with the ordered test data, produced the discrepancy in the validation accuracy. The test data is indeed flawed in the sense that it contains a large portion of "hard-to-predict" entries, which were clustered together because of the ordering and seem to have dragged down the average accuracy in the results. A completed run across all N folds, as pointed out by TC Arlen, may indeed have given a more precise estimate in the end, but I expected closer results after only one fold, which led to the discovery of this problem.
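For completeness, a minimal sketch of the kind of fix this implies, namely shuffling the fold indices explicitly before slicing out the data (np.random.default_rng is just one way to do this):

import numpy as np

rng = np.random.default_rng()

for train_index, test_index in kfolder.split(data_sampled):
    # KFold returns both index arrays in ascending order, so shuffle them
    # before using them to build the train and test sets.
    rng.shuffle(train_index)
    rng.shuffle(test_index)
    data_train, data_test = data_sampled[train_index], data_sampled[test_index]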

oliver.c

Depending on the amount of noise in the data and on the size of the dataset, it could be expected behavior for scores on out-of-sample data to deviate by this amount. One split is not guaranteed to be just like any other split, which is why you have 10 of them in the first place and then average across all results.

What you should trust to be the most generalizable is not any one given split (whether it comes from one of the 10 folds or from train_test_split()); far more trustworthy is the average result across all N folds.
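As an illustrative sketch (build_model(), features and labels are placeholders for however the network and inputs are actually built, and the model is assumed to be compiled with an accuracy metric), the per-fold scores would be collected and averaged like this:

import numpy as np
from sklearn.model_selection import KFold

fold_scores = []
for train_index, test_index in KFold(n_splits=10, shuffle=True).split(features):
    model = build_model()    # placeholder: create a fresh model for every fold
    model.fit(features[train_index], labels[train_index], epochs=1)
    _, accuracy = model.evaluate(features[test_index], labels[test_index])
    fold_scores.append(accuracy)

# Report the mean and the spread across folds rather than any single split
print(np.mean(fold_scores), np.std(fold_scores))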

Digging deeper into the data could reveal whether there is some reason why one or more splits deviate so much from the others. For example, perhaps there is some feature in your data (e.g. "date the sample was collected", where the collection methodology changed from month to month) that makes parts of the data differ from one another in a biased way. If that is the case, you should use a stratified split for your test set (and in your CV as well; see the scikit-learn documentation) so you get a less biased grouping of your data.
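For example, assuming label_sampled holds the class labels, stratification can be requested in both splitters:

from sklearn.model_selection import StratifiedKFold, train_test_split

# Stratified hold-out split: class proportions are preserved in train and test
data_train, data_test, label_train, label_test = train_test_split(
    data_sampled, label_sampled, test_size=0.1, shuffle=True, stratify=label_sampled)

# Stratified cross-validation: split() takes the labels to stratify on
skf = StratifiedKFold(n_splits=10, shuffle=True)
for train_index, test_index in skf.split(data_sampled, label_sampled):
    data_train, data_test = data_sampled[train_index], data_sampled[test_index]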

TC Arlen