The task is binary classification via a neural network. The data is present in a dictionary, that contains composite names (as the key) of each entries and the labels (0 or 1, as the third element in the vector value). The first and second elements are the two parts of the composite name, which are used later to extract the corresponding features.
In both cases, the dictionary is transformed into two arrays for the purpose of performing a balanced undersampling of the majority class (that is present in 66% of the data):
data_for_sampling = np.asarray([key for key in list(data.keys())])
labels_for_sampling = [element[2] for element in list(data.values())]
sampler = RandomUnderSampler(sampling_strategy = 'majority')
data_sampled, label_sampled = sampler.fit_resample(data_for_sampling.reshape(-1, 1), labels_for_sampling)
Then the resampled arrays of names and labels are used to create train and test sets via the Kfold method:
kfolder = KFold(n_splits = 10, shuffle = True)
kfolder.get_n_splits(data_sampled)
for train_index, test_index in kfolder.split(data_sampled):
data_train, data_test = data_sampled[train_index], data_sampled[test_index]
Or the train_test_split method:
data_train, data_test, label_train, label_test = train_test_split(data_sampled, label_sampled, test_size = 0.1, shuffle = True)
Finally, the names from data_train and data_test are used to re-extract the relevant entries (by key) from the original dictionary, that is then used to gather the features of those entries. As far as I'm concerned, a single split of the 10-fold sets should provide similar train-test data distribution as the 90-10 train_test_split, and that seems to be true during training, where both training sets result in ~0.82 accuracy after only one epoch, run separately with model.fit(). However, when the corresponding models are evaluated using model.evaluate() on the test sets after said epoch, the set from train_test_split gives a score of ~0.86, while the set from Kfold is ~0.72. I have done numerous testing to see if it's just a bad random seed, which is not bounded, but the results stayed the same. The sets also have correctly balanced label distributions and seemingly well-shuffled entries.