K folding using sklearn with specific clusters instead of splitting with specific size

Question

I would like to do K-fold cross-validation with sklearn in Python. My data has 8 users, and so far I only do K-fold on the data of one user. Is it possible to do cross-validation between the users? For instance, to use 7 users as the training set and 1 user as the test set, and repeat that for all 8 possible choices of test user?

score 2 · Accepted Answer · answered Sep 24 '19 at 10:35

Yes, this is possible. You can use cross-validation with groups for this. Making sure that all data points from one person end up in either the training set or the test set, never both, is called grouping or blocking. In scikit-learn, this is achieved by passing an array of group membership values (one per sample) to cross_val_score via its groups argument. You can then use scikit-learn's GroupKFold class as the cross-validation procedure, with n_splits set to the number of groups. See the example below (a simple logistic regression model, just to illustrate the usage of the GroupKFold class).

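from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

# create synthetic dataset
X, y = make_blobs(n_samples=12, random_state=0)

# the first three samples belong to the same group, etc.
groups = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]

# simple model, just to illustrate GroupKFold
logreg = LogisticRegression()

# pass groups by keyword; with 4 groups and n_splits=4, each fold holds out one group
scores = cross_val_score(logreg, X, y, groups=groups, cv=GroupKFold(n_splits=4))

print("cross_val_score(logreg, X, y, groups=groups, cv=GroupKFold(n_splits=4))")
print("Cross-validation scores :\n{}".format(scores))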
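
To check which group ends up in each test fold, the splitter can also be iterated directly. The following is only a minimal sketch: it reuses the 12-sample, 4-group layout from the example, but the feature values and labels are placeholders and the variable names (gkf, held_out) are illustrative.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Placeholder data: 12 samples, 2 features (values are arbitrary)
X = np.arange(24).reshape(12, 2)
y = np.array([0, 1] * 6)
groups = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])

gkf = GroupKFold(n_splits=4)
held_out = []  # groups that appear in each test fold

for train_idx, test_idx in gkf.split(X, y, groups=groups):
    train_groups = np.unique(groups[train_idx]).tolist()
    test_groups = np.unique(groups[test_idx]).tolist()
    # No group may appear on both sides of a split
    assert set(train_groups).isdisjoint(test_groups)
    held_out.append(test_groups)
    print("train groups:", train_groups, "| test groups:", test_groups)
```

With as many splits as groups, every fold should hold out exactly one group, which is the "7 users for training, 1 user for testing" setup from the question.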

answered Sep 24 '19 at 10:35

Psychotechnopath


However, my data set consists of 8 users whose time series are not of equal length. So, if I understand correctly, putting them in one array probably won't work. Am I forced to make them equal? – Anast Tzin Sep 24 '19 at 11:40
I think GroupKFold does not shuffle your data by default (the other K-fold splitters do not either, unless you specify the parameter shuffle=True). Thus you can put them in one array, because the order of the data will be preserved. – Psychotechnopath Sep 24 '19 at 12:22
So if the first user has 51000 timestamps and the second one 55600 timestamps, I would create one array and assign the first 51000 rows to user 1, the next 55600 rows to user 2, etc.? – Anast Tzin Sep 24 '19 at 12:43
Yes, if you tell GroupKFold correctly which user each of your timestamps belongs to (I assume you have something like 51k rows of timestamp data for user x and 55.6k rows for user y). So pass a groups array with one label per row, using the values 1 to 8 if you have 8 users, and run your GroupKFold on those groups; sklearn will then automatically take care of this for you. – Psychotechnopath Sep 24 '19 at 13:03
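
For users with different numbers of rows, as discussed in the comments, the group labels can be built with np.repeat. This is a rough sketch under stated assumptions: the row counts are scaled-down stand-ins (e.g. 51 and 56 instead of 51000 and 55600), the features and labels are random placeholders, and names such as user_lengths are hypothetical. It uses LeaveOneGroupOut rather than GroupKFold; with one fold per user the two should produce the same kind of split.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)

# Hypothetical row counts for 8 users (unequal on purpose)
user_lengths = [51, 56, 40, 63, 48, 70, 55, 44]

# Placeholder per-user data: 3 random features and a random binary label
X_parts = [rng.normal(size=(n, 3)) for n in user_lengths]
y_parts = [rng.integers(0, 2, size=n) for n in user_lengths]

# Stack all users into one array; no padding or truncation is needed
X = np.vstack(X_parts)
y = np.concatenate(y_parts)

# One group label per row: user 1 for the first 51 rows, user 2 for the next 56, ...
groups = np.repeat(np.arange(1, 9), user_lengths)

# Each of the 8 folds trains on 7 users and tests on the remaining one
scores = cross_val_score(
    LogisticRegression(), X, y, groups=groups, cv=LeaveOneGroupOut()
)
print(scores)
```

The important part is that groups has the same length as X and lines up with it row by row; the users do not need to have equal amounts of data.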
