I want to split my data into train, test, and validation sets with stratification, but sklearn only provides cross_validation.train_test_split, which can only divide the data into two pieces. What should I do if I want a three-way split?
Viewed 6,575 times
2 Answers
6
If you want a stratified train/test split, you can use StratifiedKFold in sklearn.
Suppose X is your feature array and y is your label array; based on the example here:
from sklearn.model_selection import StratifiedKFold

# X and y are assumed to be NumPy arrays so they can be indexed with index arrays
skf = StratifiedKFold(n_splits=3)
for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
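Update: To split the data into three pieces by percentage, you can use numpy.split(). With cut points at 70% and 80% of the length, the pieces hold 70%, 10%, and 20% of the rows. Note that this slices the arrays in their existing order, so it neither shuffles nor stratifies:
import numpy as np
X_train, X_test, X_validate = np.split(X, [int(.7*len(X)), int(.8*len(X))])
y_train, y_test, y_validate = np.split(y, [int(.7*len(y)), int(.8*len(y))])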
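Because np.split cuts the arrays in their existing order, data that happens to be sorted by label would leave each piece with skewed classes. Here is a minimal sketch of shuffling first, assuming NumPy arrays; the toy data, shapes, and seed are invented for illustration. Shuffling only makes the class proportions approximately equal across pieces; it is not true stratification.

```python
import numpy as np

# Toy data, purely illustrative: 100 rows, 2 features, labels sorted by class.
X = np.arange(200).reshape(100, 2)
y = np.repeat([0, 1], 50)

# Apply the same permutation to X and y so rows and labels stay aligned.
rng = np.random.default_rng(seed=0)  # fixed seed only for reproducibility
perm = rng.permutation(len(X))
X_shuffled, y_shuffled = X[perm], y[perm]

# Cut points at 70% and 80% of the length give three consecutive pieces.
cuts = [int(0.7 * len(X)), int(0.8 * len(X))]
X_train, X_test, X_validate = np.split(X_shuffled, cuts)
y_train, y_test, y_validate = np.split(y_shuffled, cuts)

print(len(X_train), len(X_test), len(X_validate))
```

Moving the second cut point changes how the remaining 30% is divided between the test and validation pieces.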

Gambit1614
- Thanks for the answer, but I want to split a dataset into three pieces like [70%, 20%, 10%]; StratifiedKFold may not help. – loseryao Sep 15 '17 at 06:03
- @loseryao Oh sorry, I thought you meant 3 different folds. I will update that. – Gambit1614 Sep 15 '17 at 06:05
- Might be smart to shuffle the data. – Arya McCarthy Sep 15 '17 at 16:17
1
You can also use train_test_split more than once to achieve this. The second time, run it on the training output from the first call to train_test_split.
from sklearn.model_selection import train_test_split

def train_test_validate_stratified_split(features, targets, test_size=0.2, validate_size=0.1):
    # Get test sets
    features_train, features_test, targets_train, targets_test = train_test_split(
        features,
        targets,
        stratify=targets,
        test_size=test_size
    )
    # Run train_test_split again to get train and validate sets
    post_split_validate_size = validate_size / (1 - test_size)
    features_train, features_validate, targets_train, targets_validate = train_test_split(
        features_train,
        targets_train,
        stratify=targets_train,
        test_size=post_split_validate_size
    )
    return features_train, features_test, features_validate, targets_train, targets_test, targets_validate
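
The rescaling in post_split_validate_size is needed because the second call only sees the rows left over after the test set has been removed. Here is a small arithmetic check of that step; the helper name and the 1,000-row dataset size are hypothetical, chosen only for illustration.

```python
def rescaled_validate_fraction(test_size, validate_size):
    """Fraction of the remaining (non-test) rows that must go to validation.

    Hypothetical helper mirroring validate_size / (1 - test_size).
    """
    return validate_size / (1 - test_size)


n_rows = 1000  # assumed dataset size, for illustration only
test_size, validate_size = 0.2, 0.1

fraction = rescaled_validate_fraction(test_size, validate_size)

# First split: carve off the test rows from the full dataset.
n_test = round(n_rows * test_size)
n_remaining = n_rows - n_test

# Second split: the rescaled fraction is applied to what is left.
n_validate = round(n_remaining * fraction)
n_train = n_remaining - n_validate

print(fraction)
print(n_train, n_test, n_validate)
```

Passing validate_size=0.1 directly to the second call would take 10% of the remaining 800 rows, which is only 8% of the original data; dividing by (1 - test_size) corrects for that.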

Nathan Karasch