Questions tagged [train-test-split]

Questions with this tag are about splitting a machine learning data set into random train and test subsets.

In particular, questions with this tag may concern how to split data with scikit-learn, where the train_test_split helper function quickly computes a random split into training and test sets.

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
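
As a minimal sketch of the helper described above (the toy arrays, the 25% hold-out, and the seed are illustrative assumptions, not part of the tag description), a reproducible stratified split might look like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative only): 20 samples, 2 features, balanced binary labels.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# Hold out 25% of the rows. stratify=y keeps the class ratio similar in both
# subsets, and random_state makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```

Omitting stratify gives a plain random split; passing shuffle=False instead takes the last rows as the test set, which may suit ordered data.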

428 questions

1 answer

Difference between doing cross-validation and validation_data/validation_split in Keras

First, I split the dataset into train and test, for example: X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=999) I then use GridSearchCV with cross-validation to find the best performing…

asked Nov 07 '18 at 13:03

Long

2 answers

How can I split data into 3 or more parts with sklearn?

I want to split data into stratified train, test, and validation datasets, but sklearn only provides cross_validation.train_test_split, which can only divide data into 2 pieces. What should I do if I want to do this?

python machine-learning scikit-learn cross-validation train-test-split

asked Sep 15 '17 at 05:44

loseryao
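
One common approach to a question like this is to call train_test_split twice. The sketch below is only an illustration under assumptions: the toy data, the 60/20/20 proportions, and the seed are made up, and it imports from sklearn.model_selection, the module that hosts train_test_split in current scikit-learn releases.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced data (illustrative only): 80 samples of class 0, 20 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# Step 1: carve off 20% of the data as the test set, stratified on y.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Step 2: split the remaining 80% into train and validation.
# 0.25 of the remainder equals 20% of the original data.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```

Stratifying the second call on y_rest (not y) matters, because the labels must line up with the rows being split.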

1 answer

How to split a dataset to train/test where some rows are dependent?

I have a data set of subjects and each of them has a number of rows in my pandas dataframe (each measurement is a row and a subject could measure a few times). I would like to split my data into training and test set but I cannot split randomly…

python pandas train-test-split

asked Aug 31 '17 at 11:41

AR_
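
For dependent rows like these, one option is scikit-learn's GroupShuffleSplit, which assigns whole groups to one side of the split. The following is a rough sketch only; the subject labels, array shapes, and the 25% test fraction are hypothetical and not taken from the question.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: 12 measurements from 4 subjects (3 rows each).
subjects = np.repeat(["s1", "s2", "s3", "s4"], 3)
X = np.arange(24).reshape(12, 2)
y = np.arange(12) % 2

# test_size is the fraction of *groups* (here: subjects) placed in the test set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=subjects))

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# No subject appears on both sides of the split.
print(set(subjects[train_idx]) & set(subjects[test_idx]))
```

With a pandas dataframe, the same indices can be used with df.iloc[train_idx] and df.iloc[test_idx], passing the subject column as groups.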

1 answer

Stratified train-test splitting a Tensorflow dataset

I am currently working with a quite large image-dataset and I loaded it using ImageDataGenerator from tensorflow.keras in python. As the classification of my data is very imbalanced I wanted to do a stratified train-test-split to possibly achieve a…

python tensorflow keras train-test-split imbalanced-data

asked Mar 07 '22 at 13:12

user18398060

1 answer

How to ensure all samples from a specific group are together in train/test in sklearn cross_val_predict?

I have a dataframe where each sample belongs to a group. For example:

df =
a  b  c  group
1  1  2  G1
1  6  1  G1
8  2  8  G3
2  8  7  G2
1  9  2  G2
1  7  2  G3
4  0  2  G4
1  5  1  G4
6  7  8  G5
3  3  7  G6
1  2  2  …

python python-3.x scikit-learn cross-validation train-test-split

asked Jun 10 '20 at 11:52

Cranjis

1 answer

Train Test Split sklearn based on group variable

My X is as follows (EDIT1):

Unique ID  Exp start date  Value  Status
001        01/01/2020      4000   Closed
001        12/01/2019      4000   Archived
002        01/01/2020      5000   Closed
002        12/01/2019      …

python scikit-learn sklearn-pandas train-test-split

asked May 15 '20 at 19:08

Zee

2 answers

How to split images into test and train set using my own data in TensorFlow

I am a little confused here... I just spent the last hour reading about how to split my dataset into test/train in TensorFlow. I was following this tutorial to import my images: https://www.tensorflow.org/tutorials/load_data/images. Apparently one…

python scikit-learn tensorflow2.0 train-test-split

asked Feb 08 '20 at 20:30

Guillermina

2 answers

Is it a flaw that Optuna examples return the evaluation metric of the test set?

I am using Optuna for parameter optimization for some models. In almost all the examples the objective function returns an evaluation metric on the TEST set, and tries to minimize/maximize this. I feel like this is a flaw in the examples since…

python hyperparameters train-test-split optuna

asked Jan 31 '20 at 10:45

brian

0 answers

sklearn train_test_split dies and shuts down python kernel

I am struggling with using the train_test_split function from scikit-learn with 3D NumPy arrays. I have a feature array with shape (1860000, 144, 12) and a label array with shape (1860000,). In a different case train_test_split works well. But when…

python numpy scikit-learn train-test-split

asked Nov 19 '19 at 13:44

MadPhil

1 answer

How to split a dataset into a train and test dataset using hashcode method

I am following the code of Hands-On Machine Learning with Scikit-Learn and TensorFlow, 2nd edition. In the section on creating the train and test datasets, they use the following procedure: from zlib…

machine-learning train-test-split

asked Nov 12 '19 at 02:21

I. A

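
The general idea behind a hash-based split can be sketched as follows. This is not the book's code (the excerpt above is truncated): the function name is_in_test_set, the choice to hash the ID's decimal string, and the 20% ratio are all assumptions made for illustration.

```python
from zlib import crc32


def is_in_test_set(identifier: int, test_ratio: float) -> bool:
    """Return True if the ID's CRC32 hash falls in the lowest
    test_ratio fraction of the 32-bit hash range.

    Hypothetical helper: hashing the decimal string of the ID is an
    assumption of this sketch.
    """
    digest = crc32(str(identifier).encode("utf-8")) & 0xFFFFFFFF
    return digest < test_ratio * 2**32


# Hypothetical stable row identifiers.
ids = range(1000)
test_ids = [i for i in ids if is_in_test_set(i, 0.2)]
train_ids = [i for i in ids if not is_in_test_set(i, 0.2)]

print(len(train_ids), len(test_ids))
```

Because membership depends only on the identifier, a row keeps its assignment when new rows are added later, and the test fraction is only approximately the requested ratio.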

1 answer

Scikit-Learn GroupShuffleSplit is not grouping by specified groups

I am trying to split a timeseries of farm data taken at a daily frequency for 8 years. I want to split the data so that the train and test sets each contain samples from different farms, and there is no overlap of farms between the train and test…

python-3.x pandas scikit-learn data-science train-test-split

asked Oct 21 '19 at 15:33

Wonton

2 answers

Why accuracy of GridSearchCV method is lower than standard method?

I use train_test_split (random_state = 0) and a decision tree without any parameter tuning to model my data, and I run it about 50 times to achieve the best accuracy. import pandas as pd import numpy as np from sklearn import tree from sklearn.tree…

python decision-tree grid-search hyperparameters train-test-split

asked Jul 12 '19 at 08:39

mina

2 answers

train_test_split not splitting data

There is a dataframe that consists of 14 columns in total, the last column is the target label with integer values = 0 or 1. I have defined: X = df.iloc[:,1:13] ---- this consists of the feature values y = df.iloc[:,-1] ------ this consists of the…

python scikit-learn train-test-split

asked Jul 01 '18 at 17:58

Nakul Sharma

2 answers

Process for oversampling data for imbalanced binary classification

I have about a 30%/70% split between class 0 (minority class) and class 1 (majority class). Since I do not have a lot of data, I am planning to oversample the minority class to balance out the classes to become a 50-50 split. I was wondering if…

machine-learning scikit-learn classification train-test-split imbalanced-data

asked Jun 27 '18 at 13:46

Jane Sully

1 answer

Python: ValueError too many values to unpack (expected 2)

I am trying to find the best xgboost model through GridSearchCV, and for cross-validation I want to use April target data. Here is the code: x_train.head() x_train y_train.head() y_train from sklearn.model_selection import…

python machine-learning cross-validation grid-search train-test-split

asked Apr 07 '18 at 16:02

Nikita Okorokov
