Questions tagged [train-test-split]

Questions with this tag are about how to split a machine learning data set into random train and test subsets.

In particular, questions with this tag often ask how to split data with scikit-learn. In scikit-learn, a random split into training and test sets can be computed quickly with the train_test_split helper function.

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
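
As a minimal sketch of the helper described above, the following splits a small toy array (the data and the 80/20 ratio are assumptions chosen for illustration) into train and test subsets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumed for illustration): 10 samples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 20% of the samples; random_state makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

Passing the same random_state on every run yields the same partition, which helps when comparing models.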

428 questions
4 votes, 1 answer

Difference between doing cross-validation and validation_data/validation_split in Keras

First, I split the dataset into train and test, for example: X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=999) I then use GridSearchCV with cross-validation to find the best performing…
Long
  • 1,482
  • 21
  • 33
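
A rough sketch of the workflow this question describes: hold out a test set first, run GridSearchCV with cross-validation on the training part only, then score once on the held-out data. The SVC estimator and the parameter grid are assumptions for illustration; the excerpt does not say which model is used.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=999
)

# Hypothetical grid; cross-validation only ever sees the training split.
param_grid = {"C": [0.1, 1, 10]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

# The test split is touched once, after the search has finished.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, test_accuracy)
```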
4 votes, 2 answers

How can I split data into 3 or more parts with sklearn?

I want to split data into train, test and validation datasets that are stratified, but sklearn only provides cross_validation.train_test_split, which can only divide the data into 2 pieces. What should I do if I want to do this?
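
One common approach, sketched here under the assumption of a 60/20/20 split on toy data, is to call train_test_split twice with the stratify argument. Note that in current scikit-learn releases the function lives in sklearn.model_selection rather than cross_validation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced data (assumed): 80 samples of class 0, 20 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# Step 1: carve off 20% as the test set, preserving class proportions.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Step 2: 25% of the remaining 80% gives a validation set of 20% overall.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```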
4 votes, 1 answer

How to split a dataset to train/test where some rows are dependent?

I have a data set of subjects and each of them has a number of rows in my pandas dataframe (each measurement is a row and a subject could measure a few times). I would like to split my data into training and test set but I cannot split randomly…
AR_
  • 468
  • 6
  • 18
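
A possible way to keep all rows of a subject on the same side of the split is scikit-learn's GroupShuffleSplit. The sketch below uses a made-up dataframe; the column names subject, measurement and label are hypothetical and not taken from the question.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: several measurement rows per subject.
df = pd.DataFrame({
    "subject": ["s1", "s1", "s2", "s2", "s2", "s3", "s4", "s4", "s5", "s5"],
    "measurement": [0.1, 0.3, 0.2, 0.5, 0.4, 0.9, 0.7, 0.6, 0.8, 0.2],
    "label": [0, 0, 1, 1, 1, 0, 1, 1, 0, 0],
})

# test_size refers to the share of groups (subjects), not of rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["subject"]))

train_df = df.iloc[train_idx]
test_df = df.iloc[test_idx]
print(sorted(test_df["subject"].unique()))
```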
3 votes, 1 answer

Stratified train-test splitting a Tensorflow dataset

I am currently working with quite a large image dataset and I loaded it using ImageDataGenerator from tensorflow.keras in Python. As the classification of my data is very imbalanced, I wanted to do a stratified train-test split to possibly achieve a…
3 votes, 1 answer

How to ensure all samples from a specific group are kept together in train/test in sklearn cross_val_predict?

I have a dataframe where each sample belongs to a group. For example:

df =
a  b  c  group
1  1  2  G1
1  6  1  G1
8  2  8  G3
2  8  7  G2
1  9  2  G2
1  7  2  G3
4  0  2  G4
1  5  1  G4
6  7  8  G5
3  3  7  G6
1  2  2  …
Cranjis
  • 1,590
  • 8
  • 31
  • 64
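
A minimal sketch of one way to do this: pass a GroupKFold splitter and the groups array to cross_val_predict. The random features, labels and the LogisticRegression estimator are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_predict

# Toy data (assumed): 12 samples, 3 features, 6 groups of 2 samples each.
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))
y = np.array([0, 1] * 6)
groups = np.array(["G1", "G1", "G2", "G2", "G3", "G3",
                   "G4", "G4", "G5", "G5", "G6", "G6"])

# GroupKFold never places the same group in both train and test of a fold.
cv = GroupKFold(n_splits=3)
predictions = cross_val_predict(
    LogisticRegression(), X, y, groups=groups, cv=cv
)

# Check the property directly on the generated folds.
for train_idx, test_idx in cv.split(X, y, groups):
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```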
3 votes, 1 answer

Train Test Split sklearn based on group variable

My X is as follows: EDIT1:

Unique ID  Exp start date  Value  Status
001        01/01/2020      4000   Closed
001        12/01/2019      4000   Archived
002        01/01/2020      5000   Closed
002        12/01/2019      …
Zee
  • 81
  • 1
  • 8
3 votes, 2 answers

How to split images into test and train set using my own data in TensorFlow

I am a little confused here... I just spent the last hour reading about how to split my dataset into test/train in TensorFlow. I was following this tutorial to import my images: https://www.tensorflow.org/tutorials/load_data/images. Apparently one…
Guillermina
  • 3,127
  • 3
  • 15
  • 24
3 votes, 2 answers

Is it a flaw that Optuna examples return the evaluation metric of the test set?

I am using Optuna for parameter optimization for some models. In almost all the examples the objective function returns an evaluation metric on the TEST set, and tries to minimize/maximize this. I feel like this is a flaw in the examples since…
brian
  • 31
  • 2
3 votes, 0 answers

sklearn train_test_split dies and shuts down python kernel

I am struggling with using the train_test_split function from scikit-learn with 3D NumPy arrays. I have a feature array with shape (1860000, 144, 12) and a label array with shape (1860000,). In a different case train_test_split works well. But when…
MadPhil
  • 31
  • 2
3 votes, 1 answer

How to split a dataset into a train and test dataset using hashcode method

I am following the code of Hands-On Machine Learning with Scikit-Learn and TensorFlow, 2nd edition. In the section on creating the train and test datasets, they followed this procedure: from zlib…
I. A
  • 2,252
  • 26
  • 65
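
The idea behind a hash-based split can be sketched as follows: hash a stable identifier of each row and send the row to the test set when the hash falls below a threshold. This is an illustrative version built on zlib.crc32, not necessarily the book's exact code; the function names and the id column are assumptions.

```python
from zlib import crc32

import numpy as np
import pandas as pd


def is_in_test_set(identifier, test_ratio):
    # CRC32 of the 8-byte identifier, masked to an unsigned 32-bit value.
    checksum = crc32(np.int64(identifier).tobytes()) & 0xFFFFFFFF
    return checksum < test_ratio * 2**32


def split_by_id(data, test_ratio, id_column):
    # A row's assignment depends only on its id, not on the other rows.
    in_test = data[id_column].apply(lambda id_: is_in_test_set(id_, test_ratio))
    return data.loc[~in_test], data.loc[in_test]


# Hypothetical data with a stable integer identifier per row.
df = pd.DataFrame({"id": range(1000), "value": np.arange(1000) * 0.5})
train_df, test_df = split_by_id(df, test_ratio=0.2, id_column="id")
print(len(train_df), len(test_df))
```

Because the assignment is a pure function of the identifier, rows that were in the test set stay there when new rows are appended to the data later.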
3 votes, 1 answer

Scikit-Learn GroupShuffleSplit is not grouping by specified groups

I am trying to split a timeseries of farm data taken at a daily frequency for 8 years. I want to split the data so that the train and test sets each contain samples from different farms, and there is no overlap of farms between the train and test…
Wonton
  • 339
  • 1
  • 2
  • 10
3 votes, 2 answers

Why is the accuracy of the GridSearchCV method lower than the standard method?

I use train_test_split (random_state = 0) and a decision tree without any parameter tuning to model my data. I run it about 50 times to achieve the best accuracy. import pandas as pd import numpy as np from sklearn import tree from sklearn.tree…
3 votes, 2 answers

train_test_split not splitting data

There is a dataframe that consists of 14 columns in total, the last column is the target label with integer values = 0 or 1. I have defined: X = df.iloc[:,1:13] ---- this consists of the feature values y = df.iloc[:,-1] ------ this consists of the…
Nakul Sharma
  • 143
  • 2
  • 9
3 votes, 2 answers

Process for oversampling data for imbalanced binary classification

I have about a 30%/70% split between class 0 (minority class) and class 1 (majority class). Since I do not have a lot of data, I am planning to oversample the minority class to balance out the classes to become a 50-50 split. I was wondering if…
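
A commonly recommended order, assumed here since the excerpt is cut off before the actual question, is to split first and oversample only the training portion, so that duplicated minority rows cannot leak into the test set. The sketch uses sklearn.utils.resample on toy data with the same 30/70 ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy data (assumed): 30 samples of class 0, 70 samples of class 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([0] * 30 + [1] * 70)

# Split first; the test set keeps the original class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Oversample the minority class with replacement, on the training set only.
minority = y_train == 0
n_majority = int((~minority).sum())
X_min_up, y_min_up = resample(
    X_train[minority], y_train[minority],
    replace=True, n_samples=n_majority, random_state=0,
)

X_train_bal = np.vstack([X_train[~minority], X_min_up])
y_train_bal = np.concatenate([y_train[~minority], y_min_up])
print(np.bincount(y_train_bal), np.bincount(y_test))
```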
3 votes, 1 answer

Python: ValueError too many values to unpack (expected 2)

I am trying to find the best XGBoost model through GridSearchCV, and for cross-validation I want to use April target data. Here is the code: x_train.head() x_train y_train.head() y_train from sklearn.model_selection import…
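
The excerpt is truncated, so the cause of the error is not known here. As background, the cv argument of GridSearchCV accepts a splitter object or an iterable of (train indices, test indices) pairs. One way to validate on a fixed block of rows is PredefinedSplit, sketched below with toy data; treating the last 15 rows as the "April" block and using GradientBoostingClassifier in place of XGBoost are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Toy data (assumed): the last 15 rows stand in for the April validation block.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = (X[:, 0] > 0).astype(int)

# -1 means "always in training"; 0 marks the single validation fold.
test_fold = np.array([-1] * 45 + [0] * 15)
cv = PredefinedSplit(test_fold)

search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=20, random_state=0),
    {"max_depth": [2, 3]},  # hypothetical grid
    cv=cv,
)
search.fit(X, y)
print(search.best_params_)
```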