Questions tagged [train-test-split]

Questions with this tag are about how to split a machine learning data set into random train and test subsets.

In particular, questions with this tag can aim at a better understanding of how to split data with scikit-learn. In scikit-learn, a random split into training and test sets can be computed quickly with the train_test_split helper function.

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
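
As a rough illustration of the helper described above, the following minimal sketch splits a small array into train and test subsets. The toy data, the 80/20 ratio, the fixed random_state and the use of stratify are assumptions chosen for the example, not something prescribed by the tag description.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 samples, 2 features, balanced binary labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 20% of the rows for testing.
# random_state fixes the shuffle so the split is reproducible;
# stratify=y keeps the class proportions the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

X and y must have the same number of rows; otherwise train_test_split raises a ValueError about inconsistent numbers of samples.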

428 questions

votes

0 answers

I think using train_test_split to sample a large data set and then using cross_validation on the sample may be wrong. Agree?

I have been trying to solve the DAT102x: Predicting Mortgage Approvals From Government Data challenge for a couple of months. My goal is to understand the pieces of a classification problem, not to rank at the top. However, I found something that is not clear to…

asked Jun 12 '19 at 08:50

CRAZYDATA

votes

0 answers

ValueError: Found input variables with inconsistent numbers of samples: [25707, 25000]

I get the error below when trying to run the code below. I am following a tutorial based on this page: https://towardsdatascience.com/sentiment-analysis-with-python-part-1-5ce197074184 File "reviewsML.py", line 58, in X_train,…

machine-learning scikit-learn python-3.6 linear-regression train-test-split

asked Jun 03 '19 at 14:32

kely789456123

votes

1 answer

What should be passed as the input parameter when using the train_test_split function twice in Python 3.6

Basically, I wanted to split my dataset into training, testing and validation sets. I have therefore used the train_test_split function twice. I have a dataset of around 10 million rows. On the first split I have split the training and testing dataset into…

python machine-learning classification train-test-split

asked May 12 '19 at 13:00

Logica

votes

1 answer

Steps to perform correct data analysis

I have a dataset with 69 columns and 50000 rows. My dataset only contains binary variables and numerical variables. Moreover, some of the binary variables have some missing values (about 5%). I know I should divide the dataset into…

r missing-data imputation train-test-split

asked Apr 21 '19 at 11:36

user11390769

votes

1 answer

Bunch object not callable - scikit-learn rcv1 dataset

I want to split the built-in RCV1 dataset into train and test sets and apply the k-means algorithm. However, while trying to split the data, an error is shown saying the Bunch object is not callable: from sklearn.datasets import fetch_rcv1 rcv1 =…

scikit-learn dataset train-test-split

asked Mar 24 '19 at 04:59

user11040323

votes

1 answer

For evaluating a day-ahead prediction model, do I do an 80:20 train/test split or a (rest of the days : last day) split?

I have time series data for 3 months, in 15-minute intervals (one day has 96 time slots). I have a temperature column [Temp] and a solar irradiance (sun intensity) column [SI]. My model has to predict temperature on a 'day-ahead' basis for the entire…

machine-learning time-series train-test-split

asked Feb 02 '19 at 07:57

Rex

votes

1 answer

Why would using a database (Redis, SQL) help when loading big data and RAM is running out?

I need to take 100 000 images from a directory, put them all in one big dictionary where the keys are the ids of the pictures and the values are the numpy arrays of the pixels of the images. Creating this dict takes 19 GB of my RAM and I have 24GB…

python ram train-test-split

asked Jan 04 '19 at 09:42

mitevva_t

votes

1 answer

Data split into train, validation and test sets in subject-independent 10-fold cross validation?

I am working on emotion analysis. Recent papers in this area perform subject-independent k-fold cross validation. But I have not seen any paper which uses a validation set. They only mention a train set and a test set. For example, in 10 cross validation,…

dataset cross-validation hyperparameters train-test-split

asked Dec 18 '18 at 02:47

manv

votes

2 answers

Scikit-learn train_test_split inside multiprocessing pool on Linux (armv7l) does not work

I am experiencing some weird behaviour using train_test_split inside a multiprocessing pool when running Python on the Raspberry Pi 3. I have something like this: def evaluate_Classifier(model,Features,Labels,split_ratio): X_train, X_val,…

python scikit-learn raspberry-pi multiprocessing train-test-split

asked Sep 12 '18 at 12:32

vzografos

votes

0 answers

Train/validation split with a predefined mixture of the target variable

I want to be able to make train/validation splits with a user-defined mixture of the target variable. StratifiedKFold and StratifiedShuffleSplit from sklearn keep the mixture from the original sample. But on Kaggle or in real life we often have a…

python scikit-learn cross-validation train-test-split

asked Sep 04 '18 at 05:56

Mischa Lisovyi

votes

1 answer

train_test_split not removing y train and test variables after index slicing

I've used train_test_split() numerous times with index slicing, but for some reason it's retaining the predictor values for both y train and test sets. Below is example data, along with train/test slicing and shapes. Original data example:…

python pandas one-hot-encoding train-test-split

asked Aug 22 '18 at 15:57

Mr. Jibz

votes

1 answer

Confused about the use of validation set here

For the main.py of the px2graph project, the part of training and validation is shown as below: splits = [s for s in ['train', 'valid'] if opt.iters[s] > 0] start_round = opt.last_round - opt.num_rounds # Main training loop for round_idx in…

validation tensorflow train-test-split

asked Aug 15 '18 at 21:07

Panfeng Li

votes

1 answer

Creating test & train set while keeping certain items together in one set

I have a dataset consisting of around 500 different paragraphs. For each paragraph, I am trying to see whether there is a link to any of the other paragraphs. Based on this I've created paragraph pairs. I previously tried to approach this problem as…

python pandas scikit-learn train-test-split

asked May 25 '18 at 12:16

Mia

votes

0 answers

Train and test data split in R but not randomly

I want to split data into training and testing sets, but not randomly. I want the first 80% of rows to be treated as training and the rest as testing. rows=nrow(data) index=0.80*row train=data[1:index] Can anybody help?

r train-test-split

asked May 07 '18 at 19:46

user15051990

votes

2 answers

Python: how to delete specific rows in testing data by index after train/test split

I want to delete every row in X_test and y_test where MFD is bigger than one. The problem is that I always get the randomly mixed indices from the train/test split. If I try to drop them I get the following error message: IndexError: index 3779 is out of…

python scikit-learn delete-row indices train-test-split

asked Apr 17 '18 at 13:27

Florian Notar
