Questions tagged [train-test-split]

Questions with this tag are about how to split a machine learning data set into random train and test subsets.

In particular, questions with this tag may aim to better understand how to split data using scikit-learn. In scikit-learn, a random split into training and test sets can be quickly computed with the train_test_split helper function.

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
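
As a rough illustration of the helper mentioned above, here is a minimal sketch. The toy arrays, the 80/20 ratio, the seed and the use of stratify are assumptions chosen for the example, not something prescribed by the tag description.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 samples, 2 features, balanced binary labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 20% of the rows. random_state makes the shuffle reproducible,
# and stratify=y keeps the class proportions similar in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

Calling the function a second time on X_train and y_train is one common way to carve out a validation set as well.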

428 questions
0 votes, 0 answers

I think using train_test_split to sample a large data set and then using cross-validation on the sample may be wrong. Agree?

I have been trying to solve DAT102x: Predicting Mortgage Approvals From Government Data for a couple of months. My goal is to understand the pieces of a classification problem, not to rank at the top. However, I found something that is not clear to…
0 votes, 0 answers

ValueError: Found input variables with inconsistent numbers of samples: [25707, 25000]

I get the error below when trying to run the code below. I am following a tutorial based on this page: https://towardsdatascience.com/sentiment-analysis-with-python-part-1-5ce197074184 File "reviewsML.py", line 58, in X_train,…
0 votes, 1 answer

What should be passed as the input parameter when using the train_test_split function twice in Python 3.6?

Basically, I wanted to split my dataset into training, testing and validation sets. I therefore used the train_test_split function twice. I have a dataset of around 10 million rows. On the first split I have split the training and testing dataset into…
Logica
0 votes, 1 answer

Steps to perform correct data analysis

I have a dataset with 69 columns and 50000 rows. My dataset only contains binary variables and numerical variables. Moreover, some of the binary variables have some missing values (about 5%). I know I should divide the dataset into…
0 votes, 1 answer

Bunch object not callable - scikit-learn rcv1 dataset

I want to split the built-in RCV1 dataset into train and test sets and apply the k-means algorithm. However, while trying to split the data, an error is shown saying "Bunch object not callable": from sklearn.datasets import fetch_rcv1 rcv1 =…
user11040323
0 votes, 1 answer

Evaluating a day-ahead prediction model: for my train/test split, do I do an 80:20 split or a (rest of the days : last day) split?

I have time series data for 3 months, in 15-minute intervals (one day has 96 time slots). I have a temperature column [Temp] and a solar irradiance (sun intensity) column [SI]. My model has to predict temperature on a 'day-ahead' basis for the entire…
Rex
0 votes, 1 answer

Why would using a database (Redis, SQL) help when loading big data and running out of RAM?

I need to take 100 000 images from a directory, put them all in one big dictionary where the keys are the ids of the pictures and the values are the numpy arrays of the pixels of the images. Creating this dict takes 19 GB of my RAM and I have 24GB…
mitevva_t
0 votes, 1 answer

Data split in train, validation and test in subject independent 10-fold cross validation?

I am working on emotion analysis. Recent papers in this area perform subject-independent k-fold cross validation. But I have not seen any paper which uses a validation set. They only mention a train set and a test set. For example, in 10-fold cross validation,…
manv
0 votes, 2 answers

Scikit-learn train_test_split inside multiprocessing pool on Linux (armv7l) does not work

I am experiencing some weird behaviour using train_test_split inside a multiprocessing pool when running Python on the Raspberry Pi 3. I have something like this: def evaluate_Classifier(model,Features,Labels,split_ratio): X_train, X_val,…
0 votes, 0 answers

Train/validation split with a predefined mixture of the target variable

I want to be able to make train/validation splits with a user-defined mixture of the target variable. StratifiedKFold and StratifiedShuffleSplit from sklearn keep the mixture from the original sample. But on Kaggle or in real life we often have a…
0 votes, 1 answer

train_test_split not removing y train and test variables after index slicing

I've used train_test_split() numerous times with index slicing, but for some reason it's retaining the predictor values for both y train and test sets. Below is example data, along with train/test slicing and shapes. Original data example:…
Mr. Jibz
0 votes, 1 answer

Confused about the use of validation set here

For the main.py of the px2graph project, the part of training and validation is shown as below: splits = [s for s in ['train', 'valid'] if opt.iters[s] > 0] start_round = opt.last_round - opt.num_rounds # Main training loop for round_idx in…
Panfeng Li
0 votes, 1 answer

Creating test & train set while keeping certain items together in one set

I have a dataset consisting of around 500 different paragraphs. For each paragraph, I am trying to see whether there is a link to any of the other paragraphs. Based on this I've created paragraph pairs. I previously tried to approach this problem as…
Mia
0 votes, 0 answers

Train and test data split in R, but not randomly

I want to split data into training and testing sets, but not randomly. I want the first 80% of rows to be treated as training and the rest as testing. rows=nrow(data) index=0.80*row train=data[1:index] Can anybody help?
user15051990
0 votes, 2 answers

Python: how to delete specific rows in testing data by index after a train/test split

I want to delete every row in X_test and in y_test where MFD is bigger than one. The problem is that I always get the randomly shuffled indices from the train/test split. If I try to drop them, I get the following error message: IndexError: index 3779 is out of…