Questions tagged [train-test-split]

Questions with this tag are about how to split a machine learning data set into random train and test subsets.

In particular, questions with this tag can aim at better understanding how to split data with scikit-learn. In scikit-learn, a random split into training and test sets can be computed quickly with the train_test_split helper function.

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
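
As an illustration of the train_test_split helper mentioned above, here is a minimal sketch. The toy arrays, the 80/20 ratio, and the random_state value are assumptions chosen for the example, not part of the tag description.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, made up for illustration: 10 samples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 20% of the samples as the test set.
# random_state fixes the shuffle so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

The function shuffles by default, so rows of X and entries of y stay paired but appear in a random order. For ordered data such as time series, passing shuffle=False keeps the original order instead.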

428 questions
0
votes
2 answers

Do you have to clean your test data before feeding into an NLP model?

This is a natural language processing related question. Suppose I have a labelled train set and an unlabelled test set. After I have cleaned my train data (stopwords, stemming, punctuation, etc.), I use this cleaned data to build my model. When fitting it on my…
0
votes
1 answer

Split dataset containing multiple labels

I have a dataset with multiple labels, i.e. for each X I have 2 y, and I need to split it into train and test sets. I tried with the sklearn function train_test_split(): import numpy as np from sklearn.model_selection import train_test_split X =…
tobor
  • 3
  • 1
0
votes
1 answer

Split values in a column into "is a date" or "NaT"

I would like to find values in a column (clear_date) that do not correspond to a valid date. The date is formatted as '%Y/%m/%d'. I've tried the following piece of code, but the resulting variable doesn't have any rows! x_test =…
0
votes
1 answer

Why are results inaccurate when I use a different dataset for testing a model in machine learning?

I am trying to do forecasting based on time series. I am doing temperature forecasting using the past three years of hourly data. Instead of using X_test from the train_test_split method, I am using my own test dataset because I need seven-day ahead…
0
votes
1 answer

Stratified train/test-split with guaranteed inclusion of small classes on strongly imbalanced datasets

I am working with large-scale, imbalanced datasets where I need to pick a stratified training set. However, even if the dataset is strongly imbalanced, I still need to ensure that every label class is included at least once in the training…
Andreas
  • 736
  • 6
  • 15
0
votes
0 answers

Split dataset into train and test for an LDA model

I have a dataset that contains about 17000 items of user data scraped from Twitter, and I am working with the latent Dirichlet allocation algorithm. I want to split my dataset but I am not sure what the best way is. What are the criteria to split a dataset…
0
votes
0 answers

Patsy Dmatrices X, y split

I am using patsy.dmatrices to split my data into y, x and I am losing observations. Ex: formula = 'target ~ v1 + v2 + v3' y, x = patsy.dmatrices(formula, df, return_type = 'dataframe') My df.shape is ~ 54,000,000 in length; however, following the x/y split, my…
0
votes
1 answer

kernel gets stuck if I train/test split by 55% and 45%

I am trying to train a neural net on a dataset. Everything works. There is no issue with the code if I specify 70% or 50% of the data as training and the rest as testing. But when I specify 55% and 45% for training and testing, the kernel…
Saad Zaheer
  • 171
  • 7
0
votes
1 answer

How to install sensplit on Google Colab?

How do I install sensplit on Google Colab? I already cloned the git repository on Google Colab but I couldn't use the sensplit package; when I run !pip install sensplit, it returns errors. Please, I need a hint. Thanks in advance
0
votes
1 answer

Why does the line not cut across the data?

I am using a linear regression model to predict my data. Orig Data When I use sns plot, I am able to see the line cut through all the data points. Using seaborn.lmplot But when I use the train_test_split function: The coeff & interc are as below: Weight = …
Tep66
  • 9
  • 1
0
votes
1 answer

Split train/test based on comparison operators

I'm trying to figure out how to split the data based on these conditions in order to run a CNN on this: Split the training/testing dataset into two sets: one with class labels < 5 and one with class labels >= 5. Print out the shapes of the resulting…
0
votes
1 answer

I keep on getting the error name 'y_test' is not defined

I really need your help! I've written this code: from sklearn.model_selection import train_test_split from sklearn import metrics from sklearn.metrics import accuracy_score def train_test_rmse(x,y): X = df_new[feature_cols] y =…
0
votes
1 answer

Can we tune any of the parameters on testing data, including any parameters learned by preprocessing?

I want to normalize the data using the StandardScaler function, but I have doubts about how this should be done. One way to do this is as follows: scaler = StandardScaler().fit(X) X = scaler.transform(X) X_train, X_test, y_train,…
0
votes
1 answer

Python - Predicting test data that is smaller than train data

I have preprocessed some data ready to train a Multinomial Naive Bayes classifier. The train data is 80% of my data and the test data is 20%. The train data is an array of size 8452 and the test data is an array of size 4231. If I want to see…
apol96
  • 200
  • 12
0
votes
1 answer

Split dataset into train and test using TensorFlow

I want to split my full dataset (every raw data has multiple features) into train and test sets. Rather than using scikit-learn's train-test-split, is there any other proper way to split my data? I also need to shuffle my data when…
Dale Steyn
  • 51
  • 1
  • 5