Questions tagged [train-test-split]

Questions with this tag are about how to split the machine learning data set into random train and test subsets.

In particular, questions with this tag may ask how to split data using scikit-learn. In scikit-learn, a random split into training and test sets can be computed quickly with the train_test_split helper function.

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
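
As a rough illustration of the helper mentioned above, here is a minimal sketch of a random split. The toy arrays X and y are made up for this example, and the choices of test_size, random_state, and stratify are assumptions for illustration, not recommended defaults.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, made up for illustration: 10 samples, 2 features, binary labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# Hold out 20% of the samples as the test set.
# random_state makes the shuffle reproducible; stratify=y keeps the
# class proportions the same in the train and test subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)
```

Passing shuffle=False instead (without stratify) would keep the original row order, which is often what questions about time-ordered data are after.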

428 questions
0
votes
1 answer

Small Dataset, Train Test Split or Train Val and Test?

I did some forecasting (stock) for my thesis. I only used a fixed amount of 600 samples (can't change that). Because of the small dataset I only did a train and test split (no validation etc.). I found some settings where I get very good results (MAPE…
A M
0
votes
1 answer

Keras CNN-LSTM : Error while making y_train

This is my first time asking a question here (that means I really need help), and sorry for my bad English. I want to make a CNN-LSTM layer for video classification in Keras, but I have a problem making my y_train. I will describe my problem…
0
votes
1 answer

Training and test datasets given as 4 different datasets

I'm a newbie to Python and would very much appreciate some assistance. It's about logistic regression (machine learning). I have no problem up until training the algorithm. The data sets are as follows: The cost_train dataframe contains the target…
Jay
0
votes
1 answer

Does cross validation + early stopping show the actual performance for a small sample?

I'm running xgboost on some simulation, where my sample size is 125. I was measuring the 5-fold cross validation error, i.e., in each round my training sample size is 100 and testing sample size is 25. Assuming all other parameters are fixed but the…
user3029790
0
votes
1 answer

Train / Val / Test split time LSTM

I have a data set covering several months (from JAN-15 to SEPT-17), reporting a customer's financial situation for each month. My task is to predict the cumulative sales for each customer for the next 12 months. My dataset looks like this (this is the…
0
votes
2 answers

How to use GridSearchCV for tuning parameters with train_test_split strategy?

I am trying to fine-tune my sklearn models using a train_test_split strategy. I am aware of GridSearchCV's ability to perform parameter tuning; however, it is tied to a cross-validation strategy. I would like to use a train_test_split strategy for…
Alex Ramses
0
votes
1 answer

Should the same imputer coefficients be used for training and test datasets?

I am learning how to prepare data, build estimators and check using a train/test data split. My question is how I can prepare the test dataset correctly. I split my data into a test and a training set. And as "Hands on with machine learning with…
talkingtoaj
0
votes
1 answer

Why is the model accuracy different when splitting data with a different approach in LightGBM?

I am creating a LightGBM model for prediction using Python. Initially, I did the data split using sklearn.model_selection.train_test_split, which resulted in a lower mean absolute error (MAE). Later, I did the split in some other way by splitting the…
CSK
0
votes
3 answers

How do I split the data into the first 808698 rows of the train and the rest as a test?

I have two datasets, test and train. I gathered them in one CSV. I want to split my data into train and test, but it shouldn't be random. I need to use the first 808699 rows as the train set and the rest as the test set. I tried to read two different…
0
votes
1 answer

I am getting an error as 'ValueError: x and y must be the same size' when trying to plot a scatter plot

I am trying to perform linear regression on the Black Friday dataset. When I get to the model training part, I split my data set by defining the X and y values and then performing the train test split. And then I train my model using linear…
0
votes
3 answers

What is the correct procedure to split the data sets for a classification problem?

I am new to Machine Learning & Deep Learning. I would like to clarify my doubt related to train_test_split before training. I have a data set of size (302, 100, 5), where (207,100,5) belongs to class 0 and (95,100,5) belongs to class 1. I would like…
Mari
0
votes
1 answer

y_test values from train_test split output

I have done a train test split, and now I am trying to do a comparison, get the difference between predicted and actual values as a list, and send that to Excel. I am doing all this with a function as shown in the attached pic (the inbuilt functions are need…
moys
0
votes
1 answer

How to fix "The least populated class in y has only one member" in scikit-learn

I am creating a program using past datasets to predict an employee's salary for any job. I receive the error "Warning: The least populated class in y has only 1 members, which is too few. The minimum number of members in any class cannot be less than…
0
votes
2 answers

How to fix 'ValueError: Found input variables with inconsistent numbers of samples: [32979, 21602]'?

I am making a logistic regression model to do sentiment analysis. This is the problem: ValueError: Found input variables with inconsistent numbers of samples: [32979, 21602]. This occurs when I try to split my dataset into x and y train and valid…
0
votes
2 answers

How to implement a train test split without overlaps in Apache Beam?

I would like to train test split a list of texts with the associated entities so that no entities overlap between splits. Ensuring no overlaps is challenging; I currently achieve it with 2 groupby operations. I was wondering how I can mitigate the…
swartchris8