Questions tagged [train-test-split]

Questions with this tag are about how to split a machine learning data set into random train and test subsets.

In particular, questions with this tag often ask how to split data with scikit-learn. In scikit-learn, a random split into training and test sets can be computed quickly with the train_test_split helper function.

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
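
As a minimal sketch of the helper described above, the following splits a small toy data set. The array contents, the 20% test fraction, and the seed are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 samples, 2 features, binary labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 20% of the rows; random_state makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

Rows of X and y are shuffled together, so each label stays paired with its features.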

428 questions
2
votes
4 answers

Split data into training and testing not randomly

I want to split my dataset into two parts, 75% for training and 25% for testing. There are two classes. I also have another dataset that has only one instance of one class; all the other instances belong to the second class. So I don't want to split randomly.…
2
votes
4 answers

Split into training and testing set in R?

How can I translate the following Python code into R? X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) Splitting into training and testing set…
Keshav Maheshwari
  • 85
  • 1
  • 2
  • 12
2
votes
2 answers

Why does my model work OK with test data from train_test_split but not with new data?

I am new to machine learning. I have a continuous dataset. I am trying to model the target label using several features. I utilize the train_test_split function to separate the train and the test data. I am training and testing the model using the…
2
votes
0 answers

Duplicating pandas.get_dummies columns from train to test data

I have two dataframes, train and test. They both have the same exact column names which contain categorical string features. I'm trying to map these features to dummy variables in the training set, train a regression model, then do the same exact…
Austin
  • 6,921
  • 12
  • 73
  • 138
1
vote
1 answer

How to combine X_train and y_train into one balanced dataframe in Python?

I would highly appreciate your advice with this: I have an imbalanced dataset: y has only 2% of 1. I want to balance only the train dataset and afterwards perform feature selection on the balanced train dataset prior to the model. After performing…
Ella
  • 13
  • 4
1
vote
1 answer

"Found input variables with inconsistent numbers of samples" Have I done something wrong during the train_test_split?

I am trying to fit a logistic regression model and run some tests, but I keep getting this error. Not really sure what I have done differently to everyone else from sklearn import preprocessing X = df.iloc[:,:len(df.columns)-1] y =…
1
vote
1 answer

Can someone help explain why my MLP keeps on getting a perfect classification report?

I am using Sklearn.train_test_split and sklearn.MLPClassifier for human activity recognition. Below is my dataset in a pandas df:

   a_x       a_y       a_z        g_x       g_y        g_z        activity
0  3.058150  5.524902  -7.415221  0.001280  -0.022299  -0.009420  sit
1  …
JP1990
  • 35
  • 5
1
vote
0 answers

How is train test split in xgboost cv specified?

It is to be noted that the xgboost.cv method returns eval metrics on both train and test sets whereas the function itself takes no parameter stating which dataset to be used for training and which for testing. The xgboost.cv method takes only dtrain…
wasif
  • 65
  • 4
  • 12
1
vote
2 answers

Split rows in train test based on user id PySpark

I have a PySpark dataframe containing multiple rows for each user:

userId  action  time
1       buy     8 AM
1       buy     9 AM
1       sell    2 PM
1       sell    3 PM
2       sell    10 AM
2       buy     11 AM
2       sell    2 PM
2       sell    3 PM

My goal is to split this dataset into a…
mht
  • 381
  • 1
  • 2
  • 12
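
The idea of keeping all rows of one user on the same side of the split can be sketched as follows. This is not PySpark code: it swaps in scikit-learn's GroupShuffleSplit on in-memory NumPy arrays, reusing the user IDs and actions from the excerpt above. The 50% test fraction and the seed are assumptions.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Rows from the excerpt: each user has several actions.
user_id = np.array([1, 1, 1, 1, 2, 2, 2, 2])
action = np.array(["buy", "buy", "sell", "sell", "sell", "buy", "sell", "sell"])

# Split by group, so that no user appears in both train and test.
# test_size applies to the number of groups (users), not rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
train_idx, test_idx = next(splitter.split(action, groups=user_id))

train_users = set(user_id[train_idx].tolist())
test_users = set(user_id[test_idx].tolist())
print(train_users, test_users)
```

In Spark itself, a comparable approach would be to split the distinct user IDs first and then join back to the rows, but that variant is not shown here.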
1
vote
0 answers

k-fold implementation with train test split

I am trying to add k-fold to my code as overfitting is an issue. Previously I split my data into train and test sets, but I am getting confused about where and how to apply k-fold as my data is already split. x_norm = preprocessing.normalize(x,…
1
vote
1 answer

Reshape your data either using array.reshape(-1, 1) during model.predict()?

I'm trying to run a number of classification models, but all of them keep throwing the reshape error. I think it has to do with the calculation of model.score or model.predict, but I've tried running some reshape commands (on X_valid and Y_valid)…
Brian
  • 107
  • 8
1
vote
1 answer

How to use an explicit validation set with a predefined split fold?

I have explicit train, test and validation sets as 2d arrays: X_train.shape (1400, 38785) X_val.shape (200, 38785) X_test.shape (400, 38785) I am tuning the alpha parameter and need advice about how I can use the predefined validation set in…
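
One way a fixed validation set might be passed to a grid search is scikit-learn's PredefinedSplit, sketched below. The tiny random arrays stand in for the real data, and Ridge with its alpha grid is an assumption, since the excerpt does not name the model being tuned.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Hypothetical stand-ins for the explicit train and validation sets.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(14, 3)), rng.normal(size=14)
X_val, y_val = rng.normal(size=(4, 3)), rng.normal(size=4)

# Stack both sets; -1 marks rows that are only ever used for training,
# 0 marks rows that form the single validation fold.
X_all = np.vstack([X_train, X_val])
y_all = np.concatenate([y_train, y_val])
test_fold = np.concatenate([np.full(len(X_train), -1), np.zeros(len(X_val), dtype=int)])
ps = PredefinedSplit(test_fold)

# Each alpha is fitted on the training rows and scored on the validation rows.
search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=ps)
search.fit(X_all, y_all)
print(search.best_params_)
```

Note that with the default refit=True, the best model is refitted on X_all, which includes the validation rows. The test set stays outside this procedure entirely.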
1
vote
0 answers

Cannot fit a Model after Performing Stratified K-Fold Split

I am new to the concept of using K-folds to split into train and test data, which I am practicing with the dataset below. Context: The Dataset is the Kaggle UrbanSound8k set available at https://www.kaggle.com/datasets/chrisfilo/urbansound8k I am…
1
vote
1 answer

Data Cardinality keras odd number of images- train test split

My autoencoder shows a "ValueError: Data cardinality is ambiguous: x sizes: 14 y sizes: 31 Make sure all arrays contain the same number of samples." split_size_i = int(images.shape[0]*0.7) split_size =…
1
vote
0 answers

5-fold cross validation from sklearn with train, val, and test sets and ratio of 60/20/20

I am able to create train, validation, and test sets for one-fold experiments using sklearn like below, with train, val and test having a ratio of 60/20/20: x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4,…
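
A single 60/20/20 split of the kind described above can be sketched with two calls to train_test_split. This is not the asker's code, which is truncated in the excerpt: the toy arrays, the seed, and the second call are assumptions, and the sketch does not cover the 5-fold part of the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 samples, 2 features.
x = np.arange(200).reshape(100, 2)
y = np.arange(100)

# Step 1: keep 60% for training, set aside 40%.
x_train, x_rest, y_train, y_rest = train_test_split(
    x, y, test_size=0.4, random_state=0
)

# Step 2: halve the remaining 40% into validation and test (20% each).
x_val, x_test, y_val, y_test = train_test_split(
    x_rest, y_rest, test_size=0.5, random_state=0
)

print(len(x_train), len(x_val), len(x_test))
```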