Questions tagged [train-test-split]

Questions with this tag are about how to split a machine learning dataset into random train and test subsets.

In particular, questions with this tag often aim at a better understanding of how to split data with scikit-learn. In scikit-learn, a random split into training and test sets can be computed quickly with the train_test_split helper function.

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
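As a minimal sketch of the helper described above (the toy arrays and the 25% test fraction are illustrative assumptions, not taken from any question on this page), a basic call might look like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumed for illustration): 20 samples with 2 features each.
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# Hold out 25% of the samples as the test set.
# random_state fixes the shuffle so the split is repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print(X_train.shape, X_test.shape)
```

The arrays are shuffled together, so each row of X_train still lines up with the corresponding entry of y_train.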

428 questions
2
votes
1 answer

ValueError: y should be a 1d array, got an array of shape (74216, 2) instead

I am trying to apply logistic regression models to text. I vectorized my data with TF-IDF: vectorizer = TfidfVectorizer(max_features=1500) x = vectorizer.fit_transform(df['text_column']) vectorizer_df = pd.DataFrame(x.toarray(),…
2
votes
0 answers

How to fix: 'ValueError: Found input variables with inconsistent numbers of samples'

For predicting house prices using linear regression, I am not able to train the model using model.fit() as it gives me an error. Here is my code: #importing dependencies import pandas as pd import numpy as np from sklearn.linear_model import…
2
votes
2 answers

Scaling row-wise with MinMaxScaler from Sklearn

By default, scalers from sklearn work column-wise. But I need my data to be scaled row-wise, so I did the following: from sklearn.preprocessing import MinMaxScaler from sklearn.model_selection import train_test_split import numpy as np # %%…
Murilo
  • 533
  • 3
  • 15
2
votes
1 answer

ValueError: too many values to unpack(expected 2) - train_test_split

I'm doing the train/test split before the feature extraction. However, when I try to loop through any set, whether train or test, I get the following error (ValueError: too many values to unpack(expected 2)) for cls in os.listdir(path): for sound…
Ran
  • 57
  • 6
2
votes
1 answer

sklearn train_test_split on pandas

I'm a relatively new user of sklearn and have a question about using train_test_split from sklearn.model_selection. I have a large dataframe with a shape of (96350, 156). My dataframe has a column named CountryName that contains 160 countries, each…
leskovecg
  • 83
  • 8
2
votes
0 answers

equivalent of sklearn's StratifiedGroupKFold for PySpark?

I have a dataframe for single-label binary classification with some class imbalance and I want to make a train-test split. Some observations are members of groups in the data that should only appear in either the test split or train split but not…
2
votes
1 answer

Stratified Cross Validation or Sampling for train-test split based on multiple features in python

sklearn's train_test_split, StratifiedShuffleSplit and StratifiedKFold all stratify based on class labels (y-variable or target_column). What if we want to sample based on feature columns (x-variables) and not on the target column? If it was just one…
2
votes
1 answer

How do I best make 80% train, 10% validation, and 10% test splits using train_test_split in Python?

How do I best make 80% train, 10% validation, and 10% test splits using train_test_split in Python? Is there a common way to visualize this split once created? from sklearn.model_selection import train_test_split # Splitting the data by a…
iceAtNight7
  • 194
  • 1
  • 2
  • 10
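One common approach to the 80/10/10 split asked about above, sketched here under assumptions (the toy data and variable names are hypothetical, and this is not necessarily what the answers to that question recommend), is to call train_test_split twice: first to set aside 20% of the data, then to halve that remainder into validation and test sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumed for illustration): 100 samples, 3 features.
X = np.arange(300).reshape(100, 3)
y = np.arange(100)

# Step 1: keep 80% for training, set aside 20% for validation + test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: split the remaining 20% in half, giving 10% validation and 10% test.
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

The second test_size is a fraction of the remainder, not of the original data, which is why 0.5 rather than 0.1 is used there.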
2
votes
2 answers

How to split duplicate samples to train test with no overlapping?

I have an NLP dataset (about 300K samples) that contains duplicate data. I want to split it into train and test sets (70%-30%) with no overlap between them. For instance: |dataset: | train | test | | a | a | …
Whisht
  • 681
  • 2
  • 6
  • 20
2
votes
1 answer

Difference between train_test_split and StratifiedShuffleSplit

I came across the following statement when trying to find the difference between train_test_split and StratifiedShuffleSplit: when stratify is not None, train_test_split uses StratifiedShuffleSplit internally. I was just wondering why the…
adiaux
  • 103
  • 8
2
votes
5 answers

YoloV4 Custom Dataset Train Test Split

I am trying to train a YOLO network with my custom dataset. I have some images (*.jpg) and the labels/annotations in the YOLO format as txt files. Now I want to split the data into a train and a validation set. As a result I want a train and a validation folder…
Basti
  • 45
  • 2
  • 8
2
votes
1 answer

train_test_split for multiple targets

I have a multi-objective problem. I have two targets ylo and yhi sharing the same features x: x = np.array([[0,1,2],[2,3,4]]) ylo = np.array([10,11]) yhi = np.array([12,13]) Is there a way to split the data to get x_train,…
Hud
  • 301
  • 1
  • 12
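For the multiple-target question above, one option worth noting is that train_test_split accepts any number of equal-length arrays and returns a train/test pair for each, in order. The sketch below reuses x, ylo and yhi from the excerpt; the test_size and random_state values are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Arrays taken from the question excerpt.
x = np.array([[0, 1, 2], [2, 3, 4]])
ylo = np.array([10, 11])
yhi = np.array([12, 13])

# Passing all three arrays in one call applies the same shuffle to each,
# so rows of x stay aligned with their ylo and yhi values.
(x_train, x_test,
 ylo_train, ylo_test,
 yhi_train, yhi_test) = train_test_split(
    x, ylo, yhi, test_size=0.5, random_state=0
)

print(x_train, ylo_train, yhi_train)
```

Stacking the targets into one two-column array and splitting that alongside x would be an alternative with the same effect.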
2
votes
2 answers

Train test split for ensuring all categories are included in train set

Let's say there are some 20 categorical columns in the data, each having a different set of unique categorical values. Now a train test split has to be done, and one needs to ensure that all unique categories are included in the train set. How can it…
Aroonima
  • 21
  • 1
  • 4
2
votes
2 answers

How to solve sklearn error: "Found input variables with inconsistent numbers of samples"?

I have a challenge using the sklearn 70-30 division. I receive an error on line: X_train, X_test, y_train, y_test = train_test_split(X_smote, y_smote, test_size=0.3, stratify=y) The error is: Found input variables with inconsistent numbers of…
Paip
  • 21
  • 1
  • 3
2
votes
2 answers

Split Train Test Data sets keeping like values together

I have a data set of animal types with IDs and I want to break said data set into Test/Train data sets. I also want to keep all IDs for a respective animal within either the Train or Test data set. An example of the data is below with a random…
AlmostThere
  • 557
  • 1
  • 11
  • 26
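For questions like the one above, where all rows sharing an ID must land on the same side of the split, scikit-learn's GroupShuffleSplit is one possible tool. The following is a rough sketch with made-up animal data (the column values, group labels, and split fraction are all hypothetical, not taken from the question):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: 8 rows, each tagged with the group (animal) it belongs to.
animal_ids = np.array(["cat", "cat", "dog", "dog", "owl", "owl", "fox", "fox"])
X = np.arange(16).reshape(8, 2)

# One shuffle-based split that assigns whole groups to either train or test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, groups=animal_ids))

X_train, X_test = X[train_idx], X[test_idx]
train_groups = set(animal_ids[train_idx])
test_groups = set(animal_ids[test_idx])

print(sorted(train_groups), sorted(test_groups))
```

Here test_size refers to the fraction of groups, not rows, so the row-level proportions can drift when group sizes are uneven.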