Questions tagged [train-test-split]

Questions with this tag are about how to split a machine learning data set into random train and test subsets.

In particular, questions with this tag may ask how to split data using scikit-learn. In scikit-learn, a random split into training and test sets can be computed quickly with the train_test_split helper function.

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
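
As a rough illustration of the helper mentioned above, here is a minimal sketch of a reproducible, stratified split. The toy arrays and the 30% test size are assumptions chosen only for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumed for illustration): 10 samples, 2 features, 2 balanced classes.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 30% of the rows. random_state fixes the shuffle so the split is
# reproducible; stratify=y keeps the class proportions similar in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```

Note the order of the returned values: both feature subsets come first, followed by both label subsets.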

428 questions
0 votes, 1 answer

Scikit learn Stratified Shuffle Split does not work when one of the classes has just one instance

I am trying to split my dataset into a train and a test set using scikit-learn's stratified shuffle split, but it does not work because one of the classes has just one instance. It would be okay if that one instance goes into either of train or…
0 votes, 1 answer

Why am I getting an error with GroupShuffleSplit (train test split)?

I have 2 datasets and am applying 5 different ML models. Dataset 1: def dataset_1(): ... ... bike_data_hours = bike_data_hours[:500] X = bike_data_hours.iloc[:, :-1].values y = bike_data_hours.iloc[:, -1].values X_train, X_test,…
0 votes, 1 answer

Should I perform train_test_split first and then GridSearchCV and then K Fold Crossvalidation?

I am having a lot of confusion between GridSearchCV and K fold Cross Validation. I know that GridSearch is only for hyperparameter optimization and K Fold will split my data into K folds and iterate over them (cv value). So should I first split my…
spectre
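
One common ordering for the workflow asked about above is sketched below: hold out a test set first, let GridSearchCV run k-fold cross-validation on the training portion only, then score once on the held-out data. The iris data, the SVC estimator, and the parameter grid are assumptions for illustration, not taken from the question.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Step 1: set aside a test set that the hyperparameter search never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Step 2: GridSearchCV performs 5-fold cross-validation internally (cv=5)
# for every candidate in the grid, using only the training data.
param_grid = {"C": [0.1, 1, 10]}  # hypothetical grid
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

# Step 3: evaluate the refitted best model once on the held-out test set.
test_score = search.score(X_test, y_test)
print(search.best_params_, test_score)
```

With this layout the k-fold step is part of the grid search itself, so no separate cross-validation loop is needed for tuning.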
0 votes, 1 answer

Train test split mysql records into views

How do I create two views, one for training data and the other for test data, with a 70:30 split in MySQL? CREATE VIEW training_data AS SELECT Posts.post_content as post_content, CASE WHEN (Posts.post_title like '%covid%corona%covid19%' or…
0 votes, 1 answer

"ValueError: Found input variables with inconsistent numbers of samples: [40, 10]" Problem with splitting the data

I am using sample data from a Udemy course for the sake of training. There are 51 rows in the data and I am trying to print the score of the model. The error I get is: ValueError: Found input variables with inconsistent numbers of samples: [40,…
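
A frequent cause of this kind of error is unpacking the result of train_test_split in the wrong order, so that a test array is passed where the training labels belong. Whether that applies to this question cannot be told from the excerpt. The sketch below uses made-up data and a LinearRegression model, both assumptions, to show the expected order.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up data (assumed): 50 rows, one feature, a noiseless linear target.
X = np.arange(50, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0

# The return order is X_train, X_test, y_train, y_test.
# Writing "X_train, y_train, X_test, y_test" instead would pair 40 training
# rows with 10 test rows and trigger the inconsistent-samples error on fit.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
score = model.score(X_test, y_test)
print(len(X_train), len(X_test), score)
```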
0 votes, 0 answers

Getting same feature transformation via PCA for test set fails

In an ML project you first separate out your train and test data sets, and you carry out all your transformations on the train data set to make sure information leakage doesn't take place. To be more precise: X_train, X_test, y_train, y_test =…
add-semi-colons
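
The leakage-free pattern described above can be sketched as follows: fit the PCA on the training split only, then reuse the fitted object to transform the test split. The random data and the choice of two components are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Random data (assumed): 100 samples, 5 features, binary labels.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))
y = rng.randint(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Learn the projection from the training data only.
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)

# Apply the same fitted projection to the test data. Call transform here,
# not fit_transform, so no test-set statistics influence the projection.
X_test_pca = pca.transform(X_test)

print(X_train_pca.shape, X_test_pca.shape)
```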
0 votes, 1 answer

train_test_split exception with 2D labels as stratify array

I'm trying to use the train_test_split function by providing the labels array that is a 2-d array for stratifying, with only 0 or 1 values (i.e. [0,0], [0,1], [1,0] or [1,1] are the four possible labels). I cannot rename labels (e.g. to 1,2,3,4 for…
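
One possible workaround, sketched under assumptions, is to build a separate 1-D key only for the stratify argument while leaving the 2-D label array itself untouched. The data and the key encoding below are hypothetical, and behaviour with a 2-D stratify array may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 40 samples, with each of the four label pairs
# [0,0], [0,1], [1,0], [1,1] appearing 10 times.
rng = np.random.RandomState(0)
X = rng.normal(size=(40, 3))
y = np.tile(np.array([[0, 0], [0, 1], [1, 0], [1, 1]]), (10, 1))

# Combine the two binary columns into one key per row (values 0..3).
# The key is used only for stratification; y keeps its 2-D form.
keys = y[:, 0] * 2 + y[:, 1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=keys
)

print(y_train.shape, y_test.shape)
```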
0 votes, 1 answer

Why random_state differs in train_test_split of scikit-learn

I've been writing some code for a credit card fraud detection problem using scikit-learn. I used train_test_split to split my data into training, test and validation data…
0 votes, 1 answer

Random Forest Train Test Split Accuracy

I am working through a random forest model for the first time and have come across an issue with my accuracy quantification. Currently, I split the dataset (30% as test size), fit the model, then predict y values based on my model, and score the…
0 votes, 1 answer

Using Catboost Classifier to convert categorical columns

I'm trying to apply CatBoost to one of my columns for categorical features but get the following error: CatBoostError: Invalid type for cat_feature[non-default value idx=0,feature_idx=2]=68892500.0 : cat_features must be integer or string, real number…
AJ.
0 votes, 1 answer

ValueError: Found input variables with inconsistent numbers of samples: [1319, 245]

I am facing issues related to train_test_split: final = [] final.append(dataset) final.append(dataset1) X = dataset[:,0:2] y = dataset1[:,2] X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15,…
Syed Ali Abbas
0 votes, 3 answers

How can I do a train test split in scikit-learn?

Does anyone know what the problem is? x=np.linspace(-3,3,100) rng=np.random.RandomState(42) y=np.sin(4*x)+x+rng.uniform(size=len(x)) X=x[:,np.newaxis] from sklearn.model_selection import train_test_split X_train, X_test, y_train,…
Reza
0 votes, 1 answer

Create random train-test split of defined proportion while maintaining exclusivity of one attribute in each set

I have multiple sets of different lengths and I wish to randomly sort these sets into two supersets such that: any one set appears in only one superset, and the sum of the lengths of all sets in a superset is as close as possible to a defined…
0 votes, 1 answer

Is there a way to solve this error concerning StratifiedShuffleSplit?

I am a newbie in ML and I have been trying out the Udacity ML project. However, I got an error that I am having a hard time solving. The code seems okay but I can't seem to iterate through the data. I know that it has to do with the new…
0 votes, 0 answers

How to apply Word2Vec on SVM

I am not sure how to fit my SVM model with a Word2Vec training data set. What should I put instead of the question mark in the code below? model = gensim.models.Word2Vec(sentences= df['meaningful_words']) Train_X, Test_X, Train_Y, Test_Y =…
Pegah