Questions tagged [train-test-split]

Questions with this tag are about how to split the machine learning data set into random train and test subsets.

In particular, questions with this tag may be aimed at better understanding how to split data with scikit-learn. In scikit-learn, a random split into training and test sets can be computed quickly with the train_test_split helper function.

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
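
A minimal sketch of the helper referenced above, using a small placeholder array (the data and the 75/25 ratio are purely illustrative):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
y = np.arange(10)                  # 10 matching labels

# 25% of the rows go to the test set; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)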

428 questions
0 votes • 0 answers

Invalid shape when I train_test_split

I am getting a shape error when I train_test_split the data. The code is as follows: y = data['Cover_type'] X = data.drop('Cover_Type',axis=1). The train_test_split itself doesn't give me an error, but when I fit the GradientBoostingRegressor and find the…
Sajjad Ali • 33 • 6
0 votes • 0 answers

"ValueError: array length 293 does not match index length 975" while applying random forest

I am trying to apply a random forest and I am getting this error: "ValueError: array length 293 does not match index length 975". Please find the code snippet below. Can anyone please tell me what I am doing wrong? Code: from sklearn.model_selection…
0 votes • 1 answer

How can I split a dataframe using sklearn train test split such that there are equal proportions for each category?

I have a dataset with n independent variables and a categorical variable that I would like to perform a regression analysis on. The number of rows of data is different for each category. I would like to split the dataset into test and train data…
Hoppity81 • 61 • 8
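
One way to keep the category proportions equal in both subsets is the stratify argument, sketched here on a made-up toy frame (the column names are hypothetical, not taken from the question):

import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical frame: two features plus a categorical target column
df = pd.DataFrame({'x1': range(12),
                   'x2': range(12, 24),
                   'category': ['a', 'b', 'c'] * 4})

X = df.drop(columns=['category'])
y = df['category']

# stratify=y keeps each category's share (roughly) the same in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
print(y_train.value_counts(), y_test.value_counts(), sep='\n')
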
0 votes • 0 answers

My True/False statements in my dataframe change over time in the code (tensorflow.keras)

So I'm using the NASA asteroids dataset with tensorflow.keras for a university assignment. The first thing I wanted was to standardize the data, so I use (1) df = dfprime ss = StandardScaler() df_scaled = df #df.iloc[:,:-1] df_scaled =…
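
A likely cause (an assumption, since the snippet is truncated) is that df_scaled = df only copies the reference, so scaling df_scaled also rewrites df and dfprime. A sketch with a toy frame and a made-up boolean column:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# toy stand-in for the asteroids frame; 'hazardous' is a hypothetical boolean column
dfprime = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'hazardous': [True, False, True]})
df = dfprime                # alias: both names point at the same object

df_scaled = df.copy()       # an explicit copy keeps the original untouched
ss = StandardScaler()
df_scaled.iloc[:, :-1] = ss.fit_transform(df_scaled.iloc[:, :-1])

print(dfprime['hazardous'].tolist())   # still [True, False, True]
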
0 votes • 1 answer

Is the result of a train/test split the same on different machines with a set random_state?

I want to reduce randomness when training models on different machines, and I was wondering whether setting the random_state parameter in sklearn's train_test_split always gives the same results. Is it system-dependent or not? So when running this code on…
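
A short sketch of the behaviour being asked about: with a fixed random_state the chosen indices are deterministic, so the same call gives the same split on any machine (assuming the same scikit-learn/NumPy versions):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(-1, 1)

# two independent calls with the same seed select exactly the same rows
a_train, a_test = train_test_split(X, test_size=0.3, random_state=7)
b_train, b_test = train_test_split(X, test_size=0.3, random_state=7)
print(np.array_equal(a_train, b_train), np.array_equal(a_test, b_test))  # True True
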
0 votes • 1 answer

Why does my cross-validation consistently perform better than train-test split?

I have the code below (using sklearn) that first uses the training set for cross-validation, and for a final check, uses the test set. However, the cross-validation consistently performs better, as shown below. Am I over-fitting on the training…
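
A sketch of the comparison being described, on a bundled toy dataset rather than the asker's data: k-fold scores on the training portion next to a single score on the held-out test set:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(model, X_train, y_train, cv=5)  # 5-fold CV on training data

model.fit(X_train, y_train)
print('CV mean:', cv_scores.mean())
print('Test score:', model.score(X_test, y_test))
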
0 votes • 1 answer

Error in using accuracy_score from sklearn in Logistic Regression

I am doing a Logistic Regression with the Elastic Net regularization method. I am trying to predict which variables are associated positively or negatively. After running accuracy_score(y_true, y_pred) I got an error:…
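
A sketch of the setup described, using synthetic data; a frequent cause of accuracy_score errors is passing probabilities or continuous values instead of predicted class labels, so predict() is used here:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# the elastic-net penalty requires the saga solver and an l1_ratio
clf = LogisticRegression(penalty='elasticnet', solver='saga',
                         l1_ratio=0.5, max_iter=5000)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)          # class labels, not predict_proba output
print(accuracy_score(y_test, y_pred))
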
0 votes • 1 answer

How to split train and test data from a .mat file in sklearn?

I have an MNIST dataset as a .mat file and want to split train and test data with sklearn. The .mat file reads as below: {'__header__': b'MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Sat Oct 8 18:13:47 2016', '__version__': '1.0', …
BlueCurve • 33 • 1 • 7
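
A sketch of one way to do this: SciPy's loadmat reads the dictionary shown above, and train_test_split then operates on the arrays. The key names 'X' and 'y' and the file name are placeholders; the real keys depend on how the .mat file was saved (inspect mat.keys() first):

from scipy.io import loadmat
from sklearn.model_selection import train_test_split

mat = loadmat('mnist.mat')     # placeholder path
print(mat.keys())              # '__header__', '__version__', plus the data keys

X = mat['X']                   # hypothetical feature array
y = mat['y'].ravel()           # hypothetical label array, flattened to 1-D

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
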
0 votes • 2 answers

Splitting data into x_train and x_test gives error: Too many values to unpack (expected 2)

Whenever I try to split the data into x_train and x_test I get the following error: Too many values to unpack (expected 2). My code: import glob import matplotlib.pyplot as plt import numpy as np import matplotlib.image as mpimg for img in…
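
For reference, train_test_split returns two arrays per input, so this error usually means the number of names on the left does not match; a sketch with toy arrays:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(8, 3)
y = np.arange(8)

# one input array -> two outputs
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# two input arrays -> four outputs; unpacking these into only two names
# raises "too many values to unpack (expected 2)"
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
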
0 votes • 1 answer

Append data to training dataset after train test split

I have split my training and test datasets using a train/test split: lengths = [int(len(supervised_data)*0.8),int(len(supervised_data)*0.2)+1] train_data, test_data = torch.utils.data.random_split(supervised_data, lengths). Now I am trying…
manlike • 45 • 8
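
One way to append samples to the training part after random_split is ConcatDataset; a sketch with toy tensors (the shapes and sizes are placeholders):

import torch
from torch.utils.data import ConcatDataset, TensorDataset, random_split

supervised_data = TensorDataset(torch.randn(100, 4), torch.randn(100, 1))
train_data, test_data = random_split(supervised_data, [80, 20])

extra = TensorDataset(torch.randn(10, 4), torch.randn(10, 1))
train_data = ConcatDataset([train_data, extra])   # training set grows to 90 samples
print(len(train_data), len(test_data))            # 90 20
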
0 votes • 1 answer

Is it fair to base model evaluation on just "train_test_split"?

I'm absolutely confused about model evaluation, interpreting its results, and using cross_val_score. I don't understand why evaluation on a test set is usually considered a final and solid result, while if we just choose another split, we'll get a…
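
The concern can be checked directly by scoring the same model over several random splits instead of one, for example with ShuffleSplit; a sketch on a bundled dataset:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# ten different random 75/25 splits instead of a single train_test_split
cv = ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print(scores.mean(), scores.std())   # the spread shows how much one split can vary
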
0 votes • 0 answers

Error when attempting to predict with estimator (matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?))

I have a dataset with 329 features because of one-hot encoding, and I am trying to fit and predict with a linear regression after splitting it into training and test sets. When I try to predict with my y_test I get this…
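
Judging from the wording, a likely cause (an assumption, since the code is truncated) is calling predict on the targets rather than the features; predict expects the feature matrix, and passing y_test produces exactly this kind of matmul shape mismatch. A sketch with random data of the same width:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = np.random.rand(50, 329)    # 329 columns, as in the question
y = np.random.rand(50)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)           # pass X_test here, not y_test
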
0 votes • 1 answer

Train/validate/test split for time series anomaly detection

I'm trying to perform a multivariate time series anomaly detection. I have training data that consists of "normal" data. I train on this data and detect anomalies on the test set that contains normal + anomalous data. My understanding is that it…
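
For time series the split is usually chronological rather than random; shuffle=False in train_test_split keeps the temporal order. A sketch with an ordered placeholder array:

import numpy as np
from sklearn.model_selection import train_test_split

series = np.arange(100).reshape(-1, 1)   # ordered observations

# first 70% train, next 15% validation, last 15% test, all in time order
train, rest = train_test_split(series, test_size=0.30, shuffle=False)
valid, test = train_test_split(rest, test_size=0.50, shuffle=False)
print(train[-1], valid[0], test[0])      # [69] [70] [85]
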
0 votes • 1 answer

Issue creating data for training and testing using 3 folders containing images

I am running: path = Path('/content/drive/MyDrive/X-Ray_Image_DataSet') np.random.seed(41) data = ImageDataBunch.from_folder(dta, train="Train", valid ="Valid", ds_tfms=get_transforms(),size=(256,256), bs=32, num_workers=4).normalize() And I am…
0 votes • 2 answers

Problem when splitting data: KeyError: "None of [Int64Index([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13], dtype='int64')] are in the [columns]"

I am attempting to execute a train test split on some data (wine.data), but when initializing x and y: import numpy as np import matplotlib.pyplot as plt import pandas as pd from sklearn.neural_network import MLPClassifier from sklearn.model_selection…
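
That KeyError typically means integer positions were used as column labels (e.g. wine[[1, 2, ...]]); position-based selection needs .iloc. A sketch that follows the question's wine.data file, assuming the usual UCI layout with the class label in column 0 and the 13 features after it:

import pandas as pd
from sklearn.model_selection import train_test_split

wine = pd.read_csv('wine.data', header=None)   # no header row in the UCI file

X = wine.iloc[:, 1:14]   # feature columns selected by position
y = wine.iloc[:, 0]      # class label in the first column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)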