Questions tagged [train-test-split]

Questions with this tag are about how to split a machine learning data set into random train and test subsets.

In particular, questions with this tag often ask how to split data using scikit-learn. In scikit-learn, a random split into training and test sets can be computed with the train_test_split helper function.

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
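
As a rough illustration of the helper mentioned above, here is a minimal sketch of an 80/20 split. The toy arrays `X` and `y` are made up for this example and are not taken from any question under this tag; `test_size` and `random_state` are ordinary `train_test_split` parameters.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (invented for illustration): 10 samples, 2 features, binary labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 20% of the rows for testing. Fixing random_state makes the
# shuffle, and therefore the split, reproducible between runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

The function returns the pieces in the order train features, test features, train labels, test labels, with the rows of `X` and `y` kept aligned within each subset.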

428 questions
0
votes
0 answers

R studio Knit error "incorrect number of dimensions"

I've encountered this knit issue with R studio. I have a dataset with dimension (543, 31) and I split it into train and test with: set.seed(1) train=sample(c(TRUE ,FALSE), nrow(dataset),rep=TRUE) test=(!train) y.test=y[test] And then I applied…
efsee
  • 579
  • 1
  • 10
  • 22
0
votes
1 answer

How to split a dataset into train and test and merge their corresponding "class" in R

I am using the wisconsin dataset which has two categorical columns IDs and class. In order to carry out classification I must drop these two columns from the dataframe and then split the dataset into train and test (80%:20%). I have this done but…
0
votes
2 answers

KeyError when trying to randomize a column of a dataframe

Minimal Example: Consider this dataframe temp: temp = pd.DataFrame({"A":[1,2,3,4,5,6,7,8,9,10],"B":[2,3,4,5,6,7,8,9,10,11],"C":[3,4,5,6,7,8,9,10,11,12]}) >>> temp A B C 0 1 2 3 1 2 3 4 2 3 4 5 3 4 5 6 4 5 6 7 5…
Mooncrater
  • 4,146
  • 4
  • 33
  • 62
0
votes
1 answer

Behaviour of train_test_split() from Scikit-learn

I am curious how the train_test_split() method of Scikit-learn will behave in the following scenario: An imaginary dataset: id, count, size 1, 4, 8 2, 5, 9 3, 6, 0 say I would divide it into two separate sets like this (keeping 'id' in both): id,…
NG.
  • 459
  • 1
  • 6
  • 20
0
votes
1 answer

Convert float value to integers in Pandas dataframe while ignoring null values

I have two separate CSV files that I read into a pandas dataframe. I've already done a bit of cleaning and joined the tables by their date column. I have another column called 'ExerciseTime' and converted the imported time format of the time of day…
DEB
  • 241
  • 1
  • 2
  • 7
0
votes
1 answer

Create train and test variables from loaded arff file

I want to perform multilabel classification. I have a dataset in arff format which I load. However, I don't know how to convert the imported data to X and y vectors in order to apply sklearn/train_test_split. How can I get X and y? data, meta =…
0
votes
1 answer

Wrong train/test split strategy

The question is about a wrongly chosen strategy for train/test splitting in a RandomForest model. I know choosing the test set this way gives the wrong output but I would like to know why. (The model looks at previous days of data and tries to…
DBSE
  • 305
  • 3
  • 8
0
votes
1 answer

Train Test Split for a list of dataframes - Pandas

I have a list of DataFrames that I want to split into train and test sets. For a single DataFrame, I could do the following, Get the length of test split split_point = len(df)- 125 and then, train, test = df[0:split_point], df[split_point:] This…
i.n.n.m
  • 2,936
  • 7
  • 27
  • 51
0
votes
1 answer

Problems with the random-state parameter on data splitting with sklearn

When I look for the random_state parameter in sklearn's documentation, this is what I find: random_state : int or RandomState Pseudo-random number generator state used for random sampling. I don't understand very well what it is. The accuracy…
0
votes
0 answers

How to split the data in python and predict the value of next month

I have a Dataset, where I need to predict the Energy Consumption. I have the September data, and need to predict the October values. I need to predict the values of KWH for Oct. How do I write a python code, where September data would be my train…
Anagha
  • 3,073
  • 8
  • 25
  • 43
0
votes
1 answer

ValueError: bad input shape (60, 4) Iris dataset train_test_split

I received an input shape error when using train_test_split for iris. I don't understand why. I have tested other datasets. train_test_split should handle this shape. Any suggestions? Thanks # Decision Tree Classifier from sklearn import…
Muten_Roshi
  • 541
  • 2
  • 7
  • 16
0
votes
2 answers

How to get the result auc using scikit

Hi, I want to combine train/test split with cross-validation and get the results as AUC. My first approach works, but it reports accuracy. # split data into train+validation set and test set X_trainval, X_test, y_trainval, y_test =…
xav
  • 391
  • 2
  • 10
-1
votes
0 answers

Getting Value Error inconsistent number of samples on X_train, y_train even when the shapes of the X_train, y_train are same

Most likely bug -> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) the fit call to GridSearchCV -> gs_mnb.fit(X_train, y_train) here's the pipeline used in my code -> pipe_mnb = Pipeline([ ('vect',…
-1
votes
3 answers

why Train/Test-split in ML?

I can't understand why we need to split dataset in machine learning. And why this train-test-split algorithm gives four parameters(x_train, x_test, y_train, y_test)? I see many videos and read some blogs, they explain a lot of reasons. No one agree…
-1
votes
1 answer

How to remove cross-validation with train_test_split?

My code: X = data['text_with_tokeniz_lemmatiz'] y = data['toxic'] X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, train_size=0.8, test_size=0.2, shuffle=False, random_state=12345) X_valid, X_test, y_valid, y_test = train_test_split(X_tmp,…
Kirill
  • 1
  • 2