Questions tagged [train-test-split]

Questions with this tag are about how to split the machine learning data set into random train and test subsets.

In particular, questions with this tag may ask how to split data using scikit-learn. In scikit-learn, a random split into training and test sets can be computed quickly with the train_test_split helper function.

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
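
A minimal sketch of the helper described above, assuming a small toy array (the data, the 80/20 ratio, and the random_state value are illustrative choices, not part of the tag description). The stratify argument is optional and is used here only to keep the class proportions equal in both subsets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 samples, 2 features, balanced binary labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 20% of the rows as the test set.
# random_state makes the shuffle reproducible; stratify=y preserves the
# class ratio of y in both the training and the test subset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)
```

With 10 samples and test_size=0.2, the training set receives 8 rows and the test set 2. No row appears in both subsets.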

428 questions
0
votes
2 answers

Tensorflow auto split image

Suppose I have directories like this: full_dataset |---horse <= 40 images of horse |---donkey <= 50 images of donkey |---cow <= 80 images of cow |---zebra <= 30 images of zebra Then I write this with TensorFlow: image_generator =…
Ichsan
  • 768
  • 8
  • 12
0
votes
1 answer

train test data split using stratify on two columns in scikit-learn

I have a dataset that I want to split into train and test so that I have data in the test set from each data source (specified in column "source") and from each class (specified in column "class"). I read about using the parameter stratify with…
A_Matar
  • 2,210
  • 3
  • 31
  • 53
0
votes
0 answers

Splitting the dataset for classification

I am trying to train and test a classification model; however, I don't understand why I am getting this error: ValueError: The test_size = 9 should be greater or equal to the number of classes = 11 What does this error mean? My code for splitting the…
Momo
  • 31
  • 7
0
votes
2 answers

Is there a way to do a stratified train/test split without shuffling the data?

I'm using time-sensitive data and would like to maintain the order of the data while still stratifying it, since I've got multiple labels. I haven't found any libraries that allow this.
0
votes
0 answers

Is it possible to shuffle a dataframe while grouping by index in pandas or sklearn?

I have dataframe df, containing patient data, as shown below: | patient_id | x | y | path | target | |------------ |----- |----- |------ |-------- | | 4423 | 234 | 53 | .... | 1 | | 4423 | 259 |…
0
votes
3 answers

Is it possible to train on 4 features and test using only some of those features?

I have trained on four features (Month, Day, Hour and Temperature) to predict some value. What I want to do is predict the value on the basis of the month, hour and day of the next day only, because I don't know the temperature of the next day (which…
0
votes
1 answer

K-folds: do we still need to implement train_test_split?

I've been reading quite a bit and I'm a little confused about k-folds. I understand the concept behind it, but I'm not sure how to deploy it. The usual step that I've been seeing after data exploration is train_test_split, encoding and scaling…
Jonathan
  • 424
  • 4
  • 14
0
votes
1 answer

Why do we include the target class in both the arrays in train_test_split?

X_train, test_df, y_train, y_test = train_test_split(result, y_true, stratify = y_true, test_size = 0.2) In the above sample use of train_test_split, result is the data frame and y_true is a numpy array formed from the target class column from the…
user12518608
0
votes
2 answers

ImportError: cannot import name 'LatentDirichletAllocation'

I'm trying to import the following: from sklearn.model_selection import train_test_split and got the following error. Here's the stack trace: ImportError Traceback (most recent call last) in…
0
votes
1 answer

How do I predict future results with scikit-learn and pandas in Python using RandomForestRegressor?

Hello I came across this tutorial on how to use python with some libraries to predict future NCAAB games using a sportsreference library. I will post the code as well as the article. This seems to work well, but I think it is only testing based on…
0
votes
2 answers

sklearn train_test_split returns some elements in both test/train

I have a data-set X with 260 unique observations. When running x_train,x_test,_,_=test_train_split(X,y,test_size=0.2) I would assume that [p for p in x_test if p in x_train] would be empty, but it is not. Actually it turns out that only two…
CutePoison
  • 4,679
  • 5
  • 28
  • 63
0
votes
1 answer

How to split data by using train_test_split in Python NumPy into train, test and validation data sets? The split should not be random

I want to split data category-wise into train, test and validation sets. For example: if we have 3 categories (positive, negative and neutral) in the dataset, the positive category is split into train, test, and validation. And the same with the other two…
user85181
  • 11
  • 1
0
votes
1 answer

Found input variables with inconsistent numbers of samples: [24, 25]

I need assistance reshaping my input to match my output. I believe my issue is with my target variable. I am getting the error as stated in the title. I have tried .reshape and .flatten(). Please help, and thanks in advance NEnews_train = [] for…
Deja Bond
  • 1
  • 1
  • 3
0
votes
1 answer

How to print the classified points based on SVM classifier

I was using an "svm" classifier to classify whether something was a bike or a car. My features were columns 0, 1 and 2, and the dependent variable was the 3rd column. I can clearly see the classification, but I don't know how to print all the points based on classification in…
0
votes
1 answer

Error while fitting train and test sets, train_test_split method

I am trying to evaluate my model with train_test_split. I have defined the following functions to create the output array on the table (top column) according to the input in function: def top_sh(num): ###Get the top(num) in Shanghai data and…