Questions tagged [train-test-split]

Questions with this tag are about how to split a machine learning data set into random train and test subsets.

In particular, questions with this tag may ask how to split data using scikit-learn. In scikit-learn, a random split into training and test sets can be computed quickly with the train_test_split helper function.

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
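
As a rough illustration of the helper described above, here is a minimal sketch of a random split with train_test_split. The toy data, the 80/20 ratio, and the seed value are assumptions chosen for the example, not something the tag description prescribes.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with one feature each, plus binary labels.
X = [[i] for i in range(10)]
y = [0, 1] * 5

# Hold out 20% of the samples for testing. Fixing random_state makes the
# shuffle reproducible, so repeated runs return the same split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))
```

Passing a different random_state (or none at all) gives a different random split of the same size.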

428 questions
7
votes
2 answers

How to split dataset to train, test and valid in Python?

I have a dataset like this my_data= [['Manchester', '23', '80', 'CM', 'Manchester', '22', '79', 'RM', 'Manchester', '19', '76', 'LB'], ['Benfica', '26', '77', 'CF', 'Benfica', '22', '74', 'CDM', 'Benfica', '17', '70', 'RB'], ['Dortmund',…
dede.brahma
6
votes
3 answers

processing before or after train test split

I am using this excellent article to learn Machine learning. https://stackabuse.com/python-for-nlp-multi-label-text-classification-with-keras/ The author has tokenized the X and y data after splitting it up. X_train, X_test, y_train, y_test =…
shantanuo
6
votes
1 answer

dimension mismatch error in CountVectorizer MultinomialNB

Before I lodge this question, I have to say I've thoroughly read more than 15 similar topics on this board, each with somehow different recommendations, but all of them just could not get me right. Ok, so I split my 'spam email' text data…
Chris T.
6
votes
3 answers

Randomly distribute files into train/test given a ratio

I am at the moment trying to make a setup script capable of setting up a workspace for me, so that I don't need to do it manually. I started doing this in bash, but quickly realized that would not work that well. My next idea was to do it…
Mønster
5
votes
3 answers

Splitting datasets into train and test in julia

I am trying to split the dataset into train and test subsets in Julia. So far, I have tried using MLDataUtils.jl package for this operation, however, the results are not up to the expectations. Below are my findings and issues: Code # the inputs…
Mohammad Saad
5
votes
1 answer

What to make of a flat validation accuracy curve in a learning curve graph

While plotting a learning curve to see how well the model building was going, I realized that the validation accuracy curve was a straight line from the get-go. I thought maybe it was just due to some error in splitting the data into training and…
5
votes
1 answer

stratify argument in train_test_split vs StratifiedShuffleSplit

What is the difference between using the stratify argument in train_test_split function of sklearn, and the StratifiedShuffleSplit function? Don't they do the same thing?
Rohan Pinto
4
votes
3 answers

How to split datatable dataframe into train and test dataset in python

I am using datatable dataframe. How can I split the dataframe into train and test dataset? Similarly to pandas dataframe, I tried to use train_test_split(dt_df,classes) from sklearn.model_selection, but it doesn't work and I get error. import…
ibra
4
votes
1 answer

Undersampling for imbalance data after train test split

I am working on a project with imbalanced data. I want to balance the data using random undersampling. I am confused about whether I should do the undersampling after the train/test split, or do the undersampling first and then the train/test split. My approach…
sarika
4
votes
1 answer

Use only N Images using ImageDataGenerator from each class

There are 10 directories(labels) each with 800 images. I'm trying to use transfer learning to train my model. The data is loaded using ImageDataGenerator as shown below: train_datagen = ImageDataGenerator(rescale=1./255, shear_range=0.2, …
Jedi Nerd
4
votes
4 answers

train_test_split( ) method of scikit learn

I am trying to create a machine learning model using DecisionTreeClassifier. To train & test my data I imported train_test_split method from scikit learn. But I can not understand one of its arguments called random_state. What is the…
4
votes
1 answer

How does Machine Learning algorithm retain learning from previous execution?

I am reading the Hands on Machine Learning book, and the author talks about the random seed during the train and test split. At one point, the author says that over time the machine will see your whole dataset. The author is using the following function for…
Sachin Rastogi
4
votes
2 answers

Problems with diagnostics of prophet forecast

I am working with a dataset of crimes in Chicago, specifically on predicting the future crime rate in Chicago (I have data from 2012 to 2016). I generated a forecast using Facebook's prophet package. It worked very well and all…
Scrappy
4
votes
1 answer

Getting Validation set from Train set by using percentage from groupby() in pandas

Have a train dataset with multi-class target variable category train.groupby('category').size() 0 2220 1 4060 2 760 3 1480 4 220 5 440 6 23120 7 1960 8 64840 I would like to get the new validation dataset from…
Keithx
4
votes
2 answers

ML.NET TrainTestSplit random seed

I am using TrainTestSplit in ML.NET, to repeatedly split my data set into a training and test set. In e.g. sklearn, the corresponding function takes a seed as an input, so that it is possible to obtain different splits, but in ML.NET repeated calls…
Petter T