Questions tagged [train-test-split]

Questions with this tag are about how to split a machine learning data set into random train and test subsets.

In particular, questions with this tag may ask how to split data using scikit-learn. In scikit-learn, a random split into training and test sets can be computed quickly with the train_test_split helper function.

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
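
As a minimal sketch of the helper described above, the following splits a small array into train and test subsets. The toy arrays `X` and `y` are assumptions made up for illustration; `test_size` and `random_state` are parameters of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): 10 samples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 20% of the samples; random_state fixes the shuffle so the
# split is reproducible between runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)
```

Rows of `X` and entries of `y` are shuffled together, so each feature row stays paired with its label.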

428 questions
2
votes
1 answer

Splitting train test sets for Node2vec link prediction in Stellargraph

I'm trying to understand how to use Stellargraph's EdgeSplitter class. In particular, the examples in the documentation for training a link prediction model based on Node2Vec split the graph into the following parts: Distribution of samples across…
2
votes
1 answer

scikit learn test_data_split: ValueError: Found input variables with inconsistent numbers of samples:[4999, 5000]

Here is my code print(len(image_dataset.data)) print(len(phylum_target)) X_train, X_test, y_train, y_test = train_test_split(image_dataset.data, phylum_target, test_size=0.2,random_state=109) And here is output and Error 5000 5000 Traceback (most…
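
The excerpt above is truncated, so the actual cause in that question is not known here. As a hedged illustration of the general condition only, `train_test_split` raises this ValueError when the arrays passed to it have different numbers of rows. The arrays below are hypothetical stand-ins, not the asker's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins: the feature matrix is one row short of the labels.
X = np.zeros((4999, 3))
y = np.zeros(5000)

try:
    train_test_split(X, y, test_size=0.2, random_state=109)
    error_message = None
except ValueError as exc:
    # scikit-learn checks that all inputs share the same first dimension.
    error_message = str(exc)

print(error_message)
```

Comparing `len(X)` and `len(y)` on the exact objects passed to the function is usually the first thing to check.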
2
votes
0 answers

scikit learn train_test_split() splitting data unexpectedly

I'm facing this issue where sklearn's train_test_split() is dividing data sets abruptly in case of large data sets. I'm trying to load the entire data set of 118 MB, and it is assigning test data less than 10 times of what is expected of code. Case…
2
votes
1 answer

Train and test split set using ImageDataGenerator and flow

I'm trying to make a network using augmentation. First I use ImageDataGenerator with validation_split=0.2. train_generator = ImageDataGenerator( rotation_range=90, zoom_range=0.15, width_shift_range=0.2, height_shift_range=0.2, …
2
votes
2 answers

Dimensional problem in using train test split

from sklearn.model_selection import train_test_split predictors=data.drop(['target'],axis=1) targets=data['target'] train_x,test_x,train_y,test_y=train_test_split(predictors,targets,test_size=0.2,random_state=0) shape of train_x is…
2
votes
1 answer

NameError: name 'skimage' is not defined

I'm trying to figure out how to use SVM for image classification using images from my own dataset, for which I'm using the notebook from this link: https://github.com/whimian/SVM-Image-Classification. The problem is that, for whatever other project I…
user11597888
2
votes
1 answer

Cannot impute 1D array with fit_transform from sklearn library (split-test)

I'm trying to impute a 1D array with shape (14599,) using SimpleImputer with the most_frequent strategy, but it says it expected a 2D array. I already tried reshaping it to (-1,1) and (1,-1), but I get the error ValueError: could not broadcast input array from shape…
random student
  • 683
  • 1
  • 15
  • 33
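
A minimal sketch of imputing a single column with `SimpleImputer`, assuming a small made-up float array with `np.nan` as the missing marker. It is not the asker's code and does not diagnose the truncated error. The imputer expects a 2D input of shape (n_samples, n_features), so a 1D array is reshaped to one column and flattened again afterwards.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up 1D data (an assumption); np.nan marks the missing entries.
values = np.array([1.0, 2.0, np.nan, 2.0, np.nan, 3.0])

imputer = SimpleImputer(strategy="most_frequent")

# reshape(-1, 1) turns the vector into a single-feature column, shape (6, 1).
imputed_2d = imputer.fit_transform(values.reshape(-1, 1))

# ravel() returns to the original 1D layout, shape (6,).
imputed = imputed_2d.ravel()
print(imputed)
```

Reshaping to (1, -1) would instead treat every value as a separate feature of one sample, which is rarely what is wanted for a single column.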
2
votes
1 answer

Error message when I try to do train test split on credit card default data

I tried to do a train test split on credit card default data from https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients# This is my code: import sklearn import pandas as pd data = pd.read_excel("default of credit card clients.xls",…
2
votes
2 answers

How to split data into train and test keeping in mind the groupby column in pandas?

I would like to split the data set into test and train dataset in the ratio 20:80. However, while splitting, I do not want to split in a manner that 1 S_Id value has few data points in train and other data points in test. I have a dataset as: S_Id …
Jupyter
  • 131
  • 1
  • 10
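
One possible approach to the group-aware split asked about above, sketched with scikit-learn's `GroupShuffleSplit`. The data frame contents are invented for illustration; only the `S_Id` column name comes from the question. Note that `test_size` here is a fraction of the groups, not of the rows, so the row ratio is only approximately 20:80 when groups differ in size.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Invented example data: five S_Id groups with two rows each.
df = pd.DataFrame({
    "S_Id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "value": list(range(10)),
})

# One shuffle split that places roughly 20% of the groups in the test set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["S_Id"]))

train_df = df.iloc[train_idx]
test_df = df.iloc[test_idx]

# No S_Id value appears on both sides of the split.
print(set(train_df["S_Id"]) & set(test_df["S_Id"]))
```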
2
votes
1 answer

How is train_test_split with test_size=0 affecting the data?

I was using train_test_split in my code and then wanted to change it to cross validation, but something strange is happening. train, test = train_test_split(data, test_size=0) x_train = train.drop('CRO', axis=1) y_train = train['CRO'] scaler =…
2
votes
3 answers

Train test split based on a column values - sequentially

I have a data frame as below df = pd.DataFrame({"Col1": ['A','B','B','A','B','B','A','B','A', 'A'], "Col2" : [-2.21,-9.59,0.16,1.29,-31.92,-24.48,15.23,34.58,24.33,-3.32], "Col3" :…
Shijith
  • 4,602
  • 2
  • 20
  • 34
2
votes
1 answer

PySpark randomSplit vs SkLearn Train Test Split - Random Seed Question

Let's say I have a pandas dataframe and apply sklearn.model_selection.train_test_split with the random_seed parameter set to 1. Let's say I then take the exact same pandas dataframe and create a Spark Dataframe with an instance of SQLContext. If I…
Odisseo
  • 747
  • 1
  • 13
  • 32
2
votes
2 answers

Order between using validation, training and test sets

I am trying to understand the process of model evaluation and validation in machine learning. Specifically, in which order and how the training, validation and test sets must be used. Let's say I have a dataset and I want to use linear regression.…
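
One common ordering, sketched under assumptions: hold out the test set first, carve a validation set out of the remainder, fit on the training part, compare models on the validation part, and touch the test set only once at the end. The synthetic data and the 60/20/20 proportions are illustrative choices, not taken from the question.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Step 1: set aside 20% of all samples as the final test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 2: take 25% of the remaining 80 samples as validation (20 samples).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

# Step 3: fit on the training set and check R^2 on the validation set.
model = LinearRegression().fit(X_train, y_train)
val_r2 = model.score(X_val, y_val)

# Step 4: report the test score once, after model choices are final.
test_r2 = model.score(X_test, y_test)
print(len(X_train), len(X_val), len(X_test))
```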
2
votes
1 answer

Splitting dataset for training and testing row wise

I want to split my dataset into training and test datasets based on years. The idea is to put the rows with years ranging from 2009-2017 in train dataset and the 2018 data in test dataset. Splitting the datasets was easy for the most part but my…
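
A minimal sketch of a year-based split with pandas boolean masks, assuming a hypothetical `year` column. The column names and values are invented, and the problem the truncated excerpt goes on to describe is not addressed here.

```python
import pandas as pd

# Invented example data with a hypothetical "year" column.
df = pd.DataFrame({
    "year": [2009, 2012, 2015, 2017, 2018, 2018],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Series.between is inclusive on both ends by default: 2009 through 2017.
train_df = df[df["year"].between(2009, 2017)]

# Everything from 2018 becomes the test set.
test_df = df[df["year"] == 2018]

print(len(train_df), len(test_df))
```

Unlike `train_test_split`, this split is deterministic and keeps the chronological order, which matters when later years must not leak into training.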
2
votes
1 answer

how to correct ImportError: cannot import name 'murmurhash3_32'

I installed the scikit-learn library in Python using the command pip install -U scikit-learn When I try to import the library or one of its modules, like from sklearn.model_selection import train_test_split or simply import sklearn, I am getting the…
Aklank Jain
  • 1,002
  • 1
  • 13
  • 21