-1

I can apply scikit-learn's train_test_split function only to two dataframes, one with training data and one with target data. How do I split a single dataframe that includes the target column into training and testing dataframes in a 0.75 proportion? I don't want to just take the first 75% of rows; I want a random selection as in train_test_split, with no row appearing in both the train and test data.

sfjac
french_fries

2 Answers

2

This splits your dataframe into train and test sets in the proportion you specify:

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame(
    {'numbers': [1, 2, 3, 4, 5],
     'colors': ['red', 'white', 'blue', 'green', 'black']},
    columns=['numbers', 'colors'])

# Rows are shuffled before the split, and each row ends up in exactly one part.
training_dataset, test_dataset = train_test_split(df, train_size=0.75)
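
To check that the split is disjoint, one can compare the row labels of the two parts. This is a minimal sketch on a made-up frame (the column names are illustrative, not from the question); random_state is passed only so the shuffle is reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative data, not from the question.
df = pd.DataFrame({'feature': range(20), 'target': [i % 2 for i in range(20)]})

# random_state fixes the shuffle so repeated runs give the same split.
train_df, test_df = train_test_split(df, train_size=0.75, random_state=0)

# The original row labels are preserved, so overlap can be checked directly.
shared_rows = train_df.index.intersection(test_df.index)
print(len(train_df), len(test_df), len(shared_rows))
```

An empty intersection means no row was placed in both parts.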
Irfaan
1

train_test_split accepts a variable number of arrays as positional arguments, and passing just one is fine.

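import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import datasets

iris = datasets.load_iris()
# Build column names such as 'sepal_length' and append the target column.
cols = [f.replace(' (cm)', '').replace(' ', '_') for f in iris.feature_names] + ['target']
df = pd.DataFrame(np.c_[iris['data'], iris['target']], columns=cols)

df_train, df_test = train_test_split(df, train_size=0.75)

# Fraction of rows that ended up in the training set (close to 0.75).
print(len(df_train) / len(iris.data))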

If more than one dataframe or array is passed, they must all have the same length, and each is split in the same way. You can therefore work with a single dataframe or with separate arrays/lists per column or feature. Passing several inputs is often used to keep the labels in a separate container.
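
As a rough illustration of the multi-input case, the sketch below splits a feature frame and a label series together. The data and column names are assumptions made up for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative data; the column names are assumptions.
df = pd.DataFrame({'x1': range(8), 'x2': range(8, 16), 'target': [0, 1] * 4})

X = df.drop(columns='target')  # feature columns
y = df['target']               # label column, same length as X

# Both inputs are split using the same shuffled row order.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=0)

# Row labels line up pairwise, so each feature row keeps its own label.
print(X_train.index.equals(y_train.index), X_test.index.equals(y_test.index))
```

Whether to keep the label inside the dataframe or alongside it is mostly a matter of what the downstream code expects.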

sfjac