I know how to apply scikit-learn's train_test_split when I already have two dataframes, one with the features and one with the target. But how do I split a single dataframe that includes the target column into a training dataframe and a testing dataframe with a proportion of 0.75? I don't want to just take the first n rows (75% of all rows); I want the rows to be chosen at random, as train_test_split does, with no row appearing in both the train and test sets.
Please include an example of what code you have tried. – Danny Varod Dec 22 '20 at 00:51
2 Answers
This should split your dataframe into train and test sets in whatever proportion you specify:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'numbers': [1, 2, 3, 4, 5],
                   'colors': ['red', 'white', 'blue', 'green', 'black']},
                  columns=['numbers', 'colors'])

training_dataset, test_dataset = train_test_split(df, train_size=0.75)
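If you then need the features and the label in separate objects, you can split the target column off after the fact. A minimal sketch, assuming 'colors' is the target column (that choice is just for illustration):

# Continuing from the split above; 'colors' is assumed to be the target column.
X_train = training_dataset.drop(columns=['colors'])
y_train = training_dataset['colors']
X_test = test_dataset.drop(columns=['colors'])
y_test = test_dataset['colors']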

train_test_split takes a sequence of arrays as its positional arguments, and that sequence can consist of just one item, such as your dataframe.
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
cols = [f.replace(' (cm)', '').replace(' ', '_') for f in iris.feature_names] + ['target']
df = pd.DataFrame(np.c_[iris['data'], iris['target']], columns=cols)

df_train, df_test = train_test_split(df, train_size=0.75)
print(len(df_train) / len(iris.data))
If more than one dataframe/array is passed, they all have to be the same length, and each is split in the same fashion, so you have the flexibility to do this with a single dataframe or with separate arrays/lists per column/feature. This is often used to keep the labels in a separate container from the features.
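For example, here is a sketch of the more common pattern of passing the features and the labels as two separate arguments, so both are split with the same random row selection (the variable names are just illustrative):

from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target   # features and labels kept in separate containers

# Both arrays are shuffled and split with the same row selection,
# so X_train[i] still corresponds to y_train[i].
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75)
print(X_train.shape, y_train.shape)   # roughly 75% of the rows end up in the training split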