
I have more than 2000 data sets for an ANN and have applied MLPRegressor to them. My code works fine, but I want to fix the test set. For instance, if I have 50 data sets, I want to test on the first 20 values. How do I do this in the code? I have used the following code.

import pandas as pd

import matplotlib.pyplot as plt

from sklearn.neural_network import MLPRegressor

df = pd.read_csv("0.5-1.csv")
df.head()

X = df[['wavelength', 'phase velocity']]
y = df['shear wave velocity']

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)

from sklearn.metrics import mean_absolute_error

mlp = MLPRegressor(hidden_layer_sizes=(30,30,30))

mlp.fit(X_train,y_train)
Simba
Sadia Mitu
2 Answers


If you want this for reproducible results, you can pass train_test_split a fixed random seed (the random_state parameter) so that the same train/test samples are used on every run. The benefit of using train_test_split is that it chooses the train/test split cleanly with no further effort.
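For example, a minimal sketch with toy data standing in for the question's DataFrame (the shapes are illustrative, not taken from the original CSV):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data: 50 samples, 2 features
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# random_state pins the shuffle, so every run produces the same split;
# test_size=20 asks for exactly 20 test samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=20, random_state=42)

# repeating the call with the same seed reproduces the split exactly
_, X_test2, _, _ = train_test_split(X, y, test_size=20, random_state=42)
print((X_test == X_test2).all())  # True
```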

But if you insist on manually choosing the train/test split as you said, you can do it this way:

X_test, y_test = X[:20], y[:20]  # first 20 samples for test
X_train, y_train = X[20:], y[20:]  # rest of samples for train
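Since X and y in the question are pandas objects, positional indexing with .iloc is the explicit equivalent. A sketch, using made-up values under the question's column names:

```python
import pandas as pd

# stand-in for the question's CSV: 50 rows with the same column names
df = pd.DataFrame({"wavelength": range(50),
                   "phase velocity": range(50, 100),
                   "shear wave velocity": range(100, 150)})
X = df[["wavelength", "phase velocity"]]
y = df["shear wave velocity"]

# positional slicing: first 20 rows for test, the remaining 30 for training
X_test, y_test = X.iloc[:20], y.iloc[:20]
X_train, y_train = X.iloc[20:], y.iloc[20:]
print(len(X_test), len(X_train))  # 20 30
```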
Mehraban
  • Can you suggest a better way to get good results, other than manually choosing the train/test split? – Sadia Mitu Sep 15 '19 at 13:55
  • @SadiaMitu what do you mean by better data? – Mehraban Sep 15 '19 at 13:56
  • First of all, I suggest you use train/validation/test splits and, while training, keep an eye on the validation loss alongside the training loss. This way you can tell whether the model has overfitted the training data. For the data split, the best you can do is ensure the validation/test splits are representative of the entire dataset. The size of these splits depends on the size of the dataset. – Mehraban Sep 15 '19 at 14:08

Fix the random seed for NumPy, as 48 or something else:

import numpy as np
np.random.seed(48)

This will generate identical splits every time, and use test_size to fix the size of the split.
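A minimal sketch of this approach, assuming train_test_split is called without a random_state (in which case it draws from NumPy's global RNG; the data here is made up). Passing random_state directly, as the other answer suggests, is the more explicit option:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data: 25 samples, 2 features
X = np.arange(50).reshape(25, 2)
y = np.arange(25)

# seeding the global RNG before each call makes the default
# (random_state=None) split repeatable
np.random.seed(48)
_, X_test_a, _, _ = train_test_split(X, y, test_size=0.2)

np.random.seed(48)
_, X_test_b, _, _ = train_test_split(X, y, test_size=0.2)

print((X_test_a == X_test_b).all())  # True
```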

Alex Ferguson