How do I run test data through my Python Machine Learning Model?

Question

So I have finally completed my first machine learning model in Python. Initially I take a data set and split it like such:

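# Split-out validation dataset
# (dataset is assumed to be a pandas DataFrame loaded earlier)
from sklearn import model_selection

array = dataset.values
X = array[:,2:242]  # feature columns
Y = array[:,1]      # label column
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)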

And so you can see I'm going to use 20% of my data to validate with. But once the model is built, I would like to validate/test it with data that it has never touched before. Do I simply make the same X,Y arrays and make the validation_size = 1? I'm stuck on how to test it without retraining it.

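from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

scoring = 'accuracy'  # assumed metric; it was not defined in the original snippet
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
#models.append(('SVM', SVC()))
# evaluate each model in turn with 12-fold cross-validation on the training split
results = []
names = []
for name, model in models:
    # random_state only takes effect when shuffle=True (newer scikit-learn raises an error otherwise)
    kfold = model_selection.KFold(n_splits=12, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)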


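from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# fit on the training split only, then evaluate on the held-out validation split
lr = LogisticRegression()
lr.fit(X_train, Y_train)
predictions = lr.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))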

I can run data through the model, and return a prediction, but how do I test this on 'new' historical data?

I can do something like this to predict: lr.predict([[5.7,...,2.5]])

but I'm not sure how to pass a whole test data set through and get a confusion_matrix / classification_report.

score 2 · Accepted Answer · answered Sep 21 '17 at 22:19

[question]: I can run data through the model, and return a prediction, but how do I test this on 'new' historical data?

If you check out my project at the link below, you can see how I have trained and tested my data. I personally would never use all of my data for testing. https://github.com/wendysegura/Portland_Forecasting/blob/master/CSV_Police_Files/Random%20Forest%202012-2016.ipynb

General form for sklearn model classes and methods:

model = base_models.AnySKLearnObject()
- create an instance of an estimator class (AnySKLearnObject is a placeholder for any estimator, e.g. LogisticRegression)
model.fit(train_X, train_y)
- train your model; also called "fitting your data"
model.score(train_X, train_y)
- score your model on the training data using the estimator's default scoring method (using the metrics module instead is recommended going forward)
model.predict(test_X)
- predict labels for your test data
model.score(test_X, test_y)
- score your model on your test data
model.predict(new_X)
- make predictions for a new set of data
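As a rough sketch of the steps above, the following shows how a fitted model can be evaluated on held-out rows without retraining, including the confusion_matrix and classification_report the question asks about. The synthetic data from make_classification and the variable names are assumptions for illustration only; with real data, test_X and test_y would come from a labeled file the model never saw during fitting.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the asker's dataset (an assumption, not the real data).
X, y = make_classification(n_samples=500, n_features=20, random_state=7)

# Hold out 20% up front; the model never sees these rows during fitting.
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.20, random_state=7
)

model = LogisticRegression(max_iter=1000)
model.fit(train_X, train_y)

# No retraining here: predict on the held-out rows and compare with their labels.
test_pred = model.predict(test_X)
test_accuracy = accuracy_score(test_y, test_pred)
test_confusion = confusion_matrix(test_y, test_pred)

print(test_accuracy)
print(test_confusion)
print(classification_report(test_y, test_pred))
```

For a classifier, model.score(test_X, test_y) returns the same mean accuracy as accuracy_score here; the metrics functions are useful when you want more than a single number.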

Thank you for the example! I really enjoyed reviewing it. It looks like I just need to create another test set and run it as model.score(test_X, test_y). — user3486773, Sep 21 '17 at 22:31

score 0 · Answer 2 · answered Sep 21 '17 at 21:24

But once the model is built, I would like to validate/test it with data that it has never touched before.

The reason you split data into train and test (validation) sets is so that you can run the model on data that was not part of the training set. Your model should therefore never use the test set for learning, or touch it in any way during training.

Sometimes, if you want to compare against a second held-out set, you can extract two test sets (using the same method), for example a (50%, 25%, 25%) or (70%, 15%, 15%) split, depending on the distribution of your data.
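One possible way to get a (70%, 15%, 15%) split is to call train_test_split twice: first carve off 30%, then cut that part in half. This is only a sketch; the toy arrays, the stratify arguments, and the variable names are assumptions added for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumption): 1000 rows, 2 features, balanced binary labels.
X = np.arange(2000).reshape(1000, 2)
y = np.arange(1000) % 2

# Step 1: keep 70% for training, set aside the remaining 30%.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, random_state=7, stratify=y
)

# Step 2: split the 30% evenly into a validation set and a final test set.
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, random_state=7, stratify=y_rest
)

print(len(X_train), len(X_val), len(X_test))
```

Stratifying keeps the class proportions similar across the three parts, which matters most when the classes are imbalanced.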

I can run data through the model, and return a prediction, but how do I test this on 'new' historical data?

You use the predict method. But when you have truly "new" data, you don't have labels to validate against, because the correct answers for new data are not yet known. This is why machine learning relies on probabilities, accuracy and other metrics measured on held-out data: they estimate how well the model will work on "new" data.
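To illustrate the point about unlabeled data, here is a small sketch in which the last few rows are treated as "new" data with no known labels. The synthetic dataset and the names X_known, X_new, new_labels and new_probs are assumptions for illustration. All the model can return for such rows is a predicted label and, for estimators that support predict_proba, a class probability.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption); pretend the last 5 rows arrive later without labels.
X, y = make_classification(n_samples=300, n_features=10, random_state=7)
X_known, y_known, X_new = X[:-5], y[:-5], X[-5:]

model = LogisticRegression(max_iter=1000).fit(X_known, y_known)

# Predicted class for each new row, plus the estimated probability of each class.
new_labels = model.predict(X_new)
new_probs = model.predict_proba(X_new)

print(new_labels)
print(new_probs)
```

Once the true labels for such rows become available, they can be passed to the same metrics functions used for the validation set.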
