
I am new to machine learning and am trying to run a simple classification model, which I trained and saved using pickle, on another dataset of the same format. I have the following Python code.

Code

import numpy as np
import pandas as pd
from termcolor import colored
from sklearn import model_selection

# Training set
features = pd.read_csv('../Data/Train_sop_Computed.csv')
# Testing set
testFeatures = pd.read_csv('../Data/Test_sop_Computed.csv')

print(colored('\nThe shape of our features is:','green'), features.shape)
print(colored('\nThe shape of our Test features is:','green'), testFeatures.shape)

features = pd.get_dummies(features)
testFeatures = pd.get_dummies(testFeatures)

# Quick sanity check on the dummy-encoded columns
features.iloc[:, 5:].head(5)
testFeatures.iloc[:, 5:].head(5)

labels = np.array(features['Truth'])
testlabels = np.array(testFeatures['Truth'])

features = features.drop('Truth', axis = 1)
testFeatures = testFeatures.drop('Truth', axis = 1)

feature_list = list(features.columns)
testFeature_list = list(testFeatures.columns)

def add_missing_dummy_columns(d, columns):
    missing_cols = set(columns) - set(d.columns)
    for c in missing_cols:
        d[c] = 0


def fix_columns(d, columns):
    add_missing_dummy_columns(d, columns)

    # make sure we have all the columns we need
    assert (set(columns) - set(d.columns) == set())

    extra_cols = set(d.columns) - set(columns)
    if extra_cols: print("extra columns:", extra_cols)

    d = d[columns]
    return d


testFeatures = fix_columns(testFeatures, features.columns)

features = np.array(features)
testFeatures = np.array(testFeatures)

train_samples = 100

X_train, X_test, y_train, y_test = model_selection.train_test_split(features, labels, test_size = 0.25, random_state = 42)
testX_train, testX_test, testy_train, testy_test = model_selection.train_test_split(testFeatures, testlabels, test_size = 0.25, random_state = 42)

print(colored('\n        TRAINING SET','yellow'))
print(colored('\nTraining Features Shape:','magenta'), X_train.shape)
print(colored('Testing Features Shape:','magenta'), X_test.shape)
print(colored('Training Labels Shape:','magenta'), y_train.shape)
print(colored('Testing Labels Shape:','magenta'), y_test.shape)

print(colored('\n        TESTING SET','yellow'))
print(colored('\nTraining Features Shape:','magenta'), testX_train.shape)
print(colored('Testing Features Shape:','magenta'), testX_test.shape)
print(colored('Training Labels Shape:','magenta'), testy_train.shape)
print(colored('Testing Labels Shape:','magenta'), testy_test.shape)

from sklearn.metrics import precision_recall_fscore_support

import pickle

loaded_model_RFC = pickle.load(open('../other/SOPmodel_RFC', 'rb'))
result_RFC = loaded_model_RFC.score(testX_test, testy_test)
print(colored('Random Forest Classifier: ','magenta'), result_RFC)

loaded_model_SVC = pickle.load(open('../other/SOPmodel_SVC', 'rb'))
result_SVC = loaded_model_SVC.score(testX_test, testy_test)
print(colored('Support Vector Classifier: ','magenta'), result_SVC)

loaded_model_GPC = pickle.load(open('../other/SOPmodel_Gaussian', 'rb'))
result_GPC = loaded_model_GPC.score(testX_test, testy_test)
print(colored('Gaussian Process Classifier: ','magenta'), result_GPC)

loaded_model_SGD = pickle.load(open('../other/SOPmodel_SGD', 'rb'))
result_SGD = loaded_model_SGD.score(testX_test, testy_test)
print(colored('Stochastic Gradient Descent: ','magenta'), result_SGD)

I am able to get the results for the test set.

The problem I am facing is that I need to run the models on the entire Test_sop_Computed.csv dataset, but they are currently only being run on the 25% portion produced by the split. I would sincerely appreciate any suggestions on how I can run the loaded models on the entire dataset. I know that I'm going wrong with the following line of code.

testX_train, testX_test, testy_train, testy_test = model_selection.train_test_split(testFeatures, testlabels, test_size = 0.25, random_state = 42)

Both the train and test datasets have the Subject, Predicate, Object, Computed and Truth columns as features, with Truth being the class to predict. The test dataset has the actual values for this Truth column; I drop it using `testFeatures = testFeatures.drop('Truth', axis = 1)` and intend to use the various loaded classifiers to predict this Truth as 0 or 1 for the entire dataset and then get the predictions as an array.

This is what I have done so far, but I think I am splitting my test dataset as well. Is there a way to pass the entire test dataset to the loaded models even though it is in another file?

This test dataset is in the same format as the training set. I have checked the shape of the two and I get the following.

Confirming the Features and Shape

The shape of our features is: (1860, 5)
The shape of our Test features is: (1386, 5)

        TRAINING SET

Training Features Shape: (1395, 1045)
Testing Features Shape: (465, 1045)
Training Labels Shape: (1395,)
Testing Labels Shape: (465,)

        TESTING SET

Training Features Shape: (1039, 1045)
Testing Features Shape: (347, 1045)
Training Labels Shape: (1039,)
Testing Labels Shape: (347,)

Any suggestions in this regard will be highly appreciated.

  • Since the question does not involve `tensorflow`, kindly avoid spamming the tag (removed) - `scikit-learn` is much more appropriate (added). – desertnaut Dec 12 '18 at 10:01
  • I am highly confused about your data set and the way you treat it. How does your training set contain both trainX and testX? What is `testX_train` supposed to mean? – offeltoffel Dec 12 '18 at 10:08
  • @offeltoffel, The `test` string was used to identify the subsets of the split that I unnecessarily did with the testing set. This is what I needed to be clarified and now it works. Thanks for getting back to me in response to my doubt. – Nayantara Jeyaraj Dec 12 '18 at 10:14

1 Answer


Your question is a bit unclear, but as I understand it, you want to run your model on testX_train as well as on testX_test (which are just testFeatures split into two subsets).

So either you can run your model on testX_train the same way you did for testX_test, e.g.:

result_RFC_train = loaded_model_RFC.score(testX_train, testy_train)

or you can just remove the following line:

testX_train, testX_test, testy_train, testy_test = model_selection.train_test_split(testFeatures, testlabels, test_size = 0.25, random_state = 42)

so that you don't split your data at all and run the model on the full dataset:

result_RFC_full = loaded_model_RFC.score(testFeatures, testlabels)
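If what you ultimately need are the predicted classes rather than just a score, a minimal sketch (assuming the loaded_model_RFC and the un-split testFeatures array from the question) would be:

predicted_y_RFC = loaded_model_RFC.predict(testFeatures)              # one 0/1 label per row of the full test set
predicted_probas_RFC = loaded_model_RFC.predict_proba(testFeatures)   # class probabilities, shape (n_rows, 2)
print(predicted_y_RFC[:10])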

  • Thanks Alexandre. This worked for me. What I'd like to know in addition is how to get the predicted values, i.e. using predict and getting the model's predictions as an array/list? – Nayantara Jeyaraj Dec 12 '18 at 10:11
  • Don't hesitate to ask another question for that one to get a fully detailed answer, but I'll answer here. Depending on your model you'll get either values or probabilities: `predicted_y_RFC = loaded_model_RFC.predict(testFeatures)` `predicted_probas_y_RFC = loaded_model_RFC.predict_proba(testFeatures)` – LaSul Dec 12 '18 at 10:14
  • Thanks a million Alexandre. This is exactly what I was looking for. – Nayantara Jeyaraj Dec 12 '18 at 10:18
  • You're welcome! Don't hesitate to take a look at the sklearn docs: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html They explain loading the model, fitting, predicting, etc. :) If you search a bit on the sklearn website, you will also find how to save a model, plot data/predictions, score models, etc. – LaSul Dec 12 '18 at 10:26
  • Thanks for your well-elucidated explanation. I really appreciate it. – Nayantara Jeyaraj Dec 13 '18 at 03:55