I've built a binary classification model that predicts whether an article belongs to the positive or the negative class. I feed TF-IDF vectors, along with one additional feature (day_of_year), into an XGBoost classifier. I get an AUC very close to 1 both on the train/test split and in cross-validation, but only 0.5 on my holdout data. That seemed odd, so I fed the very same training data back into the saved model, and even that returns an AUC of 0.5. The code below takes a dataframe, fits the TF-IDF vectorizer, transforms the text, and formats everything into a DMatrix.
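import pickle

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# tfidf is a TfidfVectorizer instance created elsewhere (not shown here)
def format_to_dmatrix(known_targets):
    y = known_targets['target']
    X = known_targets[['body', 'day_of_year']]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.1, random_state=42)
    # Fit on the training text only, then save the learned vocabulary
    tfidf.fit(X_train['body'])
    pickle.dump(tfidf.vocabulary_, open("tfidf_features.pkl", "wb"))
    X_train_enc = tfidf.transform(X_train['body']).toarray()
    X_test_enc = tfidf.transform(X_test['body']).toarray()
    new_cols = tfidf.get_feature_names()
    new_cols.append('day_of_year')
    # Append day_of_year as one extra column after the TF-IDF columns
    a = np.array(X_train['day_of_year'])
    a = a.reshape(a.shape[0], 1)
    b = np.array(X_test['day_of_year'])
    b = b.reshape(b.shape[0], 1)
    X_train = np.append(X_train_enc, a, axis=1)
    X_test = np.append(X_test_enc, b, axis=1)
    dtrain = xgb.DMatrix(X_train, label=y_train.values, feature_names=new_cols)
    dtest = xgb.DMatrix(X_test, label=y_test.values, feature_names=new_cols)
    return dtrain, dtest, tfidf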
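The function above pickles only tfidf.vocabulary_, which is the term-to-column mapping and does not include the learned IDF weights. As a point of comparison, here is a minimal sketch of pickling the whole fitted vectorizer instead, so that transform can be called later without refitting. The documents and variable names are made up for illustration and are not from my real data.

```python
import pickle

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, purely illustrative
train_docs = ["stocks rise on earnings", "earnings fall short", "weather is mild today"]
new_docs = ["earnings rise today"]

tfidf = TfidfVectorizer(stop_words="english")
tfidf.fit(train_docs)

# Serialize the whole fitted object (vocabulary and IDF weights together)
blob = pickle.dumps(tfidf)
restored = pickle.loads(blob)

# transform (not fit_transform) reuses what was learned at training time
original_enc = tfidf.transform(new_docs).toarray()
restored_enc = restored.transform(new_docs).toarray()

same = np.allclose(original_enc, restored_enc)
print(same)
```

With this approach the restored object should encode new text exactly as the original one did, which makes it a useful baseline when comparing against a vectorizer rebuilt from the vocabulary alone.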
I cross-validate and get a test-auc-mean of 0.9979, so I train the final model as shown below and save it.
best_model = xgb.train(
    params,
    dtrain,
    num_boost_round=num_boost_round,
    evals=[(dtest, "Test")]
)
This is my code to load in new data:
def test_newdata(data):
    tf1 = pickle.load(open("tfidf_features.pkl", 'rb'))
    tf1_new = TfidfVectorizer(max_features=1500, lowercase=True, analyzer='word',
                              stop_words='english', ngram_range=(1, 1),
                              vocabulary=tf1.keys())
    encoded_body = tf1_new.fit_transform(data['body']).toarray()
    new_cols = tf1_new.get_feature_names()
    new_cols.append('day_of_year')
    day_of_year = np.array(data['day_of_year'])
    day_of_year = day_of_year.reshape(day_of_year.shape[0], 1)
    formatted_test_data = np.append(encoded_body, day_of_year, axis=1)
    df = pd.DataFrame(formatted_test_data, columns=new_cols)
    return xgb.DMatrix(df)
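One thing that may be worth checking here is whether the rebuilt vectorizer puts its columns in the same order as the one used for training. The sketch below uses a hypothetical saved_vocab dict as a stand-in for the pickled vocabulary (term to column index) and compares a vectorizer built from the keys alone with one built from the full mapping. The terms and documents are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the pickled tfidf.vocabulary_ (term -> column index)
saved_vocab = {"zebra": 2, "apple": 0, "mango": 1}
docs = ["zebra apple", "apple mango", "mango zebra zebra"]

# Built from the keys only: columns are numbered in the keys' iteration order
from_keys = TfidfVectorizer(vocabulary=saved_vocab.keys()).fit(docs)

# Built from the full mapping: the saved column indices are kept
from_mapping = TfidfVectorizer(vocabulary=saved_vocab).fit(docs)

# List the terms of each vectorizer sorted by their column index
keys_order = sorted(from_keys.vocabulary_, key=from_keys.vocabulary_.get)
mapping_order = sorted(from_mapping.vocabulary_, key=from_mapping.vocabulary_.get)

print(keys_order)     # ['zebra', 'apple', 'mango']
print(mapping_order)  # ['apple', 'mango', 'zebra']
```

In this toy case the two column orders differ, so a model trained on one layout would see shuffled features under the other. Whether that is what happens with my real vocabulary is something I have not confirmed. Separately, calling fit_transform on the new data re-learns the IDF weights from that data rather than reusing the training-time weights.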
The code below shows that my AUC score is 0.5 even though I am loading the very same data. Is there an error I've missed somewhere?
loaded_model = xgb.Booster()
loaded_model.load_model("earn_modelv3.model")
holdout = known_targets
formatted_test_data = test_newdata(holdout)
holdout_preds = loaded_model.predict(formatted_test_data)
predictions_binary = np.where(holdout_preds > .5, 1, 0)
print(round(roc_auc_score(holdout['target'], predictions_binary), 4))
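A side note on the scoring step: the AUC above is computed on predictions thresholded at 0.5 rather than on the raw scores returned by predict. The minimal sketch below, with made-up labels and scores, shows how the two can disagree. It is only meant to illustrate the difference and is not necessarily the cause of my 0.5 result.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and scores: the ranking is perfect, but every score is below 0.5
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.10, 0.20, 0.30, 0.40])

# AUC on the raw scores only depends on how the examples are ranked
auc_scores = roc_auc_score(y_true, scores)

# Thresholding turns every prediction into 0, so all examples are tied
binary = np.where(scores > 0.5, 1, 0)
auc_binary = roc_auc_score(y_true, binary)

print(auc_scores)  # 1.0
print(auc_binary)  # 0.5
```

Passing holdout_preds directly to roc_auc_score would therefore be a cleaner way to compare against the cross-validated AUC.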