I am working on a data science project to build a model that predicts whether imports are fake or not. One of my models achieves up to 92-93% accuracy on the training database, but only around 83-85% accuracy on 51% of the test database.

On the other hand, after a few changes, I got around 90% accuracy on the training database but got around 91% on the test database.

Earlier I was worried that the higher training accuracy meant I was overfitting, but now I am worried because the accuracy on my test set is higher than on my train set (which should not be the case). Can someone explain what might be happening here and what steps I should take, given that I can only submit one final model?

P.S. The classifier I used in both cases is the same (KNN with GridSearchCV), but the way I prepared the data before applying the classifier is different.

user202004

1 Answer

It depends.

It could be your train/test split percentage. Data splitting is unreliable unless the total sample size is large. Imagine you're using 99% of the data to train and only 1% to test: with such a tiny test set, the accuracy estimate is highly variable, and it can easily come out higher than the training accuracy purely by chance.
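To see this concretely, re-run a small split with a few different seeds and watch the test score jump around (a rough sketch, assuming X and y are your already-prepared NumPy arrays; plain KNN stands in for your actual model):

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# X, y: your prepared features/labels; a deliberately small test fraction
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.05, stratify=y, random_state=seed)
    knn = KNeighborsClassifier().fit(X_tr, y_tr)
    print('seed=%d  test acc=%.3f' % (seed, accuracy_score(y_te, knn.predict(X_te))))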

Apply 5- or 10-fold cross-validation with different random seeds (try StratifiedShuffleSplit if applicable).
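For example (again assuming X and y are your prepared arrays; StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=seed) can be swapped in as the splitter):

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# repeat 10-fold stratified CV with different seeds to check how stable the score is
for seed in (0, 1, 42):
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    scores = cross_val_score(KNeighborsClassifier(), X, y, cv=cv, scoring='accuracy')
    print('seed=%d  acc=%.3f (+/- %.3f)' % (seed, scores.mean(), scores.std()))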

Try different classifiers depending on your data and its features.
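For example, a rough comparison loop (the candidate models below are only illustrative; use whatever suits your data):

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# X, y: your prepared arrays; candidates chosen only as examples
candidates = {
    'knn': KNeighborsClassifier(),
    'logreg': LogisticRegression(max_iter=1000),
    'random_forest': RandomForestClassifier(random_state=42),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print('%s: %.3f (+/- %.3f)' % (name, scores.mean(), scores.std()))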

Here is a nested cross-validation snippet with model tuning (grid search / random search):

import numpy as np
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, StratifiedKFold, KFold
from sklearn.metrics import accuracy_score

# configure the cross-validation procedure
kf = StratifiedKFold(n_splits = 10 , shuffle = True , random_state = 42)

def tune_hyperparameter(search_method, estimator, search_space):
    
    # enumerate splits
    outer_results = list()
    for train_ix, test_ix in kf.split(X,y):
            
        # split data
        X_train, X_test = X[train_ix, :], X[test_ix, :]
        y_train, y_test = y[train_ix], y[test_ix]

        # configure the cross-validation procedure
        cv_inner = KFold(n_splits=5, shuffle=True, random_state=1)

        if search_method == "grid":
            clf = GridSearchCV(
                estimator=estimator, 
                param_grid=search_space, 
                scoring='accuracy',
                n_jobs=-1, 
                cv=cv_inner, 
                verbose=0,
                refit=True
            )
        elif search_method == "random":           
            clf = RandomizedSearchCV(
                estimator=estimator,
                param_distributions=search_space,
                n_iter=10,
                n_jobs=-1,
                cv=cv_inner,
                verbose=0,
                random_state=1,
                refit=True
            )
            
        # execute grid search
        result = clf.fit(X_train, y_train)

        # get the best performing model fit on the whole training set
        best_model = result.best_estimator_

        # evaluate model on the hold out dataset
        yhat = best_model.predict(X_test)

        # evaluate the model
        acc = accuracy_score(y_test, yhat)

        # store the result
        outer_results.append(acc)

        # report progress
        print('acc=%.3f, est=%.3f, cfg=%s' % (acc, result.best_score_, result.best_params_))
        
    # summarize the estimated performance of the model
    print('Accuracy: %.3f (%.3f)' % (np.mean(outer_results), np.std(outer_results)))

    print("Best",search_method,"Model : ", best_model)
    print("-"*50, '\n\n')
    return best_model

Usage (with SVM)

import scipy.stats
from sklearn import svm

svm_model = svm.SVC(random_state=42)

svm_param_grid = {
    'C': [0.1, 1, 10, 100, 1000], 
    'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
    'kernel': ['linear','rbf', 'poly', 'sigmoid']
}

svm_param_random = {
    "C": scipy.stats.expon(scale=.01),
    "gamma": scipy.stats.expon(scale=.01),
    "kernel": ['linear','rbf', 'poly', 'sigmoid']
}

best_svm_grid_searched_model = tune_hyperparameter('grid', svm_model, svm_param_grid)
best_svm_randomized_searched_model = tune_hyperparameter('random', svm_model, svm_param_random)

If you have a large dataset, keep the test set separate, split the remaining data into train and validation sets, and do your cross-validation and tuning on that portion only.
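A minimal sketch of that workflow (the KNN placeholder and the 80/20 split are illustrative, not your exact pipeline):

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# hold out a stratified test set that is touched only once, at the very end
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# tune/select the model with cross-validation on the training part only
model = KNeighborsClassifier()
print('CV acc: %.3f' % cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy').mean())

# final, single evaluation on the untouched test set
model.fit(X_train, y_train)
print('Test acc: %.3f' % accuracy_score(y_test, model.predict(X_test)))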

For a small dataset, rely on cross-validation; you may still split the whole dataset into train/test.

ahmedshahriar