
I tried using pycaret for a machine learning project and got very high accuracies. When I tried to verify these using my sklearn code I found that I could not get the same numbers. Here is an example where I reproduce this issue on the public poker dataset from pycaret:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from pycaret.classification import *
from pycaret.datasets import get_data

# Load the public poker hand dataset that ships with pycaret
data = get_data('poker')


# 10-fold CV with shuffled folds; session_id=2 seeds the experiment for reproducibility
grid = setup(data=data, target='CLASS', fold_shuffle=True, session_id=2)
dt = create_model('dt')  # decision tree classifier, scored with 10-fold CV


This gives an accuracy of about 57% using 10-fold cross-validation. When I try to reproduce this number with sklearn on the same dataset and the same model, I only get about 49%. Does anyone understand where this difference comes from?

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score

X = data.drop('CLASS', axis=1)
y = data['CLASS']

# dt is the decision tree returned by create_model above
y_pred_cv = cross_val_predict(dt, X, y, cv=10)
accuracy_score(y, y_pred_cv)

0.4911698233964679

pieterbons

3 Answers


I think the difference could be due to how your CV folds are being randomized. Did you set the same seed (2) in sklearn? Is the shuffle parameter of KFold set the same way?
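A minimal sketch of what I mean, reusing the X and y from your question. The seed 2 matches your session_id, though there is no guarantee it produces the exact same folds as PyCaret:

from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Shuffled, seeded folds instead of the default shuffle=False splitter
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=2)
dt = DecisionTreeClassifier(random_state=2)

y_pred_cv = cross_val_predict(dt, X, y, cv=cv)
print(accuracy_score(y, y_pred_cv))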

  • Thanks for your answer! I have tried setting fold_shuffle to False (had to downgrade sklearn first...) but this does not change the outcome significantly. Also, the fact that the score for each fold is so stable suggests that the specific way in which the folds are generated does not matter that much. – pieterbons Jan 21 '22 at 09:53

I had some trouble validating the results from PyCaret myself. I see two options you can try to validate the results:

  1. Is your data correlated or ordered in some way? You are using sklearn.model_selection.cross_val_predict with cv=10. This means that (stratified) k-fold cross-validation is used to generate your folds, and in either case these splitters are instantiated with shuffle=False. If your data is ordered or correlated across rows, this may explain why PyCaret (which shuffles its folds here) reports a higher accuracy than your run. You want to set shuffle=True.
  2. PyCaret by default makes a 70%/30% train/test split. If you use its create_model method, the cross-validation is done on the train set only, whereas your validation uses 100% of the data. This might alter the results a bit, but I doubt it explains the gap that you observe. See the sketch below, which combines both points.
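A rough sketch of both points, reusing X and y from the question. The 70/30 fraction and stratification match PyCaret's defaults as I understand them, but the exact split PyCaret produces may still differ:

from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Mimic PyCaret's default 70%/30% train/test split (stratified on the target)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=2)

# Shuffled 10-fold CV on the train portion only, as create_model does
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=2)
dt = DecisionTreeClassifier(random_state=2)

y_pred_cv = cross_val_predict(dt, X_train, y_train, cv=cv)
print(accuracy_score(y_train, y_pred_cv))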
J. Adan

The parameters could be the same, but did you reproduce all the feature engineering done inside setup()? (feature selection, collinearity removal, normalisation, etc.) One way to check is sketched below.
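A sketch of that check, assuming a PyCaret 2.x setup where get_config('X_train') and get_config('y_train') return the transformed train split; cross-validating on that data instead of the raw frame shows how much of the gap comes from preprocessing:

from pycaret.classification import get_config
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score

# Data exactly as PyCaret sees it after setup()'s transformations
X_train_t = get_config('X_train')
y_train_t = get_config('y_train')

dt = DecisionTreeClassifier(random_state=2)
y_pred_cv = cross_val_predict(dt, X_train_t, y_train_t, cv=10)
print(accuracy_score(y_train_t, y_pred_cv))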

Maxime
    As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Oct 16 '22 at 19:05