
I tried using pycaret for a machine learning project and got very high accuracies. When I tried to verify these using my sklearn code I found that I could not get the same numbers. Here is an example where I reproduce this issue on the public poker dataset from pycaret:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from pycaret.classification import *
from pycaret.datasets import get_data

# Load the public poker hand dataset that ships with pycaret
data = get_data('poker')


# 10-fold CV with shuffled folds; session_id=2 seeds the experiment for reproducibility
grid = setup(data=data, target='CLASS', fold_shuffle=True, session_id=2)
dt = create_model('dt')  # decision tree classifier, scored with 10-fold CV


This gives an accuracy of about 57% using 10-fold cross-validation. When I try to reproduce this number with sklearn on the same dataset and the same model, I only get about 49%. Does anyone understand where this difference comes from?

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score

X = data.drop('CLASS', axis=1)
y = data['CLASS']

# dt is the decision tree returned by create_model above
y_pred_cv = cross_val_predict(dt, X, y, cv=10)
accuracy_score(y, y_pred_cv)

0.4911698233964679

pieterbons

3 Answers


I think the difference could be due to how your CV folds are being randomized. Did you set the same seed (2) in sklearn? Is the shuffle parameter of KFold set the same way?
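A minimal sketch of what I mean, reusing the X and y from your question. The seed 2 matches your session_id, though there is no guarantee it produces the exact same folds as PyCaret:

from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Shuffled, seeded folds instead of the default shuffle=False splitter
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=2)
dt = DecisionTreeClassifier(random_state=2)

y_pred_cv = cross_val_predict(dt, X, y, cv=cv)
print(accuracy_score(y, y_pred_cv))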

  • Thanks for your answer! I have tried setting fold_shuffle to False (had to downgrade sklearn first...) but this does not change the outcome significantly. Also, the fact that the score for each fold is so stable suggests that the specific way in which the folds are generated does not matter that much. – pieterbons Jan 21 '22 at 09:53

I had some trouble validating the results from PyCaret myself. I see two options you can try to validate the results:

  1. Is your data correlated or ordered in some way? You are using sklearn.model_selection.cross_val_predict with cv=10. This means that (stratified) k-fold cross-validation is used to generate your folds, and in either case these splitters are instantiated with shuffle=False. If your data is ordered or correlated across rows, this may explain why PyCaret (which shuffles its folds here) reports a higher accuracy than your run. You want to set shuffle=True.
  2. PyCaret by default makes a 70%/30% train/test split. If you use its create_model method, the cross-validation is done on the train set only, whereas your validation uses 100% of the data. This might alter the results a bit, but I doubt it explains the gap that you observe. See the sketch below, which combines both points.
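A rough sketch of both points, reusing X and y from the question. The 70/30 fraction and stratification match PyCaret's defaults as I understand them, but the exact split PyCaret produces may still differ:

from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Mimic PyCaret's default 70%/30% train/test split (stratified on the target)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=2)

# Shuffled 10-fold CV on the train portion only, as create_model does
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=2)
dt = DecisionTreeClassifier(random_state=2)

y_pred_cv = cross_val_predict(dt, X_train, y_train, cv=cv)
print(accuracy_score(y_train, y_pred_cv))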
J. Adan

The parameters could be the same, but did you reproduce all the feature engineering done inside setup()? (feature selection, collinearity removal, normalisation, etc.) One way to check is sketched below.
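A sketch of that check, assuming a PyCaret 2.x setup where get_config('X_train') and get_config('y_train') return the transformed train split; cross-validating on that data instead of the raw frame shows how much of the gap comes from preprocessing:

from pycaret.classification import get_config
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score

# Data exactly as PyCaret sees it after setup()'s transformations
X_train_t = get_config('X_train')
y_train_t = get_config('y_train')

dt = DecisionTreeClassifier(random_state=2)
y_pred_cv = cross_val_predict(dt, X_train_t, y_train_t, cv=10)
print(accuracy_score(y_train_t, y_pred_cv))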

Maxime
    As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Oct 16 '22 at 19:05