This is my minimal reproducible example:

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_validate

x = np.array([
   [1, 2],
   [3, 4],
   [5, 6],
   [6, 7]
])  
y = [1, 0, 0, 1]

model = GaussianNB()
scores = cross_validate(model, x, y, cv=2, scoring="accuracy")

model.predict([[8, 9]])  # note: predict expects a 2D array; this is the line that fails

What I intended to do was instantiate a Gaussian Naive Bayes classifier and use sklearn.model_selection.cross_validate to cross-validate my model (I am using cross_validate instead of cross_val_score since in my real project I need precision, recall and f1 as well).
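
For illustration, my real call would look roughly like this (a sketch, continuing from the snippet above; passing a tuple of scorer names gives one test_<metric> entry per metric in the returned dict):

scores = cross_validate(model, x, y, cv=2, scoring=("precision", "recall", "f1"))
# scores then contains 'test_precision', 'test_recall' and 'test_f1' arrays,
# with one value per fold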

I have read in the docs that cross_validate will "evaluate metric(s) by cross-validation and also record fit/score times."

I expected that my model would have been fitted on the x (features) and y (labels) data, but when I invoke model.predict(...) I get:

sklearn.exceptions.NotFittedError: This GaussianNB instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

Of course, it tells me to invoke model.fit(x, y) before "using the estimator" (that is, before invoking model.predict(...)).

Shouldn't the model have been fitted cv=2 times when I invoke cross_validate(...)?


1 Answer

A close look at the cross_validate documentation reveals that it includes the following argument:

return_estimator : bool, default=False

Whether to return the estimators fitted on each split.

So, by default it will not return any fitted estimator (hence it cannot be used to predict).

In order to predict with the fitted estimator(s), you need to set the argument to True; but beware, you will not get a single fitted model, but a number of models equal to your cv parameter value (here 2):

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_validate

x = np.array([
   [1, 2],
   [3, 4],
   [5, 6],
   [6, 7]
])  
y = [1, 0, 0, 1]

model = GaussianNB()
scores = cross_validate(model, x, y, cv=2, scoring="accuracy", return_estimator=True)
scores
# result:
{'fit_time': array([0.00124454, 0.00095725]),
 'score_time': array([0.00090432, 0.00054836]),
 'estimator': [GaussianNB(), GaussianNB()],
 'test_score': array([0.5, 0.5])}

So, in order to get predictions from each fitted model, you need:

scores['estimator'][0].predict([[8,9]])
# array([1])

scores['estimator'][1].predict([[8,9]])
# array([0])

This may look inconvenient, but it is like that by design: cross_validate is generally meant only to return the scores necessary for diagnosis and assessment, not to fit models that will subsequently be used for predictions.
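
If what you actually want is a single model for making predictions, the usual pattern (also discussed in the comments below) is to use cross_validate purely for assessment and then fit a fresh estimator on all of your data; a minimal sketch, continuing from the snippet above:

# assess performance via cross-validation ...
scores = cross_validate(GaussianNB(), x, y, cv=2, scoring="accuracy")

# ... then fit a final model on the whole dataset for actual predictions
final_model = GaussianNB().fit(x, y)
final_model.predict([[8, 9]])
# array([1]) with this toy data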

  • @tail no; as said, this is by design: with `cross_validate`, you can only get back as many models as your `cv` folds. There are other CV tools available in scikit-learn, like [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV), that do return a final fitted model. – desertnaut Mar 12 '23 at 19:57
  • Thanks for your feedback. Is there any chance I can get one (1) model instead of k models? If not, which estimator am I supposed to choose for predicting labels for unseen data? – tail Mar 12 '23 at 19:59
  • @tail I have already answered that above (please refrain from deleting comments that have been already answered). You **don't want** to use **any** of the estimators returned by `cross_validate` for predictions, since by definition they are not trained on the whole of your data; as said, `cross_validate` is not meant for such usage. – desertnaut Mar 12 '23 at 20:01
  • May I use GridSearchCV for having one (1) "final" estimator, then? – tail Mar 12 '23 at 20:02
  • @tail `GridSearchCV`, used with the default argument `refit=True`, will indeed return a single model fitted with all the data. Please check the docs (link in my 1st comment above). – desertnaut Mar 12 '23 at 20:04
  • `GridSearchCV` does require a `param_grid` argument, though – tail Mar 12 '23 at 20:19
  • @tail please notice that the answer addressed fully your original question on why you get an error with your approach; it has nothing to do with GridSearchCV or any other issues not present in your original post. If you have a new question, please open a new post - we cannot guess what exactly you are trying to do from such fragmented conversation in the comments (1/2) – desertnaut Mar 12 '23 at 20:43
  • In any case, especially for `GaussianNB` (which has actually **no** tunable parameters), it does not make any sense to use `GridSearchCV` - you just fit the model with all your data and you are fine, after you have used `cross_validate` in order to get a performance assessment. (2/2) – desertnaut Mar 12 '23 at 20:44
  • Actually `GaussianNB` has `var_smoothing` (float, default=1e-9): "Portion of the largest variance of all features that is added to variances for calculation stability." Can't I use grid search CV for tuning that hyperparam? – tail Mar 12 '23 at 21:06
  • @tail you sure can, at least in theory; not sure how relevant it is in practice though - but again I have almost zero experience with GaussianNB. The experiment is arguably your best friend here. – desertnaut Mar 12 '23 at 21:12
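
For completeness, here is a minimal sketch of the GridSearchCV approach discussed in the comments above; the var_smoothing grid is purely illustrative, and with the default refit=True, best_estimator_ is a single model refitted on all of the data:

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV

x = np.array([[1, 2], [3, 4], [5, 6], [6, 7]])
y = [1, 0, 0, 1]

# tune var_smoothing over an illustrative log-spaced grid;
# refit=True (the default) refits the best model on all of the data
search = GridSearchCV(
    GaussianNB(),
    param_grid={"var_smoothing": np.logspace(-9, -3, 7)},
    cv=2,
    scoring="accuracy",
)
search.fit(x, y)

search.best_estimator_.predict([[8, 9]])  # a single fitted model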