
I'm working on a text classification problem, which I've set up like so (I've left out the data processing steps for concision, but they'll produce a dataframe called data with columns X and y):

import sklearn.model_selection as ms
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

sim = Pipeline([("vec", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
                ("rdf", RandomForestClassifier())])

Now I try to validate this model by training it on 2/3 of the data and scoring it on the remaining 1/3, like so:

train, test = ms.train_test_split(data, test_size = 0.33)
sim.fit(train.X, train.y)
sim.score(test.X, test.y)
# 0.533333333333

I want to do this three times for three different test sets, but using cross_val_score gives me results that are much lower.

ms.cross_val_score(sim, data.X, data.y)
# [ 0.29264069  0.36729223  0.22977941]

As far as I know, each of the scores in that array should be produced by training on 2/3 of the data and scoring on the remaining 1/3 with the sim.score method. So why are they all so much lower?


1 Answer


I solved this problem in the process of writing my question, so here goes:

The default behavior for cross_val_score is to use KFold or StratifiedKFold to define the folds. By default, both have argument shuffle=False, so the folds are not pulled randomly from the data:

import numpy as np
import sklearn.model_selection as ms

for i, j in ms.KFold(n_splits=3).split(np.arange(9)):
    print("TRAIN:", i, "TEST:", j)
# TRAIN: [3 4 5 6 7 8] TEST: [0 1 2]
# TRAIN: [0 1 2 6 7 8] TEST: [3 4 5]
# TRAIN: [0 1 2 3 4 5] TEST: [6 7 8]

My raw data was arranged by label, so with this default behavior I was trying to predict a lot of labels I hadn't seen in the training data. This is even more pronounced if I force the use of KFold (since I was doing classification, StratifiedKFold was the default):

ms.cross_val_score(sim, data.text, data.label, cv = ms.KFold())
# array([ 0.05530776,  0.05709188,  0.025     ])
ms.cross_val_score(sim, data.text, data.label, cv = ms.StratifiedKFold(shuffle = False))
# array([ 0.2978355 ,  0.35924933,  0.27205882])
ms.cross_val_score(sim, data.text, data.label, cv = ms.KFold(shuffle = True))
# array([ 0.51561106,  0.50579839,  0.51785714])
ms.cross_val_score(sim, data.text, data.label, cv = ms.StratifiedKFold(shuffle = True))
# array([ 0.52869565,  0.54423592,  0.55626715])

Doing things by hand was giving me higher scores because train_test_split was doing the same thing as KFold(shuffle = True).
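
For illustration, here is a minimal sketch (reusing sim and data from above; the random_state value is arbitrary) of how fixing the seed on a shuffled splitter makes the cross-validated scores directly comparable to a manual shuffled hold-out. The numbers won't match exactly, since three shuffled folds and a single random 1/3 split are not the same partition, but they should land in the same range:

import sklearn.model_selection as ms

# Shuffled 3-fold CV with a fixed seed: each test fold is a random 1/3 of the data.
cv = ms.KFold(n_splits=3, shuffle=True, random_state=0)
print(ms.cross_val_score(sim, data.text, data.label, cv=cv))

# Manual split with a fixed seed for comparison: one random 1/3 hold-out.
train, test = ms.train_test_split(data, test_size=0.33, random_state=0)
sim.fit(train.text, train.label)
print(sim.score(test.text, test.label))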

  • Also, by setting the `random_state` parameter on the `KFold` you pass to `cross_val_score` and on your manual `train_test_split`, you can make sure that the exact same splits (and hence results) are reproduced. – Vivek Kumar Apr 29 '17 at 02:50
  • Note that if shuffling cases gives significantly higher scores than not shuffling, and the data set is not representative of the entire population where the classifier will be used for prediction, such higher scores may in fact indicate overfitting and be misleading. The classifier is then tested on cases very similar to those in the training set, so the train-test split does not serve its purpose and generalization may be poor. – dolphin May 16 '20 at 22:22