I'm working on a text classification problem, which I've set up like so (I've left out the data processing steps for concision, but they produce a dataframe called data with columns X and y):
import sklearn.model_selection as ms
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

sim = Pipeline([("vec", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
                ("rdf", RandomForestClassifier())])
Now I try to validate this model by training it on 2/3 of the data and scoring it on the remaining 1/3, like so:
train, test = ms.train_test_split(data, test_size=0.33)
sim.fit(train.X, train.y)
sim.score(test.X, test.y)
# 0.533333333333
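To make the goal concrete, here is a sketch of the repeated manual split I have in mind; the toy dataframe and the random_state seeds are placeholders, not my real data:

```python
import pandas as pd
import sklearn.model_selection as ms
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the real dataframe: two text classes.
data = pd.DataFrame({
    "X": ["good movie", "great film", "bad plot", "awful acting"] * 10,
    "y": [1, 1, 0, 0] * 10,
})

sim = Pipeline([("vec", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
                ("rdf", RandomForestClassifier())])

# Repeat the manual split/fit/score three times with fresh random splits.
scores = []
for seed in range(3):
    train, test = ms.train_test_split(data, test_size=0.33, random_state=seed)
    sim.fit(train.X, train.y)
    scores.append(sim.score(test.X, test.y))
print(scores)
```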
I want to do this three times for three different test sets, but using cross_val_score
gives me results that are much lower.
ms.cross_val_score(sim, data.X, data.y)
# [ 0.29264069 0.36729223 0.22977941]
As far as I know, each of the scores in that array should be produced by training on 2/3 of the data and scoring on the remaining 1/3 with the sim.score
method. So why are they all so much lower?
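As a sanity check on my assumption about the fold sizes, the splitter can be inspected directly. This is a sketch with toy labels; I believe the default for a classifier is StratifiedKFold without shuffling, though the default number of folds depends on the scikit-learn version, so I construct it explicitly here:

```python
import numpy as np
import sklearn.model_selection as ms

# Toy labels, ordered by class, standing in for data.y.
y = np.array([0] * 6 + [1] * 6)

# Explicit 3-fold stratified splitter (no shuffling), as I understand
# cross_val_score to use by default for a classifier.
cv = ms.StratifiedKFold(n_splits=3)
folds = list(cv.split(np.zeros_like(y), y))
for train_idx, test_idx in folds:
    # Each fold trains on 2/3 of the rows and tests on the remaining 1/3.
    print(train_idx.size, test_idx.size)
```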