I'm using sklearn to classify documents, but I ran into trouble splitting the sparse matrix produced by TfidfTransformer, which is built from a corpus containing both the train and the test data.
Here is part of my code:
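from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# corpus: list of document strings, train documents first, then test documents
vectorizer = CountVectorizer()
transformer = TfidfTransformer(norm="l2", use_idf=True, smooth_idf=True, sublinear_tf=True)
matrix = transformer.fit_transform(vectorizer.fit_transform(corpus))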
Here corpus is the direct concatenation of the train data and the test data (i.e. the train data is read first, then the test data), and I want to split matrix to obtain x_train and x_test.
train_test_split() cannot be used because it splits randomly, but I only want to split the matrix while keeping the rows in their original order.
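To make the goal concrete, here is a minimal sketch of the kind of deterministic split I mean. It assumes the number of train documents is known and that the matrix rows follow the order of corpus. The toy documents and the name n_train are placeholders for illustration, not part of my real code, and row slicing is only one way this might be done.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy data standing in for the real train and test documents.
train_docs = ["the cat sat", "the dog barked", "a cat and a dog"]
test_docs = ["the bird sang", "a dog sat"]
corpus = train_docs + test_docs  # train documents first, then test documents

vectorizer = CountVectorizer()
transformer = TfidfTransformer(norm="l2", use_idf=True, smooth_idf=True, sublinear_tf=True)
matrix = transformer.fit_transform(vectorizer.fit_transform(corpus))

# Assumption: row i of matrix corresponds to corpus[i], so the first
# n_train rows are the train documents and the remaining rows are the test documents.
n_train = len(train_docs)
x_train = matrix[:n_train]   # row slice of the sparse matrix, no shuffling
x_test = matrix[n_train:]

print(x_train.shape, x_test.shape)
```

Both pieces keep the same columns, so they share one vocabulary and one set of IDF weights.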
Thanks in advance.