
I'm using sklearn to classify documents, but I'm having trouble splitting the sparse matrix produced by TfidfTransformer, which is built from a corpus containing both the train and the test data.

Here is part of my code:

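from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# corpus: list of documents, train data first, then test data
vectorizer = CountVectorizer()
transformer = TfidfTransformer(norm="l2", use_idf=True, smooth_idf=True, sublinear_tf=True)
matrix = transformer.fit_transform(vectorizer.fit_transform(corpus))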

Here corpus is the direct concatenation of the train data and the test data (i.e. I read the train data first and then the test data), and I want to split matrix to obtain x_train and x_test.

train_test_split() cannot be used because it's stochastic: it picks rows at random, whereas I only want to cut the matrix at the boundary between the train rows and the test rows.
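To make the intended split concrete, here is a minimal sketch of one way it might be done, assuming the train documents occupy the first rows of corpus and their count is known. The names train_docs, test_docs and n_train are hypothetical and not part of my real code; the toy documents only stand in for the actual data. It relies on fit_transform returning a SciPy sparse matrix in CSR format, which supports row slicing.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy data standing in for the real train and test documents.
train_docs = ["the cat sat on the mat", "dogs chase cats", "birds sing at dawn"]
test_docs = ["the dog sat", "cats sing"]

# Train rows come first, then test rows, as described in the question.
corpus = train_docs + test_docs
n_train = len(train_docs)  # assumed to be known when the corpus is built

vectorizer = CountVectorizer()
transformer = TfidfTransformer(norm="l2", use_idf=True, smooth_idf=True, sublinear_tf=True)
matrix = transformer.fit_transform(vectorizer.fit_transform(corpus))

# Deterministic split: slice the sparse matrix by row index.
x_train = matrix[:n_train]
x_test = matrix[n_train:]

print(x_train.shape, x_test.shape)
```

Both slices stay sparse and keep the same columns, so they can be passed to a classifier directly. In scikit-learn versions whose train_test_split accepts a shuffle parameter, passing shuffle=False together with an integer test_size should give the same ordered split.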

Thanks in advance.

  • Two questions, 1) why don't you split your dataset file into separate train and test files and 2) can you explain what you mean by "`train_test_split()` cannot be used because it's stochastic [...]"? – tttthomasssss Apr 07 '17 at 18:24
  • 1) I want to get the tf-idf matrix on both train data and test data. 2) I mean `train_test_split()` is good but the data obtained by this method are randomly chosen. – Kipsora Lawrence Apr 08 '17 at 02:42
