Scikit-learn's Pipeline: Error with multilabel classification. A sparse matrix was passed

Question

I am implementing different classifiers using different machine learning algorithms.

I'm classifying text files as follows:

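from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('TFIDF', TfidfTransformer()),
    ('clf', OneVsRestClassifier(GaussianNB()))])
classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)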

When I use GaussianNB as the classifier, the following error occurs:

TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

I saw the following post here

In that post, a class is created to convert the data. Is it possible to adapt my code, which uses TfidfTransformer, in the same way? How can I fix this?

AvidLearner · Accepted Answer · 2015-07-05T08:52:34.163

You can do the following:

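from sklearn.base import BaseEstimator, TransformerMixin

class DenseTransformer(BaseEstimator, TransformerMixin):
    """Convert a SciPy sparse matrix to a dense NumPy array."""

    def fit(self, X, y=None, **fit_params):
        # Nothing to learn; this only satisfies the Pipeline interface.
        return self

    def transform(self, X, y=None, **fit_params):
        # toarray() returns an ndarray. todense() returns an np.matrix,
        # which newer scikit-learn versions may reject.
        return X.toarray()

    # fit_transform is inherited from TransformerMixin.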

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('TFIDF', TfidfTransformer()),
    ('to_dense', DenseTransformer()),
    ('clf', OneVsRestClassifier(GaussianNB()))])
classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)

Now, as part of your pipeline, the data is converted to a dense representation before it reaches the classifier.
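
If you would rather not write a class, scikit-learn's FunctionTransformer can probably do the same job by wrapping a plain function as a pipeline step. This is only a sketch under assumptions: the helper name to_dense, the toy documents, and the labels are all made up for illustration, and I am assuming your multilabel targets are binarized with MultiLabelBinarizer, which your question does not show.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MultiLabelBinarizer


def to_dense(X):
    # Hypothetical helper: SciPy sparse matrix -> dense NumPy array.
    return X.toarray()


# Toy data, invented for illustration only.
X_train = [
    "cats and dogs are pets",
    "dogs bark at cats",
    "stocks and bonds fell",
    "markets rallied as stocks rose",
    "pet food stocks rose",
]
labels = [["animals"], ["animals"], ["finance"], ["finance"],
          ["animals", "finance"]]

# Multilabel targets as a binary indicator matrix, one column per label.
Y = MultiLabelBinarizer().fit_transform(labels)

classifier = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    # Stateless densifying step; a named function keeps the pipeline picklable.
    ("to_dense", FunctionTransformer(to_dense, accept_sparse=True)),
    ("clf", OneVsRestClassifier(GaussianNB())),
])
classifier.fit(X_train, Y)
predicted = classifier.predict(["dogs and cats"])
print(predicted.shape)  # one row per document, one column per label
```

Either way, keep in mind that a dense TF-IDF matrix stores every zero explicitly, so memory use grows with documents times vocabulary size.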

BTW, I don't know your constraints, but maybe you can use another classifier, such as RandomForestClassifier or an SVM, that does accept data in a sparse representation.
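
As a rough sketch of that option, here is the same kind of pipeline with LinearSVC (one SVM variant that takes sparse input) swapped in, so no densifying step is needed. The toy documents, labels, and the name sparse_classifier are invented for illustration, and the use of MultiLabelBinarizer for the targets is my assumption, not something from your question.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy data, invented for illustration only.
X_train = [
    "cats and dogs are pets",
    "dogs bark at cats",
    "stocks and bonds fell",
    "markets rallied as stocks rose",
    "pet food stocks rose",
]
labels = [["animals"], ["animals"], ["finance"], ["finance"],
          ["animals", "finance"]]

# Multilabel targets as a binary indicator matrix, one column per label.
Y = MultiLabelBinarizer().fit_transform(labels)

# The TF-IDF output stays sparse all the way into the classifier.
sparse_classifier = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", OneVsRestClassifier(LinearSVC())),
])
sparse_classifier.fit(X_train, Y)
predicted = sparse_classifier.predict(["stocks fell"])
print(predicted.shape)  # one row per document, one column per label
```

For large vocabularies this tends to be the cheaper route, since the feature matrix is never expanded to a dense array.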

Thank you very much. Yes, I've tried other algorithms, such as SVM and RandomForest, and they accepted a sparse representation. — Blunt, Jul 05 '15 at 09:44
