2

I am implementing different classifiers using different machine learning algorithms.

I'm sorting text files, and do as follows:

classifier = Pipeline([
('vectorizer', CountVectorizer ()),
('TFIDF', TfidfTransformer ()),
('clf', OneVsRestClassifier (GaussianNB()))])
classifier.fit(X_train,Y)
predicted = classifier.predict(X_test)

When I use the algorithm GaussianNB the following error occurs:

TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray () to convert to a dense numpy array.

I saw the following post here

In this post a class is created to perform the transformation of the data. It is possible to adapt my code with TfidfTransformer. How I can fix this?

Community
  • 1
  • 1
Blunt
  • 529
  • 1
  • 8
  • 14

1 Answers1

3

You can do the following:

class DenseTransformer(TransformerMixin):
    def transform(self, X, y=None, **fit_params):
        return X.todense()

    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y, **fit_params)
        return self.transform(X)

    def fit(self, X, y=None, **fit_params):
        return self

classifier = Pipeline([
('vectorizer', CountVectorizer ()),
('TFIDF', TfidfTransformer ()),
('to_dense', DenseTransformer()), 
('clf', OneVsRestClassifier (GaussianNB()))])
classifier.fit(X_train,Y)
predicted = classifier.predict(X_test)

Now, as a part of your pipeline, the data will be transform to dense representation.

BTW, I don't know your constraints, but maybe you can use another classifier, such as RandomForestClassifier or SVM that DO accept data in sparse representation.

AvidLearner
  • 4,123
  • 5
  • 35
  • 48
  • Thank you very much. Yes. I've tried other as SVM algorithm and RandomForest and have accepted a parse representation – Blunt Jul 05 '15 at 09:44