
This is my data. I use CountVectorizer and TfidfTransformer together with GaussianNB, but I get an error in this code. Please let me know the correct syntax.

train = [('I love this sandwich.','pos'),
     ('This is an amazing place!', 'pos'),
     ('I feel very good about these beers.', 'pos'),
     ('This is my best work.', 'pos'),
     ('What an awesome view', 'pos'),
     ('I do not like this restaurant', 'neg'),
     ('I am tired of this stuff.', 'neg'),
     ("I can't deal with this.", 'neg'),
     ('He is my sworn enemy!.', 'neg'),
     ('My boss is horrible.', 'neg')
    ]
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

text_train_cv = cv.fit_transform(list(zip(*train))[0])
print(text_train_cv.toarray())

from sklearn.feature_extraction.text import TfidfTransformer
tfidf_trans = TfidfTransformer()

text_train_tfidf = tfidf_trans.fit_transform(text_train_cv)
print(text_train_tfidf.toarray())

from sklearn.naive_bayes import GaussianNB
clf = GaussianNB().fit(text_train_tfidf.toarray(), list(zip(*train))[1])

from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                     ('tfidf', TfidfTransformer()),
                     ('clf', GaussianNB(priors=None))])
text_clf = text_clf.fit(text_train_tfidf.toarray() , list(zip(*train))[1])
print(text_clf)

It gives me this error: AttributeError: 'numpy.ndarray' object has no attribute 'lower'

2 Answers


Convert the TF-IDF matrix to a dense array before fitting:

clf = GaussianNB().fit(text_train_tfidf.toarray() , list(zip(*train))[1])

GaussianNB doesn't support sparse matrices as input for X, but TfidfTransformer returns a sparse matrix by default, hence the error.

toarray() converts the sparse matrix to a dense array. Note that this can increase memory usage considerably, because every zero entry is then stored explicitly.
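To make the sparse-versus-dense point concrete, here is a minimal sketch. It assumes a recent scikit-learn release, where passing sparse data to GaussianNB raises a TypeError, and it uses a shortened copy of the training data; the variable names are only illustrative.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import GaussianNB

# Shortened copy of the training data from the question (illustrative only).
texts = ['I love this sandwich.', 'This is an amazing place!',
         'I do not like this restaurant', 'My boss is horrible.']
labels = ['pos', 'pos', 'neg', 'neg']

counts = CountVectorizer().fit_transform(texts)
tfidf = TfidfTransformer().fit_transform(counts)

# The transformer output is a SciPy sparse matrix: only non-zeros are stored.
print(issparse(tfidf))
print(tfidf.nnz, 'stored values out of', tfidf.shape[0] * tfidf.shape[1])

# GaussianNB refuses sparse input (TypeError in recent scikit-learn versions).
try:
    GaussianNB().fit(tfidf, labels)
    sparse_rejected = False
except TypeError as exc:
    sparse_rejected = True
    print('sparse input rejected:', exc)

# Converting to a dense array makes the fit succeed.
dense = tfidf.toarray()
clf = GaussianNB().fit(dense, labels)
print(clf.predict(dense))
```

The printed ratio of stored values to total cells gives a rough idea of how much larger the dense array is than the sparse one.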

Update:

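When using a pipeline, pass it the same raw input you gave the first transformer, not the already-transformed features. The pipeline's CountVectorizer expects strings and calls lower() on each document, so passing a NumPy array raises the AttributeError. In this case the raw input is list(zip(*train))[0].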

text_clf = text_clf.fit(list(zip(*train))[0] , list(zip(*train))[1])

That solves the first error, but you will still get an error because GaussianNB receives a sparse matrix from the pipeline. See this answer for a fix: https://stackoverflow.com/a/28384887/3374996
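One possible way to handle the sparse matrix inside the pipeline is sketched below. It is not necessarily what the linked answer does: it assumes scikit-learn's FunctionTransformer is available and uses a hypothetical helper, to_dense, as a densifying step between TfidfTransformer and GaussianNB. The data is a shortened copy of the question's training set.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def to_dense(X):
    """Hypothetical helper: convert a SciPy sparse matrix to a dense array."""
    return X.toarray()


# Shortened copy of the training data from the question (illustrative only).
texts = ['I love this sandwich.', 'This is an amazing place!',
         'I do not like this restaurant', 'My boss is horrible.']
labels = ['pos', 'pos', 'neg', 'neg']

text_clf = Pipeline([
    ('vect', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
    # Densify the TF-IDF output so GaussianNB accepts it.
    ('to_dense', FunctionTransformer(to_dense, accept_sparse=True)),
    ('clf', GaussianNB()),
])

# The pipeline is fitted on the raw strings, not on pre-transformed features.
text_clf.fit(texts, labels)
predictions = text_clf.predict(texts)
print(predictions)
```

The same memory caveat applies: the dense step materializes the full matrix, so this is only reasonable for small corpora.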

Vivek Kumar
  • @SikandarAli, if it helped, please accept the answer. – Vivek Kumar Apr 16 '18 at 10:32
  • Sir, if I use a pipeline, it gives me the error [AttributeError: 'numpy.ndarray' object has no attribute 'lower']. Please help me with it. – Sikandar Ali Apr 16 '18 at 10:36
  • @SikandarAli please edit the question to add the code of the pipeline. – Vivek Kumar Apr 16 '18 at 10:39
  • Sir, let me know the syntax. – Sikandar Ali Apr 16 '18 at 10:46
  • Sir, I already applied that syntax. It needs dense data. So should I use a dense transformer along with the others to make it work? – Sikandar Ali Apr 16 '18 at 10:58
  • @SikandarAli Yes. You need to put that between TfidfTransformer and GaussianNB. Or else you can use a different estimator in place of GaussianNB, like MultinomialNB, as MaxU suggested in the other answer here. In that case you will not need to add the extra transformer. – Vivek Kumar Apr 16 '18 at 11:00
  • Sir, thank you :) I already used MultinomialNB and got the result, but I also want to get results from other classifiers, so I tried that. – Sikandar Ali Apr 16 '18 at 11:03

MultinomialNB is used very often for text classification tasks, and it supports sparse matrices as input.

PS: if you use dense matrices for bigger corpora, you might end up with a MemoryError.

So try this:

from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB().fit(text_train_tfidf , list(zip(*train))[1])
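The same idea can be written as a pipeline, so the raw strings go in and no dense conversion is needed. This is a minimal sketch with a shortened copy of the training data; the step names are arbitrary.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Shortened copy of the training data from the question (illustrative only).
texts = ['I love this sandwich.', 'This is an amazing place!',
         'I do not like this restaurant', 'My boss is horrible.']
labels = ['pos', 'pos', 'neg', 'neg']

text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    # MultinomialNB accepts the sparse TF-IDF matrix directly.
    ('clf', MultinomialNB()),
])

text_clf.fit(texts, labels)
predictions = text_clf.predict(texts)
print(predictions)
```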
MaxU - stand with Ukraine