
I built an NLP classifier based on Naive Bayes using Python's scikit-learn.

The point is that I want my classifier to classify a new text that does not belong to my training or testing data set.

In another model, like regression, I can extract the theta values so that I can predict any new value.

However, I know that Naive Bayes works by calculating the probability of each word against every class.
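Here is a toy sketch of what I mean (not my real data, just an illustration): after fitting, scikit-learn models expose their learned parameters as attributes, such as coef_ and intercept_ for regression, or class_prior_ and theta_ for GaussianNB.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.naive_bayes import GaussianNB

# toy regression: the learned "theta" values are coef_ and intercept_
Xr = np.array([[1.0], [2.0], [3.0]])
yr = np.array([2.0, 4.0, 6.0])
reg = LinearRegression().fit(Xr, yr)
print(reg.coef_, reg.intercept_)

# toy naive Bayes: per-class priors and per-class feature means
Xc = np.array([[0.0], [1.0], [10.0], [11.0]])
yc = np.array([0, 0, 1, 1])
nb = GaussianNB().fit(Xc, yc)
print(nb.class_prior_, nb.theta_)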

For example:

my data set includes 1000 records of short texts such as "it was so good", "I like it", "I don't like this movie", etc., and each text is labelled as either positive (+ve) or negative (-ve).

I split my data set into training and testing sets, and everything works fine.

Now I want to classify a brand new text like "Oh, I like this movie and the sound track was perfect".

How do I make my model predict the class of this text?

Here is the code:

from sklearn.feature_extraction.text import CountVectorizer
# bag-of-words features from the preprocessed corpus, labels from the dataset
cv = CountVectorizer(max_features=850)

X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 10)

from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

# predict on the held-out test set
y_pred = classifier.predict(X_test)

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

Now I expect to pass in some new text like "good movie and nice sound track" and "acting was so bad" and let my classifier predict whether it is good or bad:

Xnew = [["good movie and nice sound track"], ["acting was so bad"]]
ynew = classifier.predict(Xnew)

but I get this error:

 jointi = np.log(self.class_prior_[i])
    436             n_ij = - 0.5 * np.sum(np.log(2. * np.pi * self.sigma_[i, :]))
--> 437             n_ij -= 0.5 * np.sum(((X - self.theta_[i, :]) ** 2) /
    438                                  (self.sigma_[i, :]), 1)
    439             joint_log_likelihood.append(jointi + n_ij)

TypeError: ufunc 'subtract' did not contain a loop with signature matching types dtype('<U32') dtype('<U32') dtype('<U32')

I also wonder if I can get the probability of each word in my bag-of-words corpus.

Thanks in advance.

1 Answer


You have to vectorize your comments before passing them to the model.

docs_new = ["good movie and nice sound track", "acting was so bad"]
# transform (not fit_transform) with the already-fitted vectorizer,
# and convert the sparse output to dense for GaussianNB
X_new_counts = cv.transform(docs_new).toarray()
classifier.predict(X_new_counts)
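The .toarray() call is needed here because your GaussianNB was fitted on a dense array, while cv.transform returns a sparse matrix, which GaussianNB will reject.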

To get probability scores for each class:

classifier.predict_proba(X_new_counts)
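If you also want the per-word probabilities you asked about, here is a minimal sketch; it swaps in MultinomialNB (an assumption on my part, not your original model), which is the usual naive Bayes variant for word counts and exposes log P(word | class) directly, whereas GaussianNB only stores per-class means and variances of the counts.

import numpy as np
from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB()
mnb.fit(X_train, y_train)

# map column indices back to vocabulary terms
words = cv.get_feature_names_out()   # cv.get_feature_names() on older scikit-learn
for class_index, label in enumerate(mnb.classes_):
    log_probs = mnb.feature_log_prob_[class_index]
    top = np.argsort(log_probs)[-10:][::-1]   # ten most probable words for this class
    print(label, [words[i] for i in top])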

Alternatively, you can use sklearn's Pipeline to combine the vectorization and classification steps, as sketched below.
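A minimal sketch of that idea, reusing the corpus and y variables from your question and swapping in MultinomialNB, which accepts the sparse output of CountVectorizer directly (GaussianNB would need an extra densifying step):

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# the pipeline vectorizes raw text and classifies it in one step
text_clf = make_pipeline(CountVectorizer(max_features=850), MultinomialNB())
text_clf.fit(corpus, y)   # corpus and y are the raw texts and labels from the question
text_clf.predict(["good movie and nice sound track", "acting was so bad"])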

Prasanth Regupathy