I built an NLP classifier based on Naive Bayes using Python's scikit-learn.
The point is that I want my classifier to classify a new text that does not belong to my training or testing data set.
In another model, like regression, I can extract the theta values so that I can predict any new value.
However, I know that Naive Bayes works by calculating the probability of each word against every class.
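To show what I mean by extracting theta in regression, here is a minimal self-contained sketch with made-up toy data (the numbers are just for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data following y = 2*x + 1 exactly (hypothetical example)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2 * X.ravel() + 1

reg = LinearRegression().fit(X, y)
theta = reg.coef_           # fitted slope(s)
intercept = reg.intercept_  # fitted intercept

# With theta in hand I can predict any brand-new value by hand:
y_new = theta[0] * 10 + intercept
print(theta, intercept, y_new)
```

This is exactly the kind of "take the fitted parameters and apply them to unseen input" step I want for my text classifier.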
For example, my data set includes 1000 records of short texts such as "it was so good", "I like it", "I don't like this movie", etc., and each text is labeled as either +ev or -ev.
I split my data set into training and testing sets, and everything is OK.
Now I want to classify a brand new text like "Oh, I like this movie and the sound track was perfect".
How do I make my model predict this text?
Here is the code:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

# bag-of-words features, limited to the 850 most frequent terms
cv = CountVectorizer(max_features=850)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

classifier = GaussianNB()
classifier.fit(X_train, y_train)

# predict() needs the test feature matrix to score
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
Now I expect to feed in new texts like "good movie and nice sound track" and "acting was so bad" and have my classifier predict whether each one is good or bad:
Xnew = [["good movie and nice sound track"], ["acting was so bad"]]
ynew = classifier.predict(Xnew)
but I get a huge error:
jointi = np.log(self.class_prior_[i])
436 n_ij = - 0.5 * np.sum(np.log(2. * np.pi * self.sigma_[i, :]))
--> 437 n_ij -= 0.5 * np.sum(((X - self.theta_[i, :]) ** 2) /
438 (self.sigma_[i, :]), 1)
439 joint_log_likelihood.append(jointi + n_ij)
TypeError: ufunc 'subtract' did not contain a loop with signature matching types dtype('<U32') dtype('<U32') dtype('<U32')
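My guess is that the new text first has to go through the same fitted CountVectorizer (with transform, not fit_transform) before predict will accept it. Here is a minimal self-contained sketch of that idea, with a tiny toy corpus standing in for my real data set:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

# Toy corpus and labels standing in for my real 1000-record data set
corpus = ["it was so good", "i like it", "i don't like this movie", "so bad"]
y = [1, 1, 0, 0]  # 1 = +ev, 0 = -ev

# Fit the vectorizer on the training corpus only
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()

classifier = GaussianNB()
classifier.fit(X, y)

# New text must be transformed with the SAME fitted vectorizer so it maps
# onto the same vocabulary columns; transform() takes an iterable of strings
new_texts = ["good movie and nice sound track", "acting was so bad"]
Xnew = cv.transform(new_texts).toarray()
ynew = classifier.predict(Xnew)
print(ynew)
```

Is this the right way to do it, or am I missing something about how the vectorizer and the classifier fit together?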
I also wonder if I can get the probability of each word in my bag-of-words vocabulary.
Thanks in advance.