I am trying to apply a new pre-processing algorithm to my dataset, following this answer: Encoding text in ML classifier
Here is what I have tried so far:
def test_tfidf(data, ngrams = 1):
    df_temp = data.copy(deep = True)
    df_temp = basic_preprocessing(df_temp)
    tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, ngrams))
    tfidf_vectorizer.fit(df_temp['Text'])
    list_corpus = df_temp["Text"].tolist()
    list_labels = df_temp["Label"].tolist()
    X = tfidf_vectorizer.transform(list_corpus)
    return X, list_labels
(Please refer to the link above for the full code.) When I apply these two functions to my dataset:
train_x, train_y, count_vectorizer = tfidf(undersample_train, ngrams = 1)
testing_set = pd.concat([X_test, y_test], axis=1)
test_x, test_y = test_tfidf(testing_set, ngrams = 1)
full_result = full_result.append(training_naive(train_x, test_x, train_y, test_y), ignore_index = True)
I get this error:
---> 12 full_result = full_result.append(training_naive(train_x, test_x, train_y, test_y, ), ignore_index = True)
---> 14 y_pred = clf.predict(X_test_naive)
ValueError: dimension mismatch
The function mentioned in the error is:
def training_naive(X_train_naive, X_test_naive, y_train_naive, y_test_naive, preproc):
    clf = MultinomialNB()
    clf.fit(X_train_naive, y_train_naive)
    y_pred = clf.predict(X_test_naive)
    return
Any help in understanding what is wrong in my new definition and/or in how I apply tf-idf to my dataset would be appreciated (the relevant parts are in the linked question: Encoding text in ML classifier).
Update: I think this question/answer might also help me figure out the issue: scikit-learn ValueError: dimension mismatch
If I replace test_x, test_y = test_tfidf(testing_set, ngrams = 1)
with test_x, test_y = test_tfidf(undersample_train, ngrams = 1)
no error is raised. However, I do not think this is right, since it evaluates the model on the training data, and I am getting very high values (99% on all statistics).
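To check my understanding of where the mismatch might come from, here is a small dependency-free sketch. It is a toy bag-of-words vectorizer, not scikit-learn's TfidfVectorizer, and the helper names (fit_vocabulary, transform) and the example sentences are made up for illustration. It only shows that the number of columns is determined by whichever text the vocabulary was fitted on:

```python
# Toy stand-in for a vectorizer: NOT scikit-learn's TfidfVectorizer.
def fit_vocabulary(texts):
    """Map each distinct lowercase token to a column index."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def transform(texts, vocabulary):
    """Count tokens per document; tokens outside the vocabulary are ignored."""
    rows = []
    for text in texts:
        row = [0] * len(vocabulary)
        for tok in text.lower().split():
            if tok in vocabulary:
                row[vocabulary[tok]] += 1
        rows.append(row)
    return rows


# Made-up example data, not my real dataset.
train_texts = ["free prize now", "meeting at noon", "win a free prize"]
test_texts = ["lunch at noon", "claim your prize"]

train_vocab = fit_vocabulary(train_texts)
train_rows = transform(train_texts, train_vocab)

# What my test_tfidf does: fit a second vocabulary on the test text alone.
refit_rows = transform(test_texts, fit_vocabulary(test_texts))

# Alternative: reuse the vocabulary learned from the training text.
reused_rows = transform(test_texts, train_vocab)

# Column counts: training, refitted test, test with reused vocabulary.
print(len(train_rows[0]), len(refit_rows[0]), len(reused_rows[0]))
```

In this toy case the refitted test matrix has a different number of columns than the training matrix, while reusing the training vocabulary keeps them equal. If the same applies to my code, test_tfidf should probably take the vectorizer returned by tfidf (count_vectorizer above) and only call transform on the test text instead of fitting a new one. That would also explain why passing undersample_train raises no error: the refitted vocabulary is then identical to the training one.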